[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-llama-benchy-llama-bench-style-api-benchmarks-en":3,"article-related-llama-benchy-llama-bench-style-api-benchmarks-en":30,"series-tools-92a22a3d-6d0c-4884-9865-c1fe0f2e5e78":82},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"92a22a3d-6d0c-4884-9865-c1fe0f2e5e78","llama-benchy-llama-bench-style-api-benchmarks-en","llama-benchy brings llama-bench tests to APIs","\u003Cp data-speakable=\"summary\">llama-benchy benchmarks \u003Ca href=\"\u002Ftag\u002Fopenai\">OpenAI\u003C\u002Fa>-compatible APIs with llama-bench-style depth tests and latency metrics.\u003C\u002Fp>\u003Cp>\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Feugr\u002Fllama-benchy\" target=\"_blank\" rel=\"noopener\">llama-benchy\u003C\u002Fa> is a benchmarking tool for OpenAI-compatible model endpoints that tries to answer a question most teams still hand-wave: how fast is this model when context gets longer, requests pile up, and the server starts caching? The project currently has 451 stars, 42 forks, and 96 commits, and its README makes a clear claim that it measures performance in a way closer to real API usage than engine-only tests.\u003C\u002Fp>\u003Cp>That matters because model speed is rarely one number. Prompt processing, \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> generation, time to first response, and concurrency all change depending on the backend, the prompt shape, and whether the server reuses cache. 
llama-benchy tries to put those pieces into one CLI.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Metric\u003C\u002Fth>\u003Cth>What it measures\u003C\u002Fth>\u003Cth>Example from README\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>pp\u003C\u002Ftd>\u003Ctd>Prompt processing speed\u003C\u002Ftd>\u003Ctd>2048-token prompt at depths from 0 to 32768\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>tg\u003C\u002Ftd>\u003Ctd>Token generation speed\u003C\u002Ftd>\u003Ctd>32 generated tokens in the sample run\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>depth\u003C\u002Ftd>\u003Ctd>Context length under test\u003C\u002Ftd>\u003Ctd>0, 4096, 8192, 16384, 32768\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>concurrency\u003C\u002Ftd>\u003Ctd>Parallel request load\u003C\u002Ftd>\u003Ctd>Configurable with --concurrency\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>runs\u003C\u002Ftd>\u003Ctd>Repeated trials per test\u003C\u002Ftd>\u003Ctd>Default is 3\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>Why this tool exists\u003C\u002Fh2>\u003Cp>The README opens with a complaint that will feel familiar to anyone who has tried to compare model servers across stacks. \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\" target=\"_blank\" rel=\"noopener\">llama.cpp\u003C\u002Fa> has \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002Fblob\u002Fmaster\u002FREADME.md#benchmarking\" target=\"_blank\" rel=\"noopener\">llama-bench\u003C\u002Fa>, but that \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> only works inside the llama.cpp world. 
If you are running \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\" target=\"_blank\" rel=\"noopener\">vLLM\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\" target=\"_blank\" rel=\"noopener\">SGLang\u003C\u002Fa>, or another OpenAI-compatible server, you need a different way to compare them.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780775297695-nchl.png\" alt=\"llama-benchy brings llama-bench tests to APIs\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The author also calls out a practical problem with existing benchmarking flows: they can hide cache effects, misread the first response chunk as the first usable token, or make it awkward to test prompt processing at different context lengths. That is a real issue if you care about speculative decoding, multi-token prediction, or the gap between a lab benchmark and an actual chat endpoint.\u003C\u002Fp>\u003Cp>There is a subtle but important design choice here. llama-benchy does not benchmark the \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa> engine directly. 
It benchmarks the API layer that users actually hit, which means the numbers include the quirks of request handling, streaming behavior, and server-side caching.\u003C\u002Fp>\u003Cul>\u003Cli>Targets \u003Ca href=\"https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fapi-reference\u002Fchat\" target=\"_blank\" rel=\"noopener\">\u002Fv1\u002Fchat\u002Fcompletions\u003C\u002Fa>-style endpoints\u003C\u002Fli>\u003Cli>Measures prompt processing and token generation separately\u003C\u002Fli>\u003Cli>Uses real text from Project Gutenberg for prompts\u003C\u002Fli>\u003Cli>Can run a coherence check after warmup\u003C\u002Fli>\u003Cli>Exports Markdown, JSON, or CSV\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What it measures and why that matters\u003C\u002Fh2>\u003Cp>The feature list is more useful than the usual benchmark README because it spells out the exact measurements. llama-benchy reports prompt processing speed, token generation speed, Time To First Response, estimated prompt processing time, and end-to-end TTFT. It also supports configurable prompt length, generation length, context depth, and repeated runs with mean and standard deviation.\u003C\u002Fp>\u003Cp>One detail I like is the use of Hugging Face tokenizers for token counts. That matters because token counts can drift across templates and models, and a benchmark that guesses wrong about tokenization can produce neat-looking but misleading numbers. The tool also handles multi-token prediction chunks correctly, which is a sign that the author is thinking about modern serving behavior instead of old-school single-token assumptions.\u003C\u002Fp>\u003Cp>The README gives a concrete example with \u003Ca href=\"https:\u002F\u002Fopenai.com\u002Findex\u002Fgpt-oss-120b\u002F\" target=\"_blank\" rel=\"noopener\">openai\u002Fgpt-oss-120b\u003C\u002Fa>, a base URL of \u003Ccode>http:\u002F\u002Fspark:8888\u002Fv1\u003C\u002Fcode>, and depths from 0 to 32768. 
In that sample, prompt processing speed peaks at depth 4096 and then falls as depth rises: 8521.08 t\u002Fs at depth 0, 9450.36 t\u002Fs at 4096, 8481.42 t\u002Fs at 8192, 7954.96 t\u002Fs at 16384, and 6896.57 t\u002Fs at 32768.\u003C\u002Fp>\u003Cblockquote>\u003Cp>“It is widely used in \u003Ca href=\"\u002Ftag\u002Fllm\">LLM\u003C\u002Fa> community to benchmark models and allows to perform measurement at different context sizes.”\u003C\u002Fp>\u003Cfooter>— eugr, \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Feugr\u002Fllama-benchy\" target=\"_blank\" rel=\"noopener\">llama-benchy README\u003C\u002Fa>\u003C\u002Ffooter>\u003C\u002Fblockquote>\u003Cp>That quote matters because it explains the project’s scope in plain language. The goal is not to invent a new benchmark philosophy. The goal is to make llama-bench-style measurements available to any OpenAI-compatible endpoint, which is a much narrower and more useful promise.\u003C\u002Fp>\u003Ch2>How the numbers compare in the sample run\u003C\u002Fh2>\u003Cp>The sample output shows why depth-aware testing is useful. The same model, same prompt size, and same generation length can produce very different latency numbers as context grows. At depth 0, the README shows TTFR at 240.36 ms and end-to-end TTFT at 340.65 ms. By depth 32768, those numbers rise to 5048.31 ms and 5153.34 ms. That is the difference between a snappy chat experience and a slow one that feels stuck before the first token appears.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780775299359-uawh.png\" alt=\"llama-benchy brings llama-bench tests to APIs\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Token generation speed also shifts, though less dramatically. In the same example, tg32 goes from 73.18 t\u002Fs at depth 0 to 65.80 t\u002Fs at depth 32768. 
That is a useful reminder that \u003Ca href=\"\u002Fnews\u002Fwhy-minimax-m3-matters-long-context-model-en\">long context\u003C\u002Fa> does not only hurt prefill. It can also drag on generation, depending on the backend and serving path.\u003C\u002Fp>\u003Cul>\u003Cli>Depth 0: 8521.08 t\u002Fs prompt processing, 73.18 t\u002Fs generation\u003C\u002Fli>\u003Cli>Depth 4096: 9450.36 t\u002Fs prompt processing, 72.22 t\u002Fs generation\u003C\u002Fli>\u003Cli>Depth 8192: 8481.42 t\u002Fs prompt processing, 71.78 t\u002Fs generation\u003C\u002Fli>\u003Cli>Depth 16384: 7954.96 t\u002Fs prompt processing, 70.48 t\u002Fs generation\u003C\u002Fli>\u003Cli>Depth 32768: 6896.57 t\u002Fs prompt processing, 65.80 t\u002Fs generation\u003C\u002Fli>\u003C\u002Ful>\u003Cp>llama-benchy also tries to deal with the messiness of real servers. It can add noise to avoid cache hits, run a post-test command to clear state, and measure concurrency with multiple simultaneous requests. That makes it more useful for teams comparing throughput under load, especially when one backend behaves well in isolation but falls apart once traffic increases.\u003C\u002Fp>\u003Cp>There is also a practical installation story built around \u003Ca href=\"https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F\" target=\"_blank\" rel=\"noopener\">uv\u003C\u002Fa>. You can run it through \u003Ccode>uvx\u003C\u002Fcode>, install it into a virtual environment, use \u003Ccode>uv run\u003C\u002Fcode>, or install it system-wide. The README even includes release and main-branch paths, which is handy if you want to test the latest commit instead of waiting for a package release.\u003C\u002Fp>\u003Ch2>What to watch next\u003C\u002Fh2>\u003Cp>The current limitation is simple: llama-benchy only evaluates \u003Ccode>\u002Fv1\u002Fchat\u002Fcompletions\u003C\u002Fcode>. That keeps the scope focused, but it also means the tool does not yet cover every API shape that teams use in production. 
If the project expands to more endpoint types, it could become even more useful for comparing server behavior across chat, completions, and possibly streaming variants.\u003C\u002Fp>\u003Cp>For now, the strongest case for llama-benchy is that it measures what operators actually care about: how a model behaves when prompts get long, caches get involved, and concurrency rises. If you run OpenAI-compatible infrastructure, this is the kind of tool that can save you from trusting a single benchmark number that hides the real bottleneck.\u003C\u002Fp>\u003Cp>My bet is simple: the teams that adopt depth-aware API benchmarks early will spot serving regressions faster than the ones still relying on engine-local tests. The next question is whether more model servers will start publishing results in this format, because once that happens, comparisons get a lot harder to ignore.\u003C\u002Fp>","llama-benchy benchmarks OpenAI-compatible APIs with prompt, token, depth, and concurrency tests, plus TTFR and TTFT metrics.","github.com","https:\u002F\u002Fgithub.com\u002Feugr\u002Fllama-benchy",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780775297695-nchl.png","tools","en","09c2902c-97a8-433c-94de-874a7f55d2ff",[17,18,19,20,21],"llama-benchy","LLM benchmarking","OpenAI-compatible APIs","prompt processing","TTFT",[23,24,25],"llama-benchy brings llama-bench-style testing to OpenAI-compatible endpoints.","It measures depth, prompt processing, token generation, TTFR, and TTFT.","The sample run shows latency rising sharply as context depth 
increases.",0,"2026-06-06T19:47:54.675055+00:00","2026-06-06T19:47:54.67+00:00","a7343b93-37cc-4634-a2bc-707f6275bdb6",{"tags":31,"relatedLang":41,"relatedPosts":45},[32,34,36,38,40],{"name":19,"slug":33},"openai-compatible-apis",{"name":20,"slug":35},"prompt-processing",{"name":18,"slug":37},"llm-benchmarking",{"name":21,"slug":39},"ttft",{"name":17,"slug":17},{"id":15,"slug":42,"title":43,"language":44},"llama-benchy-api-benchmark-zh","llama-benchy 把 API 也納入基準測試","zh",[46,52,58,64,70,76],{"id":47,"slug":48,"title":49,"cover_image":50,"image_url":50,"created_at":51,"category":13},"4065ada8-125b-4286-85c5-85cfe7d6369a","llm-leaderboard-2026-300-models-ranked-en","LLM Leaderboard 2026: 300+ Models Ranked","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780776189065-qk79.png","2026-06-06T20:02:37.334702+00:00",{"id":53,"slug":54,"title":55,"cover_image":56,"image_url":56,"created_at":57,"category":13},"df69beef-d6a6-40d1-9284-474eebad74b7","how-to-start-vibe-coding-with-ai-en","How to Start Vibe Coding with AI","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780773471455-gav1.png","2026-06-06T19:17:22.823911+00:00",{"id":59,"slug":60,"title":61,"cover_image":62,"image_url":62,"created_at":63,"category":13},"bece181a-96c8-494b-ac0b-fb254413e051","nvidia-ai-models-playbook-en","NVIDIA AI Models turn model hunting into a playbook","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780771718084-8xiy.png","2026-06-06T18:48:07.10885+00:00",{"id":65,"slug":66,"title":67,"cover_image":68,"image_url":68,"created_at":69,"category":13},"40bf1841-77d9-4bf6-9764-3e956510d41a","kimi-k25-claude-code-cline-roocode-setup-en","Kimi K2.5 works in Claude Code and 
Cline","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780769031200-z2kv.png","2026-06-06T18:03:19.685945+00:00",{"id":71,"slug":72,"title":73,"cover_image":74,"image_url":74,"created_at":75,"category":13},"258a698f-2ab5-47bf-9b3b-ec8a8e14b8be","why-small-businesses-should-use-ai-for-admin-en","Why small businesses should use AI for admin, not everything","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780758184970-z888.png","2026-06-06T15:02:18.347592+00:00",{"id":77,"slug":78,"title":79,"cover_image":80,"image_url":80,"created_at":81,"category":13},"7da5424f-1ff8-483a-80ed-7091c5b0454b","crun-ai-gemini-omni-chat-video-editing-en","Crun AI turns Gemini Omni into chat video editing","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780733910991-ji5m.png","2026-06-06T08:18:00.680201+00:00",[83,88,93,98,103,108,113,118,123,128],{"id":84,"slug":85,"title":86,"created_at":87},"8008f1a9-7a00-4bad-88c9-3eedc9c6b4b1","surepath-ai-mcp-policy-controls-en","SurePath AI's New MCP Policy Controls Enhance AI Security","2026-03-26T01:26:52.222015+00:00",{"id":89,"slug":90,"title":91,"created_at":92},"27e39a8f-b65d-4f7b-a875-859e2b210156","mcp-standard-ai-tools-2026-en","MCP Standard in 2026: Integrating AI Tools","2026-03-26T01:27:43.127519+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"165f9a19-c92d-46ba-b3f0-7125f662921d","rag-2026-transforming-enterprise-ai-en","How RAG in 2026 is Transforming Enterprise AI","2026-03-26T01:28:11.485236+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"6a2a8e6e-b956-49d8-be12-cc47bdc132b2","mastering-ai-prompts-2026-guide-en","Mastering AI Prompts: A 2026 Guide for 
Developers","2026-03-26T01:29:07.835148+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"3ab2c67e-4664-4c67-a013-687a2f605814","garry-tan-open-sources-claude-code-toolkit-en","Garry Tan Open-Sources a Claude Code Toolkit","2026-03-26T08:26:20.245934+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"66a7cbf8-7e76-41d4-9bbf-eaca9761bf69","github-ai-projects-to-watch-in-2026-en","20 GitHub AI Projects to Watch in 2026","2026-03-26T08:28:09.752027+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"9f332fda-eace-448a-a292-2283951eee71","practical-github-guide-learning-ml-2026-en","A Practical GitHub Guide to Learning ML in 2026","2026-03-27T01:16:50.125678+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"1b1f637d-0f4d-42bd-974b-07b53829144d","aiml-2026-student-ai-ml-lab-repo-review-en","AIML-2026 Is a Bare-Bones Student Lab Repo","2026-03-27T01:21:51.661231+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"6d1bf3f6-e191-4d30-b55b-8a0722fa6afe","ai-trending-github-repos-and-research-feeds-en","AI Trending Tracks Repos and Research Feeds","2026-03-27T01:31:35.709532+00:00",{"id":129,"slug":130,"title":131,"created_at":132},"010539a1-4c3a-4bd3-937a-26616422ee0d","awesome-ai-for-science-research-tools-map-en","Awesome AI for Science Is Becoming a Real Research Map","2026-03-27T01:46:50.89513+00:00"]