[TOOLS] 7 min read · OraCore Editors

llama-benchy brings llama-bench tests to APIs

llama-benchy benchmarks OpenAI-compatible APIs with prompt, token, depth, and concurrency tests, plus TTFR and TTFT metrics.

llama-benchy is a benchmarking tool for OpenAI-compatible model endpoints that tries to answer a question most teams still hand-wave: how fast is this model when context gets longer, requests pile up, and the server starts caching? The project currently has 451 stars, 42 forks, and 96 commits, and its README makes a clear claim that it measures performance in a way closer to real API usage than engine-only tests.

That matters because model speed is rarely one number. Prompt processing, token generation, time to first response, and concurrency all change depending on the backend, the prompt shape, and whether the server reuses cache. llama-benchy tries to put those pieces into one CLI.

| Metric | What it measures | Example from README |
|---|---|---|
| pp | Prompt processing speed | 2048-token prompt at depths from 0 to 32768 |
| tg | Token generation speed | 32 generated tokens in the sample run |
| depth | Context length under test | 0, 4096, 8192, 16384, 32768 |
| concurrency | Parallel request load | Configurable with --concurrency |
| runs | Repeated trials per test | Default is 3 |

Why this tool exists

The README opens with a complaint that will feel familiar to anyone who has tried to compare model servers across stacks. llama.cpp has llama-bench, but that benchmark only works inside the llama.cpp world. If you are running vLLM, SGLang, or another OpenAI-compatible server, you need a different way to compare them.

The author also calls out a practical problem with existing benchmarking flows: they can hide cache effects, misread the first response chunk as the first usable token, or make it awkward to test prompt processing at different context lengths. That is a real issue if you care about speculative decoding, multi-token prediction, or the gap between a lab benchmark and an actual chat endpoint.

There is a subtle but important design choice here. llama-benchy does not benchmark the inference engine directly. It benchmarks the API layer that users actually hit, which means the numbers include the quirks of request handling, streaming behavior, and server-side caching.

  • Targets /v1/chat/completions-style endpoints
  • Measures prompt processing and token generation separately
  • Uses real text from Project Gutenberg for prompts
  • Can run a coherence check after warmup
  • Exports Markdown, JSON, or CSV
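
The gap between the first response chunk and the first usable token can be made concrete with a small sketch. It assumes a streamed response has already been recorded as (arrival time, text) pairs. `StreamEvent` and `first_chunk_timings` are hypothetical names used for illustration, and the definitions are one reading of the README's terms, not llama-benchy's own code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class StreamEvent:
    """One streamed chunk: arrival time in seconds since the request was sent, plus its text."""
    t: float
    text: str


def first_chunk_timings(events: Iterable[StreamEvent]) -> Tuple[Optional[float], Optional[float]]:
    """Return (ttfr, ttft) in seconds.

    ttfr: arrival time of the first chunk of any kind (it may be empty,
          for example a role-only delta).
    ttft: arrival time of the first chunk that carries non-empty text.
    """
    ttfr = None
    ttft = None
    for ev in events:
        if ttfr is None:
            ttfr = ev.t
        if ev.text:
            ttft = ev.t
            break
    return ttfr, ttft


# Illustrative values only: an empty first chunk followed by real text.
events = [StreamEvent(0.24, ""), StreamEvent(0.34, "Hello"), StreamEvent(0.36, " world")]
print(first_chunk_timings(events))
```

A benchmark that treats the empty first chunk as the first token would report the smaller number here and understate what a user actually waits for.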

What it measures and why that matters

The feature list is more useful than the usual benchmark README because it spells out the exact measurements. llama-benchy reports prompt processing speed, token generation speed, Time To First Response, estimated prompt processing time, and end-to-end TTFT. It also supports configurable prompt length, generation length, context depth, and repeated runs with mean and standard deviation.
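
Reporting a mean and standard deviation over repeated runs is simple to sketch. The sample values below are hypothetical, and the choice of sample standard deviation (rather than population) is an assumption, since the article does not say which one the tool uses.

```python
import statistics
from typing import Sequence, Tuple


def summarize(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) for one metric across runs."""
    mean = statistics.mean(samples)
    # stdev needs at least two samples; report 0.0 for a single run.
    std = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, std


# Hypothetical tokens-per-second readings from three runs of the same test.
tg_runs = [72.0, 73.0, 74.0]
mean, std = summarize(tg_runs)
print(f"tg: {mean:.2f} ± {std:.2f} t/s")
```

With only three runs (the default noted in the table), the spread is a rough signal at best, so a large standard deviation is a cue to rerun with more trials.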

One detail I like is the use of Hugging Face tokenizers for token counts. That matters because token counts can drift across templates and models, and a benchmark that guesses wrong about tokenization can produce neat-looking but misleading numbers. The tool also handles multi-token prediction chunks correctly, which is a sign that the author is thinking about modern serving behavior instead of old-school single-token assumptions.
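
To see why multi-token chunks matter, here is a minimal sketch that computes generation speed from token counts and compares it with a naive count of chunks. `generation_speed` and `toy_count` are hypothetical. `toy_count` stands in for a real tokenizer such as one loaded through Hugging Face, and excluding the first chunk from the rate is a choice made for this example.

```python
from typing import Callable, List, Tuple

Chunk = Tuple[float, str]  # (arrival time in seconds, chunk text)


def generation_speed(chunks: List[Chunk], count_tokens: Callable[[str], int]) -> float:
    """Tokens per second over the decode phase, measured from the first chunk.

    Tokens in the first chunk are excluded because their latency belongs to
    prompt processing, not to steady-state generation.
    """
    if len(chunks) < 2:
        raise ValueError("need at least two chunks to measure a rate")
    elapsed = chunks[-1][0] - chunks[0][0]
    if elapsed <= 0:
        raise ValueError("chunk timestamps must increase")
    tokens = sum(count_tokens(text) for _, text in chunks[1:])
    return tokens / elapsed


def toy_count(text: str) -> int:
    # Stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


# Three chunks, the later two carrying two tokens each.
chunks = [(0.0, "The"), (0.5, " quick brown"), (1.0, " fox jumps")]
by_tokens = generation_speed(chunks, toy_count)
by_chunks = (len(chunks) - 1) / (chunks[-1][0] - chunks[0][0])
print(by_tokens, by_chunks)
```

In this toy stream, counting chunks reports half the rate that counting tokens does, which is the kind of error a server using multi-token prediction would trigger in a chunk-counting benchmark.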

The README gives a concrete example with openai/gpt-oss-120b, a base URL of http://spark:8888/v1, and depths from 0 to 32768. In that sample, prompt processing speed peaks at depth 4096 and then declines as depth grows: 8521.08 t/s at depth 0, 9450.36 t/s at 4096, 8481.42 t/s at 8192, 7954.96 t/s at 16384, and 6896.57 t/s at 32768.

“It is widely used in LLM community to benchmark models and allows to perform measurement at different context sizes.”

— eugr, llama-benchy README

That quote, which describes llama-bench, matters because it explains the project’s scope in plain language. The goal is not to invent a new benchmark philosophy. The goal is to make llama-bench-style measurements available to any OpenAI-compatible endpoint, which is a much narrower and more useful promise.

How the numbers compare in the sample run

The sample output shows why depth-aware testing is useful. The same model, same prompt size, and same generation length can produce very different latency numbers as context grows. At depth 0, the README shows TTFR at 240.36 ms and end-to-end TTFT at 340.65 ms. By depth 32768, those numbers rise to 5048.31 ms and 5153.34 ms. That is the difference between a snappy chat experience and a slow one that feels stuck before the first token appears.

Token generation speed also shifts, though less dramatically. In the same example, tg32 goes from 73.18 t/s at depth 0 to 65.80 t/s at depth 32768. That is a useful reminder that long context does not only hurt prefill. It can also drag on generation, depending on the backend and serving path.

  • Depth 0: 8521.08 t/s prompt processing, 73.18 t/s generation
  • Depth 4096: 9450.36 t/s prompt processing, 72.22 t/s generation
  • Depth 8192: 8481.42 t/s prompt processing, 71.78 t/s generation
  • Depth 16384: 7954.96 t/s prompt processing, 70.48 t/s generation
  • Depth 32768: 6896.57 t/s prompt processing, 65.80 t/s generation
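
The quoted figures can be cross-checked with a little arithmetic. The sketch below assumes the prompt processing rate covers depth plus prompt tokens; that is an inference from the numbers in this article, not a statement of how llama-benchy computes it. The helper names are hypothetical.

```python
# Figures quoted above from the README sample run.
PROMPT_TOKENS = 2048
pp_speed = {0: 8521.08, 4096: 9450.36, 8192: 8481.42, 16384: 7954.96, 32768: 6896.57}
tg_speed = {0: 73.18, 4096: 72.22, 8192: 71.78, 16384: 70.48, 32768: 65.80}
ttfr_ms = {0: 240.36, 32768: 5048.31}  # only these two are quoted here


def implied_prefill_ms(depth: int) -> float:
    """Time to process depth + prompt tokens at the reported pp rate (assumption)."""
    return (depth + PROMPT_TOKENS) / pp_speed[depth] * 1000.0


def percent_drop(start: float, end: float) -> float:
    """Relative decrease from start to end, in percent."""
    return (start - end) / start * 100.0


for depth, reported in ttfr_ms.items():
    print(f"depth {depth}: implied {implied_prefill_ms(depth):.2f} ms, reported TTFR {reported:.2f} ms")

print(f"tg drop, depth 0 to 32768: {percent_drop(tg_speed[0], tg_speed[32768]):.1f}%")
print(f"pp drop, peak to 32768: {percent_drop(max(pp_speed.values()), pp_speed[32768]):.1f}%")
```

Under that assumption the implied prefill time lands within a millisecond of the quoted TTFR at both depths, which suggests the latency growth in this sample is almost entirely prompt processing. Generation speed falls by roughly 10% across the range, while prompt processing falls by roughly 27% from its peak at depth 4096.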

llama-benchy also tries to deal with the messiness of real servers. It can add noise to avoid cache hits, run a post-test command to clear state, and measure concurrency with multiple simultaneous requests. That makes it more useful for teams comparing throughput under load, especially when one backend behaves well in isolation but falls apart once traffic increases.
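
As a rough illustration of cache-busting noise combined with concurrent load, the sketch below prepends a unique marker to each prompt and fires several requests at once with asyncio. The article does not describe how llama-benchy injects noise, so the prefix approach is an assumption, and `with_noise`, `run_concurrent`, and `fake_send` are hypothetical names. `fake_send` only sleeps; a real test would make an HTTP call to the endpoint.

```python
import asyncio
import time
import uuid
from typing import Awaitable, Callable, List, Tuple


def with_noise(prompt: str) -> str:
    """Prepend a unique marker so no two prompts share a cacheable prefix."""
    return f"[{uuid.uuid4().hex}] {prompt}"


async def run_concurrent(
    send: Callable[[str], Awaitable[float]],
    prompt: str,
    concurrency: int,
) -> Tuple[List[float], float]:
    """Fire `concurrency` requests at once; return per-request latencies and wall time."""
    start = time.perf_counter()
    latencies = await asyncio.gather(*(send(with_noise(prompt)) for _ in range(concurrency)))
    wall = time.perf_counter() - start
    return list(latencies), wall


async def fake_send(prompt: str) -> float:
    # Placeholder for a real HTTP call to the endpoint under test.
    t0 = time.perf_counter()
    await asyncio.sleep(0.01)
    return time.perf_counter() - t0


latencies, wall = asyncio.run(run_concurrent(fake_send, "Summarize this chapter.", 4))
print(len(latencies), f"{wall:.3f}s")
```

Placing the marker at the start of the prompt matters for this approach: servers that cache by shared prefix would still reuse work if the unique text only appeared at the end.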

There is also a practical installation story built around uv. You can run it through uvx, install it into a virtual environment, use uv run, or install it system-wide. The README even includes release and main-branch paths, which is handy if you want to test the latest commit instead of waiting for a package release.

What to watch next

The current limitation is simple: llama-benchy only evaluates /v1/chat/completions. That keeps the scope focused, but it also means the tool does not yet cover every API shape that teams use in production. If the project expands to more endpoint types, it could become even more useful for comparing server behavior across chat, completions, and possibly streaming variants.

For now, the strongest case for llama-benchy is that it measures what operators actually care about: how a model behaves when prompts get long, caches get involved, and concurrency rises. If you run OpenAI-compatible infrastructure, this is the kind of tool that can save you from trusting a single benchmark number that hides the real bottleneck.

My bet is simple: the teams that adopt depth-aware API benchmarks early will spot serving regressions faster than the ones still relying on engine-local tests. The next question is whether more model servers will start publishing results in this format, because once that happens, comparisons get a lot harder to ignore.