LLM Leaderboard 2026: 300+ Models Ranked

OraCore Editors

[TOOLS] June 7, 2026 · 7 min read

LLM Stats now ranks 309 models by score, speed, and price, with hourly updates from benchmarks and live API measurements.

The new LLM Leaderboard tracks 309 canonical models and updates pricing and performance data on an hourly cadence. It mixes public benchmark results with live API measurements, which makes it more useful than a static “best model” chart.

That matters because the top model for coding is not always the best pick for reasoning, cost, or latency. On this board, the leaders split across several metrics: Claude Opus 4.6 leads coding arena performance, Claude Mythos Preview tops GPQA Diamond, and Gemini 3 Pro posts a perfect AIME 2025 score.

Metric	Leader	Value
Models tracked	LLM Stats leaderboard	309
Best for coding	Claude Opus 4.6	21.3 arena score
Best on GPQA Diamond	Claude Mythos Preview	94.6%
Best on AIME 2025	Gemini 3 Pro	100.0%
Highest throughput	Mercury 2	925 tok/s
Largest context window	Grok 4 Fast	2.0M tokens

What LLM Stats is actually ranking

LLM Stats is trying to answer a very practical question: which model should you pay for today? Instead of relying on a single benchmark, it combines intelligence signals, output speed, latency, and token pricing into one score.

The leaderboard page also exposes the raw ingredients behind that score. You can sort by organization, parameters, hardware, context window, license, modality, price, country, and speed. That is useful because model choice now depends on deployment constraints as much as benchmark bragging rights.

309 canonical models are tracked
Pricing is pulled from public API price lists and checked against billing samples
Live performance uses a 7-day rolling average
Metadata and pricing revalidate every hour

The design is opinionated in a good way. It does not pretend that every model should be judged the same way, and it does not hide the trade-offs between a pricey frontier model and a cheaper open model that can still ship real work.

The numbers tell a more useful story than a single rank

The top rows make the point clearly. Anthropic's Claude Opus 4.6 shows 39 c/s speed, 1M context, and $5.00 per million input tokens with $25.00 per million output tokens. OpenAI's GPT-5.5 pushes 150 c/s at the same $5.00 input price, but its $30.00 output price is the highest of the models cited here.

Then the cheaper options start making a case for themselves. Google's Gemini 3 Flash sits at $0.50 input and $3.00 output per million tokens while still hitting 247 c/s. Qwen's Qwen3.7 Max comes in at $1.25 input and $3.75 output per million tokens, with 120 c/s and a 1M context window.

Claude Opus 4.6: 2,132 score, 39 c/s, 1M context
GPT-5.5: 2,105 score, 150 c/s, 1.1M context
Gemini 3.1 Pro: 2,101 score, 164 c/s, $2.50 input and $15.00 output
Qwen3.7 Max: 1,634 score, 120 c/s, 1M context

If you build with models all day, the pattern is familiar: the fastest model is often not the cheapest, and the best benchmark score rarely arrives with friendly pricing. LLM Stats makes that tension visible without forcing you to guess from vendor marketing pages.
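To make the price gap concrete, here is a minimal sketch that estimates workload cost from the per-million-token prices quoted above. The workload size (2,000 input tokens and 500 output tokens per request, 10,000 requests) and the `workload_cost` helper are illustrative assumptions, not figures or code from LLM Stats.

```python
# Prices in USD per 1M tokens (input, output), as quoted in this article.
PRICES = {
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "Gemini 3 Flash": (0.50, 3.00),
    "Qwen3.7 Max": (1.25, 3.75),
}


def workload_cost(model, input_tokens, output_tokens, requests=1):
    """Estimated USD cost for `requests` calls of the given size."""
    price_in, price_out = PRICES[model]
    per_request = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return per_request * requests


if __name__ == "__main__":
    # Assumed workload: 10,000 requests of 2,000 input + 500 output tokens.
    for name in PRICES:
        cost = workload_cost(name, 2_000, 500, requests=10_000)
        print(f"{name}: ${cost:,.2f}")
```

Under these assumed volumes, Claude Opus 4.6 comes to about $225 and Gemini 3 Flash to about $25, so the output price dominates as soon as responses get long.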

Why the methodology matters

The site says model order is based on coding-arena score when available, then GPQA Diamond. That choice is telling. Coding arenas are a strong proxy for practical agentic work, while GPQA Diamond rewards models that can handle hard knowledge and reasoning tasks instead of giving confident but wrong answers.
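That ordering rule can be sketched as a sort key. This is an illustration of "coding-arena score when available, then GPQA Diamond", not LLM Stats' actual code: the field names and the sample rows with their scores are hypothetical, and placing models that have an arena score ahead of those that do not is an assumption about how "when available" is applied.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    name: str
    arena: Optional[float]  # coding-arena score, None when not available
    gpqa: Optional[float]   # GPQA Diamond accuracy in percent


def order_key(row: Row):
    # Rows with an arena score sort first (False < True), highest score first.
    # GPQA Diamond breaks ties and orders the rows without an arena score.
    has_no_arena = row.arena is None
    arena = row.arena if row.arena is not None else 0.0
    gpqa = row.gpqa if row.gpqa is not None else 0.0
    return (has_no_arena, -arena, -gpqa)


# Hypothetical rows with made-up scores, for illustration only.
rows = [
    Row("model-a", None, 90.0),
    Row("model-b", 18.0, 80.0),
    Row("model-c", 20.0, 70.0),
    Row("model-d", None, 95.0),
]

ranked = [r.name for r in sorted(rows, key=order_key)]
print(ranked)  # ['model-c', 'model-b', 'model-d', 'model-a']
```

Note that model-c outranks model-b despite a lower GPQA Diamond figure, because the arena score is compared first.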

LLM Stats also says it measures output throughput and time-to-first-token through standardized prompts routed across major API providers, averaged over a 7-day rolling window. That is a better fit for real usage than a one-off launch benchmark, especially when latency can change with provider load, routing, and model updates.
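A 7-day rolling window of this kind can be sketched as follows. The sample format (timestamp, tokens per second), the `rolling_mean` helper, and the measurements are assumptions for illustration; the article does not say how LLM Stats stores or weights its samples.

```python
from datetime import datetime, timedelta


def rolling_mean(samples, now, window=timedelta(days=7)):
    """Mean of the values whose timestamp falls within `window` before `now`.

    `samples` is an iterable of (timestamp, value) pairs. Returns None when
    no sample falls inside the window.
    """
    cutoff = now - window
    recent = [value for ts, value in samples if cutoff < ts <= now]
    return sum(recent) / len(recent) if recent else None


# Made-up throughput measurements (tokens per second), for illustration only.
now = datetime(2026, 6, 7, 12, 0)
samples = [
    (now - timedelta(days=10), 400.0),  # older than 7 days, ignored
    (now - timedelta(days=6), 100.0),
    (now - timedelta(days=3), 120.0),
    (now - timedelta(hours=1), 140.0),
]

print(rolling_mean(samples, now))  # 120.0
```

The stale 400 tok/s sample drops out of the window, which is why a rolling figure tracks current provider conditions better than a launch-day number.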

“The coding arena is the most discriminating signal at the frontier,” the LLM Stats FAQ says.

That line is the clearest statement of the product’s philosophy. Instead of chasing one universal number, the leaderboard tries to separate the kinds of intelligence that matter in production: code generation, deep reasoning, long-context work, and tool use.

It also explains why the leaderboard can feel different from the usual social-media ranking posts. A model that wins one benchmark may still lose on price, latency, or context length, and the page keeps those trade-offs in view.

How the leaderboard compares on the metrics that matter

The comparison view is where the site becomes genuinely useful. You can put models side by side and inspect code arena, reasoning, math, coding, search, writing, vision, tools, and long-context performance in one place.

That makes it easier to answer questions like: should I pay for a premium closed model, use a cheaper fast model, or pick an open model that is good enough for the task? The answer changes depending on whether you care about throughput, token cost, or a benchmark like SWE-bench Verified.

Claude Opus 4.6 leads coding arena with 21.3
Mercury 2 reaches 925 tok/s, the highest throughput listed
Nemotron 3 Nano costs $0.06 per 1M input tokens, the cheapest input price shown
Grok 4 Fast offers a 2.0M token context window, the largest listed

Those four numbers tell a better story than a single “best model” badge. If you are building agents, the fastest model may matter more than the highest score. If you are doing long-document analysis, context window can beat raw benchmark rank. If you are serving users at scale, token price can be the deciding factor.

For OraCore readers, the practical takeaway is simple: use the leaderboard as a shortlist tool, not an oracle. Start with the metric that matches your product, then compare the top few rows instead of chasing the overall rank.
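As a minimal sketch of that workflow, the snippet below sorts a handful of rows by one chosen metric and keeps the top few. The scores and speeds are the ones quoted earlier in this article; the `shortlist` helper and the dictionary layout are illustrative assumptions, not an LLM Stats API.

```python
# Score and speed (c/s) as quoted in this article.
MODELS = [
    {"name": "Claude Opus 4.6", "score": 2132, "speed": 39},
    {"name": "GPT-5.5", "score": 2105, "speed": 150},
    {"name": "Gemini 3.1 Pro", "score": 2101, "speed": 164},
    {"name": "Qwen3.7 Max", "score": 1634, "speed": 120},
]


def shortlist(models, metric, k=3, higher_is_better=True):
    """Return the names of the top `k` models for a single metric."""
    ranked = sorted(models, key=lambda m: m[metric], reverse=higher_is_better)
    return [m["name"] for m in ranked[:k]]


print(shortlist(MODELS, "score"))  # ['Claude Opus 4.6', 'GPT-5.5', 'Gemini 3.1 Pro']
print(shortlist(MODELS, "speed"))  # ['Gemini 3.1 Pro', 'GPT-5.5', 'Qwen3.7 Max']
```

Switching the metric from score to speed drops Claude Opus 4.6 from the shortlist entirely, which is the point of picking the metric before looking at the rank.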

What changes next for model selection

LLM release cycles are moving fast enough that a static “best model” post goes stale almost immediately. A continuously updated board like this is more useful because it reflects live pricing, fresh benchmark submissions, and provider-side performance changes.

My read: the next phase of model selection will look less like picking one winner and more like choosing among specialized leaders. Coding, reasoning, cheap inference, long context, and tool use are already splitting apart, and this leaderboard makes that split obvious.

If you are shipping with LLMs now, the smart move is to keep a shortlist of two premium models and one budget option, then re-check the numbers whenever your workload changes. The leaderboard already gives you the inputs; the real question is which metric your product should care about first.
