Tag
AI benchmarks
AI benchmarks measure how models perform on reasoning, knowledge QA, coding, and long-context tasks. Scores from tests like ARC-AGI-2, GPQA, and MMLU help compare new releases, track real progress, and expose trade-offs between capability, cost, and reliability.
7 articles

18 AI benchmarks now rank GPT-5.5, Claude, Gemini
LM Council’s June 2026 benchmark hub compares 30+ models across 18 tests, with fresh scores from Epoch AI, Scale AI, and others.

AI Benchmarks 2026: Top Evaluations and Limits
MMLU, HLE, SWE-Bench and agent tests are hitting limits in 2026, while production gaps and contamination keep human review necessary.

LLM Stats makes 300+ AI benchmarks easy to compare
300+ AI and LLM benchmarks sit in one directory, with live leaderboards and verified scores for reasoning, coding, vision, and more.

GPT-5.5 tops Artificial Analysis with score of 60
Artificial Analysis ranks GPT-5.5 (xhigh) first on intelligence with a 60 score, comparing 523 models on speed, price, latency, and context.

Stanford’s 2026 AI Index, explained with charts
Stanford’s 2026 AI Index shows faster adoption, rising costs, and thin US-China gaps. The charts tell a messier story than the hype.

Gemini 3.1 Pro: Google’s new top model in numbers
Gemini 3.1 Pro posts 77.1% on ARC-AGI-2, 94.3% on GPQA Diamond, and a 1M-token context window, while keeping Gemini 3 pricing.

GPT-5.4 vs Claude Opus 4.6: 75% Win Rate
We tested GPT-5.4, Claude Opus 4.6, DeepSeek V4, and Gemini 3.1 across 12 benchmarks. One model won 9 of them.