Tag

AI benchmarks

AI benchmarks measure how models perform on reasoning, knowledge QA, coding, and long-context tasks. Scores from tests like ARC-AGI-2, GPQA, and MMLU help compare new releases, track real progress, and expose trade-offs between capability, cost, and reliability.

7 articles

Research/Jun 17

18 AI benchmarks now rank GPT-5.5, Claude, Gemini

LM Council’s June 2026 benchmark hub compares 30+ models across 18 tests, with fresh scores from Epoch AI, Scale AI, and others.

Research/Jun 14

AI Benchmarks 2026: Top Evaluations and Limits

MMLU, HLE, SWE-Bench and agent tests are hitting limits in 2026, while production gaps and contamination keep human review necessary.

Industry News/Jun 9

LLM Stats makes 300+ AI benchmarks easy to compare

300+ AI and LLM benchmarks sit in one directory, with live leaderboards and verified scores for reasoning, coding, vision, and more.

Tools & Apps/May 23

GPT-5.5 tops Artificial Analysis with score of 60

Artificial Analysis ranks GPT-5.5 (xhigh) first on intelligence with a 60 score, comparing 523 models on speed, price, latency, and context.

Research/Apr 17

Stanford’s 2026 AI Index, explained with charts

Stanford’s 2026 AI Index shows faster adoption, rising costs, and thin US-China gaps. The charts tell a messier story than the hype.

Model Releases/Apr 3

Gemini 3.1 Pro: Google’s new top model in numbers

Gemini 3.1 Pro posts 77.1% on ARC-AGI-2, 94.3% on GPQA Diamond, and a 1M-token context window, while keeping Gemini 3 pricing.

Model Releases/Apr 2

GPT-5.4 vs Claude Opus 4.6: 75% Win Rate

We tested GPT-5.4, Claude Opus 4.6, DeepSeek V4, and Gemini 3.1 across 12 benchmarks. One model won 9 of them.