Tag
benchmark
Benchmarking is how teams check whether models, agents, and compilers hold up under fixed tasks and real constraints. It covers long-horizon reasoning, data-viz workflows, code safety, and performance, while also exposing how much a score can be distorted by the test itself.
12 articles

Rootly benchmark: Llama 4 trails coding models
Rootly AI Labs says Llama 4 lagged coding-focused models on a Mastodon GitHub benchmark, with GPT-4o and Qwen2.5-Coder ahead.

ReproRepo scales reproducibility audits with GitHub issues
ReproRepo uses GitHub issues to scale reproducibility audits for machine learning papers.

ClinHallu maps where medical MLLMs hallucinate
ClinHallu diagnoses where medical MLLM hallucinations come from across vision, knowledge, and reasoning stages.

Fable 5 makes Claude Code feel more like a real colleague
I broke down this review and put together a reusable template for bringing Fable 5 into coding and agent workflows.

EvoArena tests LLM agents in changing worlds
EvoArena benchmarks how LLM agents handle changing environments, and EvoMem adds patch-based memory updates to help them adapt.

GPT-5.5 scores 62.5 on Every’s engineer test
Every says GPT-5.5 beat Opus 4.7 on its Senior Engineer Benchmark, scoring 62.5 on its best run and landing as OpenAI’s work model.

Cattle Trade benchmarks LLM bluffing and bargaining
Cattle Trade is a multi-agent benchmark for testing how LLMs bluff, bid, and bargain in negotiation tasks.

EntityBench Tackles Long-Range Video Consistency
EntityBench measures whether video models keep characters, objects, and locations consistent across long, multi-shot sequences.

LongMemEval-V2 tests agent memory in web workflows
A new benchmark checks whether agent memory can retain web-environment experience, not just user history, and improve long-term task recall.

When LLMs Stop Following Procedural Steps
A diagnostic benchmark shows LLMs lose procedural fidelity as step counts grow, even when the arithmetic stays simple.

ASMR-Bench Tests Sabotage Detection in ML Code
ASMR-Bench probes whether auditors can spot subtle sabotage in ML research codebases, and the answer so far is: not reliably.

LongCoT Benchmark: 2,500-Problem Long-Horizon Reasoning
LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.