Tag

benchmark

Benchmarking is how teams check whether models, agents, and compilers hold up under fixed tasks and real constraints. It covers long-horizon reasoning, data-viz workflows, code safety, and performance, while also exposing how much a score can be distorted by the test itself.

12 articles

Research/Jun 22

Rootly benchmark: Llama 4 trails coding models

Rootly AI Labs says Llama 4 lagged coding-focused models on a Mastodon GitHub benchmark, with GPT-4o and Qwen2.5-Coder ahead.

Research/Jun 17

ReproRepo scales reproducibility audits with GitHub issues

ReproRepo uses GitHub issues to scale reproducibility audits for machine learning papers.

Research/Jun 15

ClinHallu maps where medical MLLMs hallucinate

ClinHallu diagnoses where medical MLLM hallucinations come from across vision, knowledge, and reasoning stages.

AI Agent/Jun 13

Fable 5 makes Claude Code feel more like a real coworker

I broke down this review and put together a reusable template for bringing Fable 5 into coding and agent workflows.

Research/Jun 12

EvoArena tests LLM agents in changing worlds

EvoArena benchmarks how LLM agents handle changing environments, and EvoMem adds patch-based memory updates to help them adapt.

Model Releases/May 23

GPT-5.5 scores 62.5 on Every’s engineer test

Every says GPT-5.5 beat Opus 4.7 on its Senior Engineer Benchmark, scoring 62.5 on its best run and landing as OpenAI’s work model.

Research/May 18

Cattle Trade benchmarks LLM bluffing and bargaining

Cattle Trade is a multi-agent benchmark for testing how LLMs bluff, bid, and bargain in negotiation tasks.

Research/May 16

EntityBench Tackles Long-Range Video Consistency

EntityBench measures whether video models keep characters, objects, and locations consistent across long, multi-shot sequences.

Research/May 13

LongMemEval-V2 tests agent memory in web workflows

A new benchmark checks whether agent memory can retain web-environment experience, not just user history, and improve long-term task recall.

Research/May 4

When LLMs Stop Following Procedural Steps

A diagnostic benchmark shows LLMs lose procedural fidelity as step counts grow, even when the arithmetic stays simple.

Research/Apr 20

ASMR-Bench Tests Sabotage Detection in ML Code

ASMR-Bench probes whether auditors can spot subtle sabotage in ML research codebases, and the answer so far is: not reliably.

Research/Apr 16

LongCoT Benchmark: 2,500-Problem Long-Horizon Reasoning

LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.