Tag
LLM evaluation
LLM evaluation examines whether models reason, judge, and stay consistent beyond producing a plausible answer. It spans long-horizon benchmarks like LongCoT, ASR quality assessment, and agreement with human labels on tasks where accuracy alone misses real failure modes.
8 articles

Measuring when LLM behavior actually переносится
A new framework tests whether an LLM’s behavior transfers across payoff-equivalent decision environments.

AI Benchmarks 2026: Top Evaluations and Limits
MMLU, HLE, SWE-Bench and agent tests are hitting limits in 2026, while production gaps and contamination keep human review necessary.

Confident AI’s guide to LLM evaluation metrics
Confident AI explains how to score LLMs with metrics that match correctness, relevance, hallucination, and agent task completion.

Cattle Trade benchmarks LLM bluffing and bargaining
Cattle Trade is a multi-agent benchmark for testing how LLMs bluff, bid, and bargain in negotiation tasks.

DeepTest 2026 benchmarks an LLM car manual assistant
DeepTest’s first LLM testing competition compared four tools on car manual retrieval, showing how to benchmark automotive assistants.

Why Databricks RAG Is a Platform Play, Not a Feature
Databricks treats RAG as an end-to-end platform problem, and that is the right way to build it.

LLMs for ASR Evaluation: Beyond WER
This paper tests decoder-based LLMs as ASR evaluators and finds they beat WER on human agreement, with 92–94% on one task.

LongCoT Benchmark: 2,500-Probl. Long-Horizon Reasoning
LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.