Tag

LLM evaluation

LLM evaluation examines whether models reason, judge, and stay consistent beyond producing a plausible answer. It spans long-horizon benchmarks like LongCoT, ASR quality assessment, and agreement with human labels on tasks where accuracy alone misses real failure modes.

8 articles

Research/Jun 29

Measuring when LLM behavior actually transfers

A new framework tests whether an LLM’s behavior transfers across payoff-equivalent decision environments.

Research/Jun 14

AI Benchmarks 2026: Top Evaluations and Limits

MMLU, HLE, SWE-Bench and agent tests are hitting limits in 2026, while production gaps and contamination keep human review necessary.

Research/May 19

Confident AI’s guide to LLM evaluation metrics

Confident AI explains how to score LLMs with metrics that match correctness, relevance, hallucination, and agent task completion.

Research/May 18

Cattle Trade benchmarks LLM bluffing and bargaining

Cattle Trade is a multi-agent benchmark for testing how LLMs bluff, bid, and bargain in negotiation tasks.

Research/May 6

DeepTest 2026 benchmarks an LLM car manual assistant

DeepTest’s first LLM testing competition compared four tools on car manual retrieval, showing how to benchmark automotive assistants.

Industry News/May 5

Why Databricks RAG Is a Platform Play, Not a Feature

Databricks treats RAG as an end-to-end platform problem, and that is the right way to build it.

Research/Apr 24

LLMs for ASR Evaluation: Beyond WER

This paper tests decoder-based LLMs as ASR evaluators and finds they beat WER on human agreement, with 92–94% on one task.

Research/Apr 16

LongCoT Benchmark: 2,500-Problem Long-Horizon Reasoning

LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.