Tag
LLM judges
2 articles

Research/May 14
Judge Reliability Harness Stress-Tests LLM Judges
A harness probes how LLM judges' verdicts change under formatting changes, paraphrasing, added verbosity, and flipped labels.

Research/Apr 17
How to Trust LLM Judges, Per Input
A diagnostic toolkit shows LLM judges can look stable on average while still being unreliable on individual inputs.