<h1>LLMs for ASR Evaluation: Beyond WER</h1>
<p>Automatic speech recognition still leans heavily on Word Error Rate, but WER only counts text mismatches, not whether the transcript preserved the meaning. This paper, <a href="https://arxiv.org/abs/2604.21928">Evaluation of Automatic Speech Recognition Using Generative Large Language Models</a>, asks a practical question: can generative LLMs do a better job of judging ASR output the way humans do?</p>
<p>The short answer from the abstract is yes, at least in the tasks tested. The authors evaluate decoder-based LLMs as ASR evaluators in three different ways: picking the better of two hypotheses, measuring semantic distance through generative embeddings, and classifying errors qualitatively. For engineers building or evaluating speech systems, that matters because the metric you choose can change what you think your model is good at.</p>
<h2>What problem this paper is trying to fix</h2>
<p>WER is popular because it is simple and easy to compute, but the paper points out its core weakness: it is insensitive to meaning.
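<p>Concretely, WER scores surface edits and nothing else. A minimal sketch with a hand-rolled word-level edit distance (the example sentences are mine, not from the paper) shows two hypotheses earning the identical WER even though only one keeps the meaning:</p>

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

ref = "please transfer two hundred dollars today"
hyp_benign = "please transfer two hundred bucks today"     # wording change, same intent
hyp_harmful = "please transfer ten hundred dollars today"  # changes the amount

print(wer(ref, hyp_benign))   # one substitution out of six words
print(wer(ref, hyp_harmful))  # also one substitution: same WER, different meaning
```

<p>Both hypotheses score 1/6, yet only the second one corrupts the instruction, which is exactly the blind spot the paper targets.</p>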
Two transcripts can have the same WER while one preserves the speaker’s intent much better than the other. That makes WER a blunt instrument when you care about user experience, downstream NLP, or whether an ASR system actually got the message across.</p>
<figure class="my-6"><img src="https://xxdpdyhzhpamafnrdkyq.supabase.co/storage/v1/object/public/covers/inline-1777010993439-cjdi.png" alt="LLMs for ASR Evaluation: Beyond WER" class="rounded-xl w-full" loading="lazy" /></figure>
<p>That gap has pushed researchers toward embedding-based semantic metrics, which try to compare meaning instead of exact word overlap. The authors say those metrics are better correlated with human perception, but they also note that decoder-based LLMs have been underexplored for this role. In other words, the field has tools that look more semantic than WER, but it has not fully tested whether generative LLMs can be used directly as evaluators.</p>
<p>This is a real engineering problem, not just an academic one. If you are tuning an ASR model, comparing decoding strategies, or deciding whether a change improved product quality, you need evaluation that tracks what humans actually notice. A metric that misses semantic errors can send you optimizing the wrong thing.</p>
<h2>How the method works in plain English</h2>
<p>The paper studies decoder-based LLMs from three angles. First, it asks the model to select the best hypothesis between two ASR candidates. That is the simplest setup: given two possible transcripts, which one is closer to what was spoken?</p>
<p>Second, it uses generative embeddings to compute semantic distance. The idea here is that the model’s internal representation can be used as a signal for how close two outputs are in meaning, not just in word choice.
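<p>The abstract does not spell out how that distance is computed, but embedding metrics typically reduce to a cosine comparison. A framework-free sketch, with tiny placeholder vectors standing in for real LLM embeddings (the helper names and toy numbers are mine):</p>

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 means same direction, larger means further apart."""
    return 1.0 - cosine_similarity(a, b)

# Placeholder vectors; in practice these would come from a model's hidden states.
ref_emb = [0.9, 0.1, 0.3]
hyp_close = [0.85, 0.15, 0.32]   # paraphrase-like hypothesis
hyp_far = [0.1, 0.9, 0.5]        # meaning-changing hypothesis

print(semantic_distance(ref_emb, hyp_close) < semantic_distance(ref_emb, hyp_far))  # True
```

<p>The point of the sketch is only the shape of the computation: a meaning-preserving hypothesis should sit closer to the reference in embedding space than a meaning-changing one, regardless of surface word overlap.</p>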
The paper compares this against encoder-based approaches and against semantic metrics more broadly.</p>
<p>Third, it uses LLMs for qualitative error classification. That means the model is not only scoring outputs, but also helping label what kind of mistake occurred. For developers, that is potentially useful because it moves evaluation from a single scalar score toward something you can inspect and act on.</p>
<p>The abstract does not give implementation details like prompts, decoding settings, or the exact evaluation protocol beyond the HATS dataset. So while the direction is clear, the public summary does not provide enough to reproduce every step from the abstract alone.</p>
<h2>What the paper actually shows</h2>
<p>The clearest result is on hypothesis selection. On the HATS dataset, the best LLMs achieve 92–94% agreement with human annotators when choosing between two candidates. The abstract says that compares with 63% for WER, and that the LLMs also outperform semantic metrics in this task.</p>
<figure class="my-6"><img src="https://xxdpdyhzhpamafnrdkyq.supabase.co/storage/v1/object/public/covers/inline-1777011001966-z641.png" alt="LLMs for ASR Evaluation: Beyond WER" class="rounded-xl w-full" loading="lazy" /></figure>
<p>That is a meaningful result because it frames the evaluation problem in human terms. If a metric agrees with annotators more often, it is more likely to reflect what people perceive as the better transcript. In this case, the LLM-based approach appears much closer to human judgment than WER.</p>
<p>The second result is about embeddings. The paper says embeddings from decoder-based LLMs show performance comparable to encoder models. That is important because decoder models are usually discussed as generation engines, not as feature extractors for evaluation.
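<p>The abstract does not say how decoder states get pooled into a single embedding; a common recipe (an assumption here, not the paper's stated method) is masked mean pooling over the last hidden layer. A framework-free sketch of that pooling step, with dummy token vectors standing in for real hidden states:</p>

```python
def masked_mean_pool(hidden_states: list[list[float]],
                     attention_mask: list[int]) -> list[float]:
    """Average the hidden-state vectors of real (non-padding) tokens.

    hidden_states: one vector per token position.
    attention_mask: 1 for real tokens, 0 for padding.
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    return [x / count for x in total]

# Three token vectors of dimension 2; the last position is padding.
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(masked_mean_pool(states, mask))  # [2.0, 3.0]
```

<p>With a real model you would take the final-layer hidden states from a forward pass, pool them like this, and compare the result against the reference transcript's embedding with a cosine metric.</p>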
The takeaway is that you may not need a separate encoder-only stack to get useful semantic signals.</p>
<p>The third result is more qualitative: LLMs offer a promising direction for interpretable and semantic ASR evaluation. The abstract does not provide benchmark numbers for the error-classification task, so there is no numeric claim to report there. What it does provide is a direction: LLMs may help explain not just whether an ASR output is wrong, but how it is wrong.</p>
<ul>
<li>Hypothesis selection: 92–94% agreement with human annotators for the best LLMs</li>
<li>WER baseline: 63% agreement in the same task</li>
<li>Semantic metrics: outperformed by the best LLMs in hypothesis selection</li>
<li>Embeddings: decoder-based LLM embeddings were comparable to encoder models</li>
</ul>
<h2>Why developers should care</h2>
<p>If you ship speech features, the metric you use shapes your product decisions. WER is still useful for regression testing and broad comparisons, but it can miss cases where two transcripts differ in wording while preserving the same intent, or match closely in wording while losing it. A semantic evaluator gives you a second lens that is closer to perceived quality.</p>
<p>This paper suggests that generative LLMs could be part of that evaluation stack. That does not mean replacing WER everywhere. It means adding a semantic layer for cases where exact overlap is not enough: conversational assistants, voice search, transcription for downstream summarization, or any workflow where meaning matters more than surface form.</p>
<p>There is also a practical debugging angle. Qualitative error classification can help teams understand whether failures are due to substitutions, omissions, or more semantic issues that a word-level metric would flatten.
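<p>The abstract does not publish its classification scheme, so this sketch invents a small taxonomy, a prompt, and a reply parser purely for illustration; the label set, wording, and helper names are all assumptions, not the paper's protocol:</p>

```python
# Hypothetical error taxonomy; the paper's actual label set is not in the abstract.
ERROR_LABELS = ["substitution", "omission", "insertion", "semantic"]

def build_classification_prompt(reference: str, hypothesis: str) -> str:
    """Assemble a prompt asking an LLM judge to name the dominant error type."""
    return (
        "You are grading a speech-recognition transcript.\n"
        f"Reference: {reference}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Answer with exactly one word from: {', '.join(ERROR_LABELS)}."
    )

def parse_label(reply: str) -> str:
    """Map a free-form LLM reply onto the taxonomy; fall back to 'semantic'."""
    reply = reply.strip().lower()
    for label in ERROR_LABELS:
        if label in reply:
            return label
    return "semantic"

prompt = build_classification_prompt("turn off the lights", "turn on the lights")
print(parse_label("This looks like a Substitution error."))  # substitution
```

<p>The parser matters in practice: LLM judges rarely answer with a bare label, so pinning replies to a fixed taxonomy is what makes the output aggregatable across a test set.</p>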
That can inform data collection, decoding changes, or post-processing decisions.</p>
<h2>Limitations and open questions</h2>
<p>The abstract is promising, but it is also limited. It names one dataset, HATS, and does not provide broader cross-dataset evidence in the summary. That means we should be careful about treating the reported numbers as universal.</p>
<p>It also does not include the details you would want before adopting the approach in production: which LLMs were tested, how prompts were designed, how stable the judgments are, or what the compute cost looks like. Those details matter because evaluation tools have to be cheap, repeatable, and predictable if they are going to fit into CI pipelines or large-scale experiments.</p>
<p>Another open question is whether the same gains hold outside pairwise hypothesis selection. Choosing between two transcripts is a constrained task; scoring arbitrary ASR outputs at scale is harder. The abstract suggests decoder-based embeddings are competitive, but it does not yet show a full replacement for established metrics.</p>
<p>Still, the paper makes a strong case that ASR evaluation should move beyond exact word matching when the goal is to measure meaning. For developers, the practical message is straightforward: WER is not the whole story, and LLM-based semantic evaluation may be a better fit when human perception is the real target.</p>
<p>In that sense, this paper is less about declaring WER obsolete and more about expanding the toolkit.
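<p>The pairwise hypothesis-selection task is the easiest piece to slot next to WER in an existing harness. A minimal sketch, where the prompt format, the A/B answer convention, and the injectable <code>ask_llm</code> callable are my assumptions rather than the paper's protocol:</p>

```python
from typing import Callable

def pick_better_hypothesis(reference: str, hyp_a: str, hyp_b: str,
                           ask_llm: Callable[[str], str]) -> str:
    """Ask an LLM judge which hypothesis better matches the reference.

    ask_llm is any callable that sends a prompt to a model and returns its
    text reply; the judge is expected to answer with the letter A or B.
    """
    prompt = (
        "Which transcript better preserves the meaning of the reference?\n"
        f"Reference: {reference}\n"
        f"A: {hyp_a}\n"
        f"B: {hyp_b}\n"
        "Answer with the single letter A or B."
    )
    reply = ask_llm(prompt).strip().upper()
    return hyp_a if reply.startswith("A") else hyp_b

def stub_judge(prompt: str) -> str:
    return "A"  # canned reply; a real judge would call a model API here

winner = pick_better_hypothesis(
    "book a table for two",
    "book a table for two please",
    "look at the cable for you",
    ask_llm=stub_judge)
print(winner)  # book a table for two please
```

<p>Keeping the model call behind a plain callable makes the harness cheap to unit-test and lets you swap judges (or fall back to WER) without touching the evaluation loop.</p>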
If you build or evaluate speech systems, that is the kind of shift worth paying attention to.</p>