[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-llm-evaluation":3},{"tag":4,"articles":11,"peer_article_count":71},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"01e6c9b3-37d1-4d59-962e-34209b71a5cb","LLM evaluation","llm-evaluation",3,"LLM 評估關注模型是否真的理解與推理，而不只是答對單題。常見面向包括長鏈推理、ASR 轉寫品質判定、與人類標註一致性，以及在多步驟任務中維持穩定表現的能力。","LLM evaluation examines whether models reason, judge, and stay consistent beyond producing a plausible answer. It spans long-horizon benchmarks like LongCoT, ASR quality assessment, and agreement with human labels on tasks where accuracy alone misses real failure modes.",[12,21,28,35,42,49,57,64],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"c522f9af-2862-4f1c-bbf9-99bc20c78544","measuring-llm-behavior-portability-en","Measuring when LLM behavior actually transfers","A new framework tests whether an LLM’s behavior transfers across payoff-equivalent decision environments.","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782717476648-9gjo.png","en","2026-06-29T07:17:30.115953+00:00",{"id":22,"slug":23,"title":24,"summary":25,"category":17,"image_url":26,"cover_image":26,"language":19,"created_at":27},"e891adc0-af64-41c7-bb41-d75e6506d388","ai-benchmarks-2026-evaluations-limits-en","AI Benchmarks 2026: Top Evaluations and Limits","MMLU, HLE, SWE-Bench and agent tests are hitting limits in 2026, while production gaps and contamination keep human review 
necessary.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781381870944-h208.png","2026-06-13T20:17:26.361723+00:00",{"id":29,"slug":30,"title":31,"summary":32,"category":17,"image_url":33,"cover_image":33,"language":19,"created_at":34},"180a8696-ada6-43c3-ac47-5b6cea8e0b31","confident-ai-llm-evaluation-metrics-guide-en","Confident AI’s guide to LLM evaluation metrics","Confident AI explains how to score LLMs with metrics that match correctness, relevance, hallucination, and agent task completion.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779178451812-i778.png","2026-05-19T08:13:46.826703+00:00",{"id":36,"slug":37,"title":38,"summary":39,"category":17,"image_url":40,"cover_image":40,"language":19,"created_at":41},"653c628b-7930-4183-9dbc-8e50cf85c479","cattle-trade-llm-bluffing-bargaining-benchmark-en","Cattle Trade benchmarks LLM bluffing and bargaining","Cattle Trade is a multi-agent benchmark for testing how LLMs bluff, bid, and bargain in negotiation tasks.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779085436536-nesm.png","2026-05-18T06:23:28.591525+00:00",{"id":43,"slug":44,"title":45,"summary":46,"category":17,"image_url":47,"cover_image":47,"language":19,"created_at":48},"7ac3d870-d844-4d95-a287-81b22dfa9eca","deeptest-2026-llm-car-manual-assistant-en","DeepTest 2026 benchmarks an LLM car manual assistant","DeepTest’s first LLM testing competition compared four tools on car manual retrieval, showing how to benchmark automotive 
assistants.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778048468789-e7sx.png","2026-05-06T06:20:33.071908+00:00",{"id":50,"slug":51,"title":52,"summary":53,"category":54,"image_url":55,"cover_image":55,"language":19,"created_at":56},"b2450abd-b108-4e4d-b1d7-1b02c17db850","why-databricks-rag-is-platform-play-not-feature-en","Why Databricks RAG Is a Platform Play, Not a Feature","Databricks treats RAG as an end-to-end platform problem, and that is the right way to build it.","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777959651374-avrm.png","2026-05-05T05:40:30.329823+00:00",{"id":58,"slug":59,"title":60,"summary":61,"category":17,"image_url":62,"cover_image":62,"language":19,"created_at":63},"32cc2350-8bcf-4970-9bcd-900a05441f2f","llms-for-asr-evaluation-beyond-wer-en","LLMs for ASR Evaluation: Beyond WER","This paper tests decoder-based LLMs as ASR evaluators and finds they beat WER on human agreement, with 92–94% on one task.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777010993439-cjdi.png","2026-04-24T06:09:38.008767+00:00",{"id":65,"slug":66,"title":67,"summary":68,"category":17,"image_url":69,"cover_image":69,"language":19,"created_at":70},"9f62add5-cae5-47eb-abd5-2e56d0d5698c","longcot-long-horizon-chain-of-thought-benchmark-en","LongCoT Benchmark: 2,500-Problem Long-Horizon Reasoning","LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776319782523-s0wz.png","2026-04-16T06:09:23.265233+00:00",2]