[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-llm-benchmarks":3},{"tag":4,"articles":11,"peer_article_count":72},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"ee654d61-465d-4eec-8060-5b4afb694d7b","LLM benchmarks","llm-benchmarks",3,"LLM 基準測試用來比較模型在知識、數學推理、幻覺率、長上下文與對話品質上的表現，像 BenchLM、AIME 這類榜單常反映模型升級的實際差異，也影響選型與部署判斷。","LLM benchmarks compare models across knowledge, math reasoning, hallucination rate, long-context handling, and chat quality. Results from tests like BenchLM or AIME help teams judge real capability, not just model size or release hype.",[12,21,29,36,43,51,58,65],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"aa623191-8abe-4e33-84ed-a52a431716c1","llm-stats-ai-benchmarks-compare-en","LLM Stats makes 300+ AI benchmarks easy to compare","300+ AI and LLM benchmarks sit in one directory, with live leaderboards and verified scores for reasoning, coding, vision, and more.","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780973273971-9hfy.png","en","2026-06-09T02:47:23.038487+00:00",{"id":22,"slug":23,"title":24,"summary":25,"category":26,"image_url":27,"cover_image":27,"language":19,"created_at":28},"7a07f021-272f-480d-87c1-a76b203f9b71","2026-domain-specific-llm-benchmarks-map-en","2026 domain-specific LLM benchmarks map","Kili Technology maps 2026 vertical LLM benchmarks across medicine, law, finance, code, cybersecurity, multilingual, and multimodal use cases.","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779649574153-4g4j.png","2026-05-24T19:05:43.423296+00:00",{"id":30,"slug":31,"title":32,"summary":33,"category":17,"image_url":34,"cover_image":34,"language":19,"created_at":35},"9b2db204-7090-4a48-85e0-65693e66152e","5-llm-benchmarks-for-business-buyers-2026-en","5 LLM benchmarks for business buyers in 2026","5 benchmarks show what frontier models can do, where scores fail, and which tests matter most for business use in 2026.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779161052982-t6g1.png","2026-05-19T03:23:41.513761+00:00",{"id":37,"slug":38,"title":39,"summary":40,"category":17,"image_url":41,"cover_image":41,"language":19,"created_at":42},"11b9773e-13af-447d-b9a1-7d3232201e4f","why-llm-leaderboards-are-wrong-about-model-quality-en","Why LLM Leaderboards Are Wrong About Model Quality","LLM leaderboards are useful, but they are the wrong way to choose a model for production.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778743847206-191w.png","2026-05-14T07:30:26.134864+00:00",{"id":44,"slug":45,"title":46,"summary":47,"category":48,"image_url":49,"cover_image":49,"language":19,"created_at":50},"0c006cb0-0acc-43c4-baba-ab78092f0d9b","kimi-k2-6-benchlm-2026-scores-en","Kimi K2.6 Scores: BenchLM’s 2026 Breakdown","Kimi K2.6 ranks #12 overall on BenchLM, with strong coding and agentic scores, plus a 256K context window and open weights.","model-release","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777900276785-cezo.png","2026-05-04T13:10:39.364394+00:00",{"id":52,"slug":53,"title":54,"summary":55,"category":48,"image_url":56,"cover_image":56,"language":19,"created_at":57},"cb45188a-2e6e-4ac7-95f0-39cbd2f7d7a2","gpt-5-4-benchmarks-2026-scores-rankings-en","GPT-5.4 Scores 97.6 in Knowledge Benchmarks","GPT-5.4 tops knowledge benchmarks with 97.6, ranks #2 overall on BenchLM, and posts a 1.05M-token context window.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776082204490-nq2r.png","2026-04-13T12:09:40.792366+00:00",{"id":59,"slug":60,"title":61,"summary":62,"category":26,"image_url":63,"cover_image":63,"language":19,"created_at":64},"1433056d-0745-485f-9501-b6ce042e5516","aime-2026-leaderboard-qwen-leads-math-tests-en","AIME 2026 leaderboard: Qwen leads math tests","Qwen3.6 Plus tops the AIME 2026 math benchmark with 0.953, while 8 models show a wide gap in olympiad-style reasoning.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775179307904-87vj.png","2026-04-03T01:21:30.991592+00:00",{"id":66,"slug":67,"title":68,"summary":69,"category":48,"image_url":70,"cover_image":70,"language":19,"created_at":71},"a1ce1fa4-f4d5-4e96-93dc-2c39628ec0a3","grok-41-xai-quieter-upgrade-matters-en","Grok 4.1: xAI’s quieter upgrade that matters","xAI’s Grok 4.1 cuts hallucinations, boosts chat quality, and adds Fast and Thinking modes with 256k context and 2M-token API support.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775175352422-pgev.png","2026-04-03T00:15:30.256357+00:00",4]