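[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-benchmark-harness-quality-beats-model-hype-coding-zh":3,"article-related-benchmark-harness-quality-beats-model-hype-coding-zh":31,"series-ai-agent-8fe481ef-010f-431b-a837-22ccafa68438":74},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"8fe481ef-010f-431b-a837-22ccafa68438","benchmark-harness-quality-beats-model-hype-coding-zh","This coding benchmark proves harness quality beats model hype","\u003Cp data-speakable=\"summary\">This article argues that when you evaluate coding models, the outcome is decided not by the model's brand but by the design quality of the \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> harness.\u003C\u002Fp>\u003Cp>The llm-coding-benchmark repo on GitHub offers direct evidence: if you want to judge a coding model, the harness matters more than the name on the model card. It compares \u003Ca href=\"\u002Fnews\u002Fopen-source-ai-music-generators-self-hosted-zh\">open-source\u003C\u002Fa> and commercial LLMs against the same Rails brief, using standardized metadata, raw logs, and a 0-to-100 rubric that focuses on deliverables, API correctness, tests, error handling, persistence, Hotwire, architecture, and production readiness. This design makes one thing very clear: the same model can look strong in one \u003Ca href=\"\u002Fnews\u002Fllm-fine-tuning-production-2026-zh\">environment\u003C\u002Fa> and fail outright in another; conversely, a cheaper model with a tighter workflow can beat a better-known rival.\u003C\u002Fp>\u003Ch2>First argument\u003C\u002Fh2>\u003Cp>The strongest thing about this benchmark is that it evaluates whether the code can actually ship, not benchmark theater. A model that writes many files but hallucinates RubyLLM APIs that do not exist loses points; a model that writes fewer tests but uses the correct signatures, handles errors, and validates boot behavior scores higher. The repo itself calls out this difference: Kimi K2.5 reportedly wrote 37 tests but did not mock RubyLLM correctly, while Gemini 3.1 Pro wrote only 11 tests but used the correct FakeChat signature and therefore scored higher on test quality.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782253062596-f192.png\" alt=\"This coding benchmark proves harness quality beats model hype\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>This leads to a practical conclusion: a benchmark that only measures output volume is easy to fool, and only a benchmark that measures correctness holds up over time. The rubric treats a hallucinated API as a failure even when every test is green, because a green light built on a fake interface is worthless. For a coding agent, one wrong method call or one wrong client class does not show up in the markdown summary; it shows up when the app boots, when compose starts, or when the first real request arrives.\u003C\u002Fp>\u003Ch2>Second argument\u003C\u002Fh2>\u003Cp>The repo's more important finding is not that one model is always best, but that the orchestration layer can substantially change the quality of the same model. The README states directly that the same \u003Ca href=\"\u002Ftag\u002Fopus-47\">Opus 4.7\u003C\u002Fa> produced Tier A \u003Ca href=\"\u002Fnews\u002Fcodex2api-local-deploy-risk-control-notes-zh\">code\u003C\u002Fa> in opencode but reached only Tier 2 or 3 in \u003Ca href=\"\u002Ftag\u002Fclaude-code\">Claude Code\u003C\u002Fa>, because in the latter environment it hallucinated chat.complete. This is no small difference; it shows that the surrounding agent loop can preserve or distort the model's reasoning about the task.\u003C\u002Fp>\u003Cp>The DeepSeek V4 Pro case is even more convincing. In opencode it could not even be measured at first because of a reasoning_content interop bug; only after switching to \u003Ca href=\"\u002Ftag\u002Fclaude\">Claude\u003C\u002Fa> Code, via the deepclaude env-swap shim and OpenRouter's \u003Ca href=\"\u002Ftag\u002Fanthropic\">Anthropic\u003C\u002Fa>-compatible endpoint, did it reach Tier A, with scores of 84 and 89. It is the same model, yet the results differ widely because the harness differs. That means the benchmark operator is not a neutral observer but part of the experiment, and their implementation choices directly change the results.\u003C\u002Fp>\u003Ch2>What the other side might say\u003C\u002Fh2>\u003Cp>There is a reasonable rebuttal to this argument: if a benchmark revolves around one Rails app, one prompt family, and one agent stack, it can easily overfit to the evaluator's preferences. The 0-to-100 rubric is rigorous, but it is still a human-made rubric. Different teams care about different things: a startup may value speed and partial correctness, a platform team may value maintainability, and the benchmark may not capture all of that.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782253064672-hhdz.png\" alt=\"This coding benchmark proves harness quality beats model hype\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The strongest part of this criticism is that a benchmark should not be mistaken for the whole of product evaluation. It is not. It is a controlled stress test, and any controlled stress test necessarily simplifies reality.\u003C\u002Fp>\u003Cp>But this counterargument does not overturn the repo's core claim. A benchmark does not need to simulate every production environment to reveal which models hallucinate APIs, which survive boot validation, and which keep persistence and tests aligned with the brief. These are general failure modes, not niche preferences. The repo's value is that it pushes noise low enough to make differences in correctness visible; once a model fails against a real interface or a real compose check, no amount of benchmark skepticism will rescue it for serious coding work.\u003C\u002Fp>\u003Ch2>What you can do\u003C\u002Fh2>\u003Cp>If you are an engineer, stop looking only at leaderboard rank and demand harness transparency instead: inspect the prompts, validation steps, runtime checks, and failure modes. If you are a PM or founder, choose models and agents by whether they can produce correct code end to end in your own stack, not by general hype. The real question is not \"which model scores highest\" but \"which model can deliver correct code in my environment, with my tools, at a cost I can defend.\" The repo's answer is clear: the deciding factor is not the model's name but the discipline the benchmark imposes on it.\u003C\u002Fp>","This article argues that when you evaluate coding models, the outcome is decided not by the model's brand but by the design quality of the benchmark harness.","github.com","https:\u002F\u002Fgithub.com\u002Fakitaonrails\u002Fllm-coding-benchmark",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782253062596-f192.png","ai-agent","zh","8dbcd7ac-bae7-46c1-ba11-bdca1fd774e8",[17,18,19,20,21],"coding benchmark","harness quality","LLM coding agents","model evaluation","benchmark design",[23,24,25,26],"Coding benchmark rankings depend heavily on harness design, not just on the model brand","Correctness, boot validation, and real API interfaces reflect shippability better than output volume or test count","Under different agent orchestration, the same model can drop from Tier A to the middle or lower tiers","When choosing a model, look first at end-to-end correctness in your own environment, then at cost and reputation",0,"2026-06-23T22:17:21.208723+00:00","2026-06-23T22:17:21.196+00:00","e3b68196-9e64-4c18-a3b6-a73e73bfb367",{"tags":32,"relatedLang":33,"relatedPosts":37},[],{"id":15,"slug":34,"title":35,"language":36},"benchmark-harness-quality-beats-model-hype-coding-en","This benchmark proves harness quality beats model hype in coding","en",[38,44,50,56,62,68],{"id":39,"slug":40,"title":41,"cover_image":42,"image_url":42,"created_at":43,"category":13},"bd553163-18b3-46ba-b285-2a87d2ebbb71","glm-5-kill-vibe-coding-agent-engineering-zh","GLM-5 is right: kill vibe coding and do agent engin…","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782223378474-8fp8.png","2026-06-23T14:02:23.769355+00:00",{"id":45,"slug":46,"title":47,"cover_image":48,"image_url":48,"created_at":49,"category":13},"c615cb9a-1006-4f70-ae81-c0bc61b85dee","loop-engineering-claude-code-workflow-zh","Loop Engineering：Claude Code 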
and its new way of working","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782205389495-3rvj.png","2026-06-23T09:02:37.400033+00:00",{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":13},"b3231c66-e646-4d3c-8e7a-54e761e9b891","fable-5-ban-model-routing-race-zh","The Fable 5 ban exposes the model-routing race","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782145076193-i2y3.png","2026-06-22T16:17:25.211477+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":13},"cffe7c8f-87e9-4b0f-8846-bab013c737ff","myseum-scanon-privacy-first-moderation-bet-zh","The Myseum and Scanon partnership is a sound bet on privacy-first moderation","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782029864265-gmjj.png","2026-06-21T08:17:20.167199+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":13},"98c0c178-9d3c-42d6-b4c9-afee24f127db","ai-code-review-rollout-with-human-oversight-zh","Rolling out AI code review without lowering quality","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782025372703-pyzb.png","2026-06-21T07:02:25.569045+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":13},"6940e45a-5ea6-4e88-a6ec-4fd6c4e98546","crypto-ai-agents-hidden-model-risk-zh","The hidden model risk in crypto AI agents","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782023574666-eas6.png","2026-06-21T06:32:27.289175+00:00",[75,80,85,90,95,100,105,110,115,120],{"id":76,"slug":77,"title":78,"created_at":79},"4ae1e197-1d3d-4233-8733-eafe9cb6438b","claude-now-uses-your-pc-to-finish-tasks-zh","Claude starts operating your computer for you","2026-03-26T07:20:48.457387+00:00",{"id":81,"slug":82,"title":83,"created_at":84},"5bede67f-e21c-413d-9ab8-54a3c3d26227","googles-2026-ai-agent-report-decoded-zh","Google's 2026 AI Agent report, decoded","2026-03-26T11:15:22.651956+00:00",{"id":86,"slug":87,"title":88,"created_at":89},"2987d097-563f-46c7-b76f-b558d8ef7c2b","kimi-k25-review-stronger-still-not-legend-zh","Kimi K2.5 review: stronger, but still not a legend","2026-03-27T07:15:55.277513+00:00",{"id":91,"slug":92,"title":93,"created_at":94},"95c9053b-e3f4-4cb5-aace-5c54f4c9e044","claude-code-controls-mac-desktop-zh","Claude Code can now control the Mac too","2026-03-28T03:01:58.58121+00:00",{"id":96,"slug":97,"title":98,"created_at":99},"dc58e153-e3a8-4c06-9b96-1aa64eabbf5f","cloudflare-100x-faster-ai-agent-sandbox-zh","Cloudflare's AI sandbox runs very fast","2026-03-28T03:09:44.142236+00:00",{"id":101,"slug":102,"title":103,"created_at":104},"1c8afc56-253f-47a2-979f-1065ff072f2a","openai-backs-isara-agent-swarm-bet-zh","OpenAI backs Isara's agent swarm …","2026-03-28T03:15:27.513155+00:00",{"id":106,"slug":107,"title":108,"created_at":109},"7379b422-576e-45df-ad5a-d57a0d9dd467","openai-plan-automated-ai-researcher-zh","OpenAI wants to build an automated AI researcher","2026-03-28T03:17:42.090548+00:00",{"id":111,"slug":112,"title":113,"created_at":114},"48c9889e-86df-450b-a356-e4a4b7c83c5b","harness-engineering-ai-agent-reliability-2026-zh","Harness engineering: from \"horse tack\" to \"operating system\", the ultimate key to AI agent reliability","2026-03-31T06:42:53.556721+00:00",{"id":116,"slug":117,"title":118,"created_at":119},"96d8e8c8-1edd-475d-9145-b1e7a1b02b65","mcp-explained-from-prompts-to-production-zh","How MCP turns prompts into workflows","2026-04-01T09:24:39.321274+00:00",{"id":121,"slug":122,"title":123,"created_at":124},"f2ca7720-b471-4ce5-9336-2a9ac2a876fd","amazon-bedrock-agents-multi-agent-workflows-zh","Amazon Bedrock Agents move into multi-agent workflows","2026-04-01T09:30:29.945429+00:00"]