[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-longcot-long-horizon-chain-of-thought-benchmark-zh":3,"article-related-longcot-long-horizon-chain-of-thought-benchmark-zh":25,"series-research-2468c20a-c3cf-4004-8981-44934691673a":76},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":11,"views":22,"created_at":23,"published_at":24,"topic_cluster_id":11},"2468c20a-c3cf-4004-8981-44934691673a","longcot-long-horizon-chain-of-thought-benchmark-zh","LongCoT：測長鏈推理，不只看答案","\u003Cp>多數模型評測，最後都在看同一件事：答案對不對。\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.14140\">LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning\u003C\u002Fa> 想問的更難。模型能不能在一長串彼此依賴的推理步驟裡，還維持住思路，不中途走偏？\u003C\u002Fp>\u003Cp>這不是學術上的小題大作。只要是代理人、長流程自動化、或需要逐步規劃的工作，一個前面的小失誤，就可能一路滾成最後的大錯。LongCoT 的核心，就是把這種「長鏈推理」單獨拉出來量測。\u003C\u002Fp>\u003Ch2>這篇論文要解的痛點是什麼\u003C\u002Fh2>\u003Cp>作者先指出一個現實：現在的語言模型越來越常被放進複雜任務裡，不再只是回答單題。這時候，成功不只看某一步做得漂不漂亮，而是看模型能不能記住前文、維持計畫、在很長的推理路徑中不偏航。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776319784084-uldi.png\" alt=\"LongCoT：測長鏈推理，不只看答案\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>但傳統評測常常抓不到這件事。模型可能在短題、單步題、或資訊很集中的問題上表現不錯，卻在需要很多互相牽動步驟的任務裡失手。LongCoT 想補的，就是這個落差。\u003C\u002Fp>\u003Cp>作者的設計思路也很直接：如果每個單一步驟本身都還算可解，那最後失敗時，比較能歸因到「長距離推理能力不夠」，而不是模型連局部子題都不會。換句話說，LongCoT 想分清楚兩件事：模型是「會解題」，還是「能一路把計畫做完」。\u003C\u002Fp>\u003Ch2>LongCoT 到底怎麼設計\u003C\u002Fh2>\u003Cp>LongCoT 是一個可擴充的 benchmark，收錄 2,500 題專家設計的問題。題目涵蓋化學、數學、電腦科學、西洋棋和邏輯。這個範圍很重要，因為它不是只測單一領域的技巧，而是想讓長鏈推理這件事本身，成為主要壓力來源。\u003C\u002Fp>\u003Cp>每題都有短輸入，也都有可驗證的答案。但真正難的地方，不在於 prompt 有多長，而在於解題過程裡，模型得穿過一張由互相依賴步驟組成的圖。這些推理鏈可以拉到數萬到數十萬個 reasoning tokens 
的尺度。也就是說，挑戰不是「看到很多字」，而是「在很長的依賴關係中還能保持正確方向」。\u003C\u002Fp>\u003Cp>對開發者來說，這個設計很有辨識度。它刻意把測試焦點放在 long-horizon chain-of-thought，而不是泛泛地測知識量、語料記憶，或單純的模式配對。這讓它更像一個針對長流程任務的壓力測試。\u003C\u002Fp>\u003Cp>因為每一步本身都不是特別難，所以這份 benchmark 不是在問模型能不能做算術、能不能做基礎推理，而是在問：當你已經做對好幾步之後，你還能不能繼續做對下一步，並且一路維持到最後。\u003C\u002Fp>\u003Ch2>論文實際證明了什麼\u003C\u002Fh2>\u003Cp>摘要給出的結果很直接，也很刺眼：在論文公開時，表現最好的模型在 LongCoT 上都還不到 10% 準確率。GPT 5.2 是 9.8%，Gemini 3 Pro 是 6.1%。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776319784136-w32s.png\" alt=\"LongCoT：測長鏈推理，不只看答案\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>這些數字傳達的訊息很明確：前沿模型和真正的長距離推理之間，還有很大的落差。論文主張的重點不是模型完全不會解題，而是它們沒辦法穩定地把推理維持在長時間、長依賴的結構裡。\u003C\u002Fp>\u003Cp>不過，這份摘要沒有公開完整 benchmark 細節。像是各領域分數、不同題型的表現差異、或更細的消融分析，這裡都看不到。所以我們能確定的是 top-line 結果；但要進一步判斷哪一類任務最難、哪種錯誤模式最常見，還需要看論文全文。\u003C\u002Fp>\u003Cp>即便如此，LongCoT 的價值還是很清楚。它提供了一個比較嚴格的量測框架，讓研究者可以追蹤前沿模型到底是在短題變強，還是真的在長鏈推理上也有進步。\u003C\u002Fp>\u003Ch2>對開發者有什麼影響\u003C\u002Fh2>\u003Cp>如果你在做 agent、copilot，或任何要跨很多步驟完成的工作流，LongCoT 很像一個提醒：所謂「模型很會推理」，其實不是單一能力。模型可以在局部子問題上看起來很穩，卻在長流程任務中失去一致性。\u003C\u002Fp>\u003Cp>這會直接影響產品設計。評測不能只放單輪問答，或幾個短推理題就結束。真正要上線的系統，最好也包含長距離依賴測試，看看模型會不會在中途漂移、漏掉前文約束、或把原本的計畫做歪。\u003C\u002Fp>\u003Cp>也因此，就算是前沿模型，很多情境下還是得靠 orchestration、檢查機制、retrieval、以及逐步驗證來補強。LongCoT 不是在說模型沒用，而是在提醒工程師：如果任務很長，單靠一次性生成答案，風險還是很高。\u003C\u002Fp>\u003Cul>\u003Cli>評估長流程 agent 時，別只看最終答案。\u003C\u002Fli>\u003Cli>短題表現好，不代表長鏈推理也可靠。\u003C\u002Fli>\u003Cli>長任務的失敗，常常不是語法錯，而是思路漂移、依賴漏接、或計畫斷掉。\u003C\u002Fli>\u003Cli>如果能驗證中間步驟，就不要只做 final-answer-only 
評測。\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>這篇研究的限制與未解問題\u003C\u002Fh2>\u003Cp>這篇論文的摘要把 benchmark 的方向講得很清楚，但也留下不少實作層面的問題。像是這些互相依賴步驟到底怎麼建出來、不同領域的難度怎麼平衡、以及 benchmark 對記憶或表面捷徑有多抗性，摘要都沒有交代。\u003C\u002Fp>\u003Cp>另外，雖然摘要有給出兩個模型的 top-line 分數，但沒有更多完整 benchmark 細節。也就是說，單靠這份來源，我們還不能判斷不同模型家族誰進步比較快，也不能知道某些題型是不是特別容易讓模型崩掉。\u003C\u002Fp>\u003Cp>但這不影響 LongCoT 的核心意義。它不是又一個只看對錯的題庫，而是試著把「長距離、深依賴、持續一致性」這件事量化。這對現在越來越多要跑長流程的 AI 系統來說，很實用。\u003C\u002Fp>\u003Cp>如果這個 benchmark 之後能被更廣泛使用，它可能會成為一個很有參考價值的尺。不是拿來問模型「有沒有答對一次」，而是問：當路很長、依賴很深、前面一個小錯就可能污染後面全部結果時，這個模型還能不能把思路守住。\u003C\u002Fp>\u003Cp>對開發者來說，這篇論文最大的提醒，不是某個新演算法，而是新的評估視角。你要問的問題，可能不只是「模型會不會解題」，而是「模型能不能在長鏈推理裡，持續做對下一步」。\u003C\u002Fp>","LongCoT 用 2,500 題測試模型能否在長鏈、互相依賴的推理步驟中保持一致。GPT 5.2 與 Gemini 3 Pro 仍低於 10%。","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.14140",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776319784084-uldi.png","research","zh","9f62add5-cae5-47eb-abd5-2e56d0d5698c",[17,18,19,20,21],"LongCoT","chain-of-thought","long-horizon reasoning","benchmark","LLM evaluation",0,"2026-04-16T06:09:22.856744+00:00","2026-04-16T06:09:22.801+00:00",{"tags":26,"relatedLang":35,"relatedPosts":39},[27,29,31,32,34],{"name":21,"slug":28},"llm-evaluation",{"name":17,"slug":30},"longcot",{"name":20,"slug":20},{"name":19,"slug":33},"long-horizon-reasoning",{"name":18,"slug":18},{"id":15,"slug":36,"title":37,"language":38},"longcot-long-horizon-chain-of-thought-benchmark-en","LongCoT Benchmark: 2,500-Probl. 
Long-Horizon Reasoning","en",[40,46,52,58,64,70],{"id":41,"slug":42,"title":43,"cover_image":44,"image_url":44,"created_at":45,"category":13},"33c9a55c-a8c0-4367-b742-f4567d1e98e3","mathematicians-warn-ai-could-distort-math-zh","數學界警告 AI 會扭曲證明標準","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780504386035-080l.png","2026-06-03T16:32:29.415063+00:00",{"id":47,"slug":48,"title":49,"cover_image":50,"image_url":50,"created_at":51,"category":13},"5c3cb90f-7efd-426f-8c09-32a303f82be9","humanoid-gpt-zero-shot-motion-tracking-zh","Humanoid-GPT：用 GPT 擴大動作追蹤","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780469319284-znpc.png","2026-06-03T06:47:34.463464+00:00",{"id":53,"slug":54,"title":55,"cover_image":56,"image_url":56,"created_at":57,"category":13},"e3a4b0f7-03b3-43c6-ae51-906b337c5c2f","ipt-vlms-hidden-space-reasoning-zh","IPT 讓 VLM 更會想像隱藏空間","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780468394735-1k40.png","2026-06-03T06:32:46.560029+00:00",{"id":59,"slug":60,"title":61,"cover_image":62,"image_url":62,"created_at":63,"category":13},"5fca9fe5-af66-47ce-85f0-0ffe1bee30b9","neuron-selectivity-changes-with-scale-zh","神經元選擇性會隨規模改變","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780467514422-7oss.png","2026-06-03T06:17:44.126547+00:00",{"id":65,"slug":66,"title":67,"cover_image":68,"image_url":68,"created_at":69,"category":13},"9f9c2a61-d058-4c62-bb88-106e683657f0","nasa-landsat-wild-disturbances-rising-zh","NASA 
Landsat：野火與風暴變多","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780448581102-owp0.png","2026-06-03T01:02:37.513233+00:00",{"id":71,"slug":72,"title":73,"cover_image":74,"image_url":74,"created_at":75,"category":13},"3479bdee-21fb-4fda-9572-9394caba01b0","adacodec-predictive-visual-code-video-mllms-zh","AdaCodec 用預測碼壓縮影片 token","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780381988591-z2sp.png","2026-06-02T06:32:28.249023+00:00",[77,82,87,92,97,102,107,112,117,122],{"id":78,"slug":79,"title":80,"created_at":81},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind 押注 2026 連續學習 AI","2026-03-26T08:16:02.367355+00:00",{"id":83,"slug":84,"title":85,"created_at":86},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","每週 ML 論文清單，為何紅到 GitHub","2026-03-27T01:11:39.284175+00:00",{"id":88,"slug":89,"title":90,"created_at":91},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 研討會投稿時程整理","2026-03-27T01:51:53.874432+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4：神經網路量化的聰明選擇","2026-03-31T06:00:36.990273+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"53a0dc54-0371-4e40-8d5e-74e94a73840c","geometry-aware-similarity-metrics-for-neural-representations-zh","超越距離測量：用微分幾何重新理解神經網路","2026-03-31T06:01:01.241968+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"fee7d472-a775-4b1d-bbc2-1e8bca1bbf8b","on-the-fly-repulsion-in-the-contextual-space-for-rich-divers-zh","讓AI繪圖更有創意：用排斥力提升生成多樣性","2026-03-31T06:01:25.439673+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"a9901203-d69b-447b-8854-15d14eab32b4","vision-aided-beam-prediction-cnn-eca-zh","影像輔助波束預測升級 
CNN","2026-04-01T10:00:25.8073+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"b55e7dd4-0a24-4b3d-804d-b0309a03f498","triple-band-fss-mimo-antenna-sub-6-ghz-zh","三頻 FSS MIMO 天線瞄準 sub-6 GHz","2026-04-01T13:18:36.857305+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"f68290bd-e7f3-4b30-ba22-dcd4e0130a66","openclaw-1299-repos-eight-weeks-analysis-zh","OpenClaw 1299 個 Repo 的資料解讀","2026-04-02T05:03:45.208411+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"ed9f80eb-eb02-4d35-8ad4-0ddf428751dd","beam-coherence-aware-combining-mmwave-mimo-zh","毫米波 MIMO 的雙階合併法","2026-04-02T05:27:26.897188+00:00"]