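[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-llms-procedural-execution-diagnostic-study-zh":3,"article-related-llms-procedural-execution-diagnostic-study-zh":25,"series-research-140a1bc8-8432-4950-9ed7-f28ea3060068":78},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":11,"views":22,"created_at":23,"published_at":24,"topic_cluster_id":11},"140a1bc8-8432-4950-9ed7-f28ea3060068","llms-procedural-execution-diagnostic-study-zh","LLMs Can Do the Math, but They Don't Always Follow the Steps","\u003Cp data-speakable=\"summary\">This study tests whether \u003Ca href=\"\u002Ftag\u002Fllm\">LLMs\u003C\u002Fa> can execute instructions step by step, rather than only checking whether the final answer is correct.\u003C\u002Fp>\u003Cp>Many LLM evaluations focus on the final answer. That is convenient, but it can hide a more basic problem: a \u003Ca href=\"\u002Fnews\u002Fhycop-modular-interpretable-pde-surrogates-zh\">model\u003C\u002Fa> may appear to solve the problem without actually following the procedure. \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.00817\">When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models\u003C\u002Fa> targets exactly this gap, checking whether models can run a simple arithmetic procedure to completion as written.\u003C\u002Fp>\u003Cp>What the paper really cares about is not “can the model calculate” but “did the model do what it was told.” The distinction matters. Whenever a workflow depends on fixed steps, state updates, or passing intermediate values along, a model that skips steps, wraps up early, or adds operations of its own can produce a final answer that is silently wrong.\u003C\u002Fp>\u003Ch2>What gap this paper fills\u003C\u002Fh2>\u003Cp>The authors target a blind spot in common benchmarks. A correct final answer only proves the result is right; it does not prove the procedure was executed faithfully. For developers the difference is practical, because many LLM applications are procedural tasks to begin with: parse the input, update variables, apply rules in order, then output the result.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777875651857-35bu.png\" alt=\"LLMs Can Do the Math, but They Don't Always Follow the Steps\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>In this setting, a model that occasionally gets the right answer through a shortcut is not thereby reliable. It may behave normally on short procedures, then start to drift once the steps get longer, intermediate values must be retained, or the output has to reflect the full order of operations. This study sets out to quantify that risk.\u003C\u002Fp>\u003Cp>The paper uses a diagnostic 
benchmark. The task itself is deliberately simple: the model receives a step-by-step arithmetic algorithm plus two numeric inputs and must return the computed result. The difficulty lies not in the math but in the growing procedure length and the dependencies between steps.\u003C\u002Fp>\u003Ch2>How the method works, in plain terms\u003C\u002Fh2>\u003Cp>The benchmark is designed to separate “faithfully executing instructions” from “guessing the right answer.” It does not aim to measure broad reasoning ability; it checks whether a model can run a specified algorithm step by step. That makes it easier for researchers to tell whether the model is tracking the procedure or taking a shortcut to guess the result.\u003C\u002Fp>\u003Cp>Two design choices are key. First, the arithmetic is simple, so the test is not about hard computation. Second, the procedures get progressively longer, and some steps depend on intermediate values computed earlier. Together these form a well-controlled stress test: as the procedure lengthens, can the model keep a consistent execution trace?\u003C\u002Fp>\u003Cp>The study evaluates 14 models across 55 datasets. The original abstract gives no further benchmark details, so there are no additional numbers to build on. Even so, this setup is enough to show a trend: the longer the procedure, the more easily the model loses fidelity.\u003C\u002Fp>\u003Cul>\u003Cli>Input: a step-by-step arithmetic algorithm and two numeric values\u003C\u002Fli>\u003Cli>Task: return the final computed result\u003C\u002Fli>\u003Cli>Sources of stress: longer procedures and intermediate values that later steps depend on\u003C\u002Fli>\u003Cli>Scale: 14 models, 55 datasets\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What the results actually say\u003C\u002Fh2>\u003Cp>The most direct result is that first-answer accuracy drops sharply as procedures get longer. Across 14 models and 55 datasets, average first-answer accuracy falls from 61% on 5-step procedures to 20% on 95-step procedures. For a task whose arithmetic is not hard, that is a large gap.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777875644936-ooat.png\" alt=\"LLMs Can Do the Math, but They Don't Always Follow the Steps\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>This suggests the problem is not simply that the questions are too hard. It looks more like the model slips while maintaining its execution trace. In other words, it seems fine on short procedures, but reliability drops noticeably once there are more steps and deeper dependencies.\u003C\u002Fp>\u003Cp>The authors also analyze generation-level failure modes, which makes the results more vivid than a single accuracy figure. The paper reports several recurring patterns: missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These are not minor flaws; they are signs that the model has clearly departed from the original procedure.\u003C\u002Fp>\u003Cp>The abstract provides no finer-grained benchmark breakdown and no fuller tables of numbers. In other words, this is a diagnostic study, not a comprehensive evaluation that lays out system performance across the board.\u003C\u002Fp>\u003Ch2>What this means for developers\u003C\u002Fh2>\u003Cp>If you put an LLM into a pipeline that requires a precise order of steps, this study is a warning. A model may look strong on reasoning benchmarks, yet its performance is not necessarily stable once it is asked to execute a procedure faithfully. That covers structured data transformation, rule-based workflows, multi-step calculations, and any prompt that requires keeping intermediate state.\u003C\u002Fp>\u003Cp>For engineering teams, the point is not to avoid 
LLMs, but to stop conflating “the answer looks right” with “the procedure was actually followed.” Checking only the final output makes it easy to miss early termination, skipped steps, or operations the model invented. Once such errors enter an automated pipeline, the cost may be significant.\u003C\u002Fp>\u003Cp>The study has its limits. It tests arithmetic procedures, so it is a controlled diagnostic setting, not a complete real-world workflow. The abstract does not claim broader product-deployment results, nor does it provide benchmark details beyond the aggregate accuracy and failure types above. It is best read as evidence of one specific weakness, not as a final verdict on LLM reasoning ability.\u003C\u002Fp>\u003Cp>But the core message is clear: a correct final answer does not mean the procedure was executed faithfully. If your application cares about procedural consistency, you cannot rely on a single generation alone. This study offers a very direct reason to add more guardrails.\u003C\u002Fp>\u003Cp>In practice, the most worthwhile thing to do is test step fidelity directly. Whenever a prompt or workflow involves an order of steps, do not assume the model followed it unless you have actually verified it. The study shows that reliability falls off quickly as procedures lengthen, even when the underlying task is simple enough to seem safe.\u003C\u002Fp>\u003Cp>In other words, the question with LLMs is not only whether they get the answer right, but whether they honestly do as instructed. For anyone wiring one into a product, the paper's reminder is practical: if the procedure cannot go wrong, a single generated result is usually not enough.\u003C\u002Fp>","This diagnostic study directly tests whether LLMs can execute a procedure step by step. The results show that procedural fidelity drops markedly as the steps lengthen, even though the arithmetic itself is not hard.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.00817",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777875651857-35bu.png","research","zh","f414aa1a-27e8-45d9-b407-d542121915d2",[17,18,19,20,21],"LLM","procedural execution","instruction following","diagnostic benchmark","step fidelity",5,"2026-05-04T06:20:26.283075+00:00","2026-05-04T06:20:26.192+00:00",{"tags":26,"relatedLang":37,"relatedPosts":41},[27,29,31,33,35],{"name":21,"slug":28},"step-fidelity",{"name":17,"slug":30},"llm",{"name":20,"slug":32},"diagnostic-benchmark",{"name":18,"slug":34},"procedural-execution",{"name":19,"slug":36},"instruction-following",{"id":15,"slug":38,"title":39,"language":40},"llms-procedural-execution-diagnostic-study-en","When LLMs Stop Following Procedural 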
Steps","en",[42,48,54,60,66,72],{"id":43,"slug":44,"title":45,"cover_image":46,"image_url":46,"created_at":47,"category":13},"a4cf24e5-b958-4f91-bdca-2f1a57e81aef","why-benchmark-leaderboards-are-wrong-about-model-logic-zh","Why Benchmark Leaderboards Get Model Logic Wrong","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780673571153-x7yi.png","2026-06-05T15:32:23.043639+00:00",{"id":49,"slug":50,"title":51,"cover_image":52,"image_url":52,"created_at":53,"category":13},"4a829d2a-24a3-42dd-8be4-49e5ab35435a","why-prompt-engineering-is-wrong-about-2026-zh","Why Prompt Engineering Got It Wrong in 2026","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780661884287-ow45.png","2026-06-05T12:17:19.813402+00:00",{"id":55,"slug":56,"title":57,"cover_image":58,"image_url":58,"created_at":59,"category":13},"52a37532-880d-4261-8f62-2f254d6c592d","spire-evidence-grounded-ai-humanities-zh","SPIRE Makes Humanities AI 
More Evidence-Grounded","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780647483844-bcuj.png","2026-06-05T08:17:29.603104+00:00",{"id":61,"slug":62,"title":63,"cover_image":64,"image_url":64,"created_at":65,"category":13},"b38c56a6-e7f3-45fb-b100-d37e7b3ed417","reinforcement-aware-distillation-llm-reasoning-zh","Reinforcement-Aware Distillation Aims to Learn Reasoning Too","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780646589500-0me6.png","2026-06-05T08:02:33.908932+00:00",{"id":67,"slug":68,"title":69,"cover_image":70,"image_url":70,"created_at":71,"category":13},"60f7d702-20a7-4cec-9a80-185f072c8dfe","next-token-models-plan-ahead-zh","Next-Token Models Actually Plan a Step Ahead","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780645684780-roea.png","2026-06-05T07:47:34.35089+00:00",{"id":73,"slug":74,"title":75,"cover_image":76,"image_url":76,"created_at":77,"category":13},"7ec803f7-2658-4c9e-baa6-2b8528407d7f","google-deepmind-co-scientist-researchers-zh","Google DeepMind Opens Up Co-Scientist","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780636679231-q694.png","2026-06-05T05:17:30.68789+00:00",[79,84,89,94,99,104,109,114,119,124],{"id":80,"slug":81,"title":82,"created_at":83},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind Bets on Continuous-Learning AI for 2026","2026-03-26T08:16:02.367355+00:00",{"id":85,"slug":86,"title":87,"created_at":88},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","Why the Weekly ML Papers List Took Off on GitHub","2026-03-27T01:11:39.284175+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 
Conference Submission Timeline Roundup","2026-03-27T01:51:53.874432+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4: A Smart Choice for Neural Network Quantization","2026-03-31T06:00:36.990273+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"53a0dc54-0371-4e40-8d5e-74e94a73840c","geometry-aware-similarity-metrics-for-neural-representations-zh","Beyond Distance Measures: Rethinking Neural Networks with Differential Geometry","2026-03-31T06:01:01.241968+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"fee7d472-a775-4b1d-bbc2-1e8bca1bbf8b","on-the-fly-repulsion-in-the-contextual-space-for-rich-divers-zh","Making AI Image Generation More Creative: Boosting Diversity with Repulsion","2026-03-31T06:01:25.439673+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"a9901203-d69b-447b-8854-15d14eab32b4","vision-aided-beam-prediction-cnn-eca-zh","Vision-Aided Beam Prediction Gets a CNN Upgrade","2026-04-01T10:00:25.8073+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"b55e7dd4-0a24-4b3d-804d-b0309a03f498","triple-band-fss-mimo-antenna-sub-6-ghz-zh","Tri-Band FSS MIMO Antenna Targets Sub-6 GHz","2026-04-01T13:18:36.857305+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"f68290bd-e7f3-4b30-ba22-dcd4e0130a66","openclaw-1299-repos-eight-weeks-analysis-zh","Reading the Data from OpenClaw's 1,299 Repos","2026-04-02T05:03:45.208411+00:00",{"id":125,"slug":126,"title":127,"created_at":128},"ed9f80eb-eb02-4d35-8ad4-0ddf428751dd","beam-coherence-aware-combining-mmwave-mimo-zh","Two-Stage Combining for mmWave MIMO","2026-04-02T05:27:26.897188+00:00"]