[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-eagle3-real-speedup-kimi-k25-mi325x-zh":3,"article-related-eagle3-real-speedup-kimi-k25-mi325x-zh":31,"series-research-37acb4f1-36aa-4cbd-8c2f-0733c39a074f":74},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":23,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"37acb4f1-36aa-4cbd-8c2f-0733c39a074f","eagle3-real-speedup-kimi-k25-mi325x-zh","EAGLE3 才是 Kimi-K2.5 在 MI325X 上真正的加速器","\u003Cp data-speakable=\"summary\">Kimi-K2.5-W4A8 在 AMD MI325X 上變快，主因是 EAGLE3 的 speculative decoding，不是 kernel 微調。\u003C\u002Fp>\u003Cp>我認為，Kimi-K2.5-W4A8 在 AMD MI325X 上的主要加速來源是 EAGLE3，而不是 kernel tweaks。ROCm \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> 已經把差異講得很清楚：在 8× MI325X、concurrency 40 的條件下，加入 EAGLE3 後，TPOT median 從 42.73 ms 降到 27.79 ms，吞吐從 672.30 tok\u002Fs 升到 872.58 tok\u002Fs；後面的 kernel patches 只是再補一小段。這代表瓶頸不在「算子還能不能再磨一點」，而在 decode 本身的序列化結構。\u003C\u002Fp>\u003Ch2>第一個論點\u003C\u002Fh2>\u003Cp>先看最核心的事實：自回歸 decode 本來就是一個 \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> 接一個 token 地走。對 Kimi-K2.5 這類 \u003Ca href=\"\u002Ftag\u002Fmoe\">MoE\u003C\u002Fa> 模型來說，即使 W4A8 已經把權重與算力路徑壓得很緊，每生成一個 token 仍要付出一次完整 forward、\u003Ca href=\"\u002Ftag\u002Fkv-cache\">KV cache\u003C\u002Fa> 存取、路由與 sampling 的成本。這種成本不是靠再調一點 kernel tile 就能消掉的，因為它來自流程本身。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782640968852-4e46.png\" alt=\"EAGLE3 才是 Kimi-K2.5 在 MI325X 上真正的加速器\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>EAGLE3 改的是工作單位，不是單一算子。它讓 draft model 一次提出短鏈，target model 再在一個 pass 裡驗證整段序列。文中的設定是三個 speculative steps、四個 draft tokens，平均接受長度接近上限 3.93 \u002F 
4.0。這個數字很關鍵，因為它表示大部分 draft token 都能被接受，verify 的成本被攤平到多個 token 上，decode loop 也從逐 token 執行變成批次驗證。\u003C\u002Fp>\u003Ch2>第二個論點\u003C\u002Fh2>\u003Cp>最有說服力的證據，是 EAGLE3 在還沒加任何額外調整前，就已經帶來明顯提升。8× MI325X、concurrency 40 的 baseline 下，W4A8 without EAGLE3 的 TPOT median 是 42.73 ms、output throughput 是 672.30 tok\u002Fs；啟用 EAGLE3 baseline 後，TPOT median 降到 27.79 ms，throughput 升到 872.58 tok\u002Fs，也就是 TPOT 約降 35%、throughput 約升 30%。這不是邊際改善，而是足以改變服務體感的幅度。\u003C\u002Fp>\u003Cp>更重要的是，收益集中在使用者真正感受到的 decode 區段。ITL median 從 27.98 ms 降到 11.75 ms，而 TTFT 幾乎沒變，這正符合 speculative decoding 的特性：它不會改變 prefill，但會把輸出階段的等待時間壓下來。換句話說，如果你的痛點是「第一個 token 出來之後還是太慢」，那 EAGLE3 正好對症下藥；如果你只盯著 prefill，就會誤判這個方案的價值。\u003C\u002Fp>\u003Ch2>反方可能怎麼說\u003C\u002Fh2>\u003Cp>最強的反方論點不是說 EAGLE3 沒用，而是說它太複雜。你需要 draft model、需要額外的 serving 參數、需要調整 draft depth 和 width，還要管理 target 與 draft 的對應關係。對重視穩定與可維護性的團隊來說，單純的 W4A8 decode 路徑更容易部署，也更容易排障。若 draft 品質不好，還可能白白多做計算，反而吃掉收益。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782640966658-bpo2.png\" alt=\"EAGLE3 才是 Kimi-K2.5 在 MI325X 上真正的加速器\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>另一個合理批評是可移植性有限。EAGLE3 的 draft 是針對特定 target 訓練的，不是拿來就能套到所有模型上。若你的產品線有很多不同模型，或你無法長期維持 draft-target 配對，這種方法的管理成本確實高於一般 kernel 優化。從平台視角看，這不是一個「到處都能複製」的技巧。\u003C\u002Fp>\u003Cp>但這些限制並不能推翻結論，因為本文的問題本來就不是「哪個技巧最通用」，而是「Kimi-K2.5 在 MI325X 上的主加速來自哪裡」。數據已經說明，EAGLE3 單獨就能把 TPOT 壓低、把 throughput 拉高，而且沒有明顯的 accuracy regression。kernel patches 只再帶來約 1% 到 2% 的 TPOT 改善與 2% 到 3% 的 throughput 增益，這表示它們是錦上添花，不是主因。當 decode 的瓶頸是序列化，改變 decode 幾何才是正解。\u003C\u002Fp>\u003Ch2>你能做什麼\u003C\u002Fh2>\u003Cp>如果你是工程師，先把 EAGLE3 當成第一優先級，確認 draft-target 配對、accept length、concurrency 與實際 throughput，再去做 Stage2 MoE tile、Stage1 
scheduler-hint、bf16 round-to-zero 這類小幅優化；如果你是 PM 或創辦人，請把這件事當成一個訊號：模型服務性能常常不是靠「再磨 kernel」贏的，而是靠改變算法的工作單位。當 decode 佔主導時，優先考慮一次驗證多個 token，而不是只把單 token 路徑磨得更漂亮。\u003C\u002Fp>","我認為 Kimi-K2.5-W4A8 在 AMD MI325X 上變快，主因是 EAGLE3 的 speculative decoding，不是 kernel 微調；真正改變的是解碼幾何，而不是單一算子效率。","rocm.blogs.amd.com","https:\u002F\u002Frocm.blogs.amd.com\u002Fartificial-intelligence\u002Fkimi-k2.5-speculative\u002FREADME.html",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782640968852-4e46.png","research","zh","6dcd4b03-8352-43b0-969a-c030e48afb3c",[17,18,19,20,21,22],"EAGLE3","Kimi-K2.5","AMD MI325X","speculative decoding","W4A8","kernel optimization",[24,25,26],"EAGLE3 才是 Kimi-K2.5 在 MI325X 上的主加速來源。","真正的收益來自把逐 token decode 改成批次驗證。","Kernel 微調有幫助，但只是建立在 EAGLE3 之上的小幅增益。",0,"2026-06-28T10:02:26.13691+00:00","2026-06-28T10:02:26.123+00:00","0c35a120-52fc-41fc-afa3-d404eb934158",{"tags":32,"relatedLang":33,"relatedPosts":37},[],{"id":15,"slug":34,"title":35,"language":36},"eagle3-real-speedup-kimi-k25-mi325x-en","EAGLE3 is the real speedup for Kimi-K2.5 on MI325X","en",[38,44,50,56,62,68],{"id":39,"slug":40,"title":41,"cover_image":42,"image_url":42,"created_at":43,"category":13},"5431a65e-76da-4a2a-96c5-73a6a7635903","cuda-toolkit-13-3-fixes-nested-divergence-bug-zh","CUDA 13.3 修掉巢狀分歧編譯錯誤","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782676982948-afr9.png","2026-06-28T20:02:39.341994+00:00",{"id":45,"slug":46,"title":47,"cover_image":48,"image_url":48,"created_at":49,"category":13},"7c4c30b3-b2a8-48a7-b2ea-96c40c16ae19","llm-fine-tuning-turns-generic-models-into-domain-tools-zh","LLM 
微調把通用模型變專用工具","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782569910494-nhtn.png","2026-06-27T14:17:56.614064+00:00",{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":13},"cd8b1802-2094-4f5c-89a9-230680124777","mistral-ocr-4-document-ai-structure-zh","Mistral OCR 4 把文件變結構化資料","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782468184906-6p2v.png","2026-06-26T10:02:37.422252+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":13},"a90ab5b6-f647-4cef-85af-35ff7bb21a93","autoregressive-boltzmann-generators-ditch-flows-zh","ArBG 改用自回歸做分子採樣","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782455577323-vrvt.png","2026-06-26T06:32:30.056363+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":13},"93b19c63-dbfd-4277-92b5-b5a60946fd65","river-llm-reinforcement-learning-without-answers-zh","RiVER 讓 LLM 不靠標準答案也能學","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782454671897-i8l3.png","2026-06-26T06:17:26.979468+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":13},"cd38b72e-b309-493d-b36f-684745ff5f7e","danceopd-on-policy-generative-field-distillation-zh","DanceOPD：把修圖技能蒸餾進同一模型","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782453784592-x1gk.png","2026-06-26T06:02:33.123618+00:00",[75,80,85,90,95,100,105,110,115,120],{"id":76,"slug":77,"title":78,"created_at":79},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind 押注 2026 連續學習 
AI","2026-03-26T08:16:02.367355+00:00",{"id":81,"slug":82,"title":83,"created_at":84},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","每週 ML 論文清單，為何紅到 GitHub","2026-03-27T01:11:39.284175+00:00",{"id":86,"slug":87,"title":88,"created_at":89},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 研討會投稿時程整理","2026-03-27T01:51:53.874432+00:00",{"id":91,"slug":92,"title":93,"created_at":94},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4：神經網路量化的聰明選擇","2026-03-31T06:00:36.990273+00:00",{"id":96,"slug":97,"title":98,"created_at":99},"53a0dc54-0371-4e40-8d5e-74e94a73840c","geometry-aware-similarity-metrics-for-neural-representations-zh","超越距離測量：用微分幾何重新理解神經網路","2026-03-31T06:01:01.241968+00:00",{"id":101,"slug":102,"title":103,"created_at":104},"fee7d472-a775-4b1d-bbc2-1e8bca1bbf8b","on-the-fly-repulsion-in-the-contextual-space-for-rich-divers-zh","讓AI繪圖更有創意：用排斥力提升生成多樣性","2026-03-31T06:01:25.439673+00:00",{"id":106,"slug":107,"title":108,"created_at":109},"a9901203-d69b-447b-8854-15d14eab32b4","vision-aided-beam-prediction-cnn-eca-zh","影像輔助波束預測升級 CNN","2026-04-01T10:00:25.8073+00:00",{"id":111,"slug":112,"title":113,"created_at":114},"b55e7dd4-0a24-4b3d-804d-b0309a03f498","triple-band-fss-mimo-antenna-sub-6-ghz-zh","三頻 FSS MIMO 天線瞄準 sub-6 GHz","2026-04-01T13:18:36.857305+00:00",{"id":116,"slug":117,"title":118,"created_at":119},"f68290bd-e7f3-4b30-ba22-dcd4e0130a66","openclaw-1299-repos-eight-weeks-analysis-zh","OpenClaw 1299 個 Repo 的資料解讀","2026-04-02T05:03:45.208411+00:00",{"id":121,"slug":122,"title":123,"created_at":124},"ed9f80eb-eb02-4d35-8ad4-0ddf428751dd","beam-coherence-aware-combining-mmwave-mimo-zh","毫米波 MIMO 的雙階合併法","2026-04-02T05:27:26.897188+00:00"]