[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-eagle3-real-speedup-kimi-k25-mi325x-en":3,"article-related-eagle3-real-speedup-kimi-k25-mi325x-en":30,"series-research-6dcd4b03-8352-43b0-969a-c030e48afb3c":73},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"6dcd4b03-8352-43b0-969a-c030e48afb3c","eagle3-real-speedup-kimi-k25-mi325x-en","EAGLE3 is the real speedup for Kimi-K2.5 on MI325X","\u003Cp data-speakable=\"summary\">EAGLE3 is the main reason Kimi-K2.5-W4A8 decodes faster on AMD MI325X, not kernel tweaks.\u003C\u002Fp>\u003Cp>Speculative decoding is the right fix for Kimi-K2.5-W4A8 on AMD Instinct MI325X, and EAGLE3 is the part that actually moves the needle. The ROCm \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> shows that on 8× MI325X at concurrency 40, adding EAGLE3 cuts TPOT median from 42.73 ms to 27.79 ms and pushes throughput from 672.30 tok\u002Fs to 872.58 tok\u002Fs before any extra tuning. The later kernel patches add only a small increment on top. That is the story: once decode is blocked by sequential \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> generation, the fastest path is to verify more than one token per pass, not to polish the same one-token loop harder.\u003C\u002Fp>\u003Ch2>EAGLE3 attacks the real bottleneck\u003C\u002Fh2>\u003Cp>Vanilla autoregressive decode is inherently serial. Each new token depends on the last, so even a well-tuned W4A8 path still pays for one full forward pass per generated token, plus KV-cache reads, routing, and sampling. The blog is blunt about the ceiling: on a large \u003Ca href=\"\u002Ftag\u002Fmoe\">MoE\u003C\u002Fa> model like Kimi-K2.5, that sequential structure creates a hard floor on TPOT that pure compute tuning cannot break.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782640973161-00wl.png\" alt=\"EAGLE3 is the real speedup for Kimi-K2.5 on MI325X\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>EAGLE3 changes the unit of work. Instead of asking the target model to generate one token and stop, the draft model proposes a short chain and the target verifies the whole chain in one pass. In the blog’s configuration, that means three speculative steps and four draft tokens, with accept length near the ceiling of 3.93 out of 4.0. That is not a marginal trick. It converts the decode loop from token-by-token execution into a batched verification problem, which is exactly where the hardware has room to win.\u003C\u002Fp>\u003Ch2>The gains are already large before any tuning\u003C\u002Fh2>\u003Cp>The clearest evidence is the baseline comparison. On 8× MI325X at concurrency 40, W4A8 without EAGLE3 posts 42.73 ms TPOT median and 672.30 tok\u002Fs output throughput. With EAGLE3 baseline enabled, those numbers improve to 27.79 ms and 872.58 tok\u002Fs. ITL median drops from 27.98 ms to 11.75 ms, while TTFT stays essentially flat because speculative decoding does not change prefill. That pattern matters: the performance win is concentrated exactly where users feel decode latency.\u003C\u002Fp>\u003Cp>Accuracy also matters, and the blog reports no measurable regression. That is the key reason this should be read as a production technique, not a lab demo. If a speedup forces model behavior to drift, it is a tradeoff. If it preserves accuracy while lifting throughput by nearly 30 percent, it is an architecture choice. The draft model is small, the target model is unchanged, and the verify step guarantees correctness by accepting only matching prefixes. The result is faster output without sacrificing the base model’s behavior.\u003C\u002Fp>\u003Ch2>Kernel tuning helps, but it is not the headline\u003C\u002Fh2>\u003Cp>The blog adds three shape-aware kernel changes for the EAGLE3 verify path: a Stage2 MoE tile_k increase to 256, a Stage1 scheduler-hint gate, and a bf16 round-to-zero conversion for FMHA. These are sensible adjustments for the new M=4 verify shape, and they do improve the stack a bit more. But the authors quantify the effect as only about 1 to 2 percent TPOT and 2 to 3 percent throughput on top of EAGLE3.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782640968666-hc92.png\" alt=\"EAGLE3 is the real speedup for Kimi-K2.5 on MI325X\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That is the right priority order. The kernel patches are refinements to a better algorithm, not a substitute for one. The blog even explains why: on this 304-CU \u003Ca href=\"\u002Ftag\u002Fgpu\">GPU\u003C\u002Fa>, the touched MoE and FMHA paths are not the dominant bottlenecks once speculative decoding is in place. In other words, the hardware is no longer starved by math efficiency alone. It is being constrained by the decode structure itself, and EAGLE3 is the intervention that changes that structure.\u003C\u002Fp>\u003Ch2>The counter-argument\u003C\u002Fh2>\u003Cp>The strongest case against this view is that speculative decoding adds complexity, and complexity has operational cost. You need a matching draft model, extra launch flags, more moving parts in the serving stack, and careful tuning of draft depth and width. The blog also admits that poor draft quality can waste compute, and that some tree shapes inflate verify cost enough to erase gains. For teams that value simplicity above all else, a plain W4A8 decode path is easier to reason about and easier to support.\u003C\u002Fp>\u003Cp>There is also a valid portability objection. The EAGLE3 draft is trained against a specific target and does not transfer to unrelated models. That limits reuse across model families, which makes speculative decoding look less universal than a kernel optimization that can be applied more broadly. If your roadmap depends on many targets, or if you cannot keep draft and target pairs aligned, the maintenance burden is real.\u003C\u002Fp>\u003Cp>That objection does not beat the data. The blog shows that the draft-model overhead is small, the draft checkpoint is only about 6 GB, and the speedup is large enough to justify the extra serving complexity for the exact target model it was trained for. This is not a general-purpose trick for every stack. It is a high-leverage technique for a specific bottleneck, and that is enough. When decode is sequential and bandwidth-bound, changing the decode geometry is more valuable than polishing the same loop.\u003C\u002Fp>\u003Ch2>What to do with this\u003C\u002Fh2>\u003Cp>If you are an engineer running Kimi-K2.5-class workloads, prioritize speculative decoding first and kernel tuning second. Start with the EAGLE3 draft-target pair, validate accept length and throughput under your own concurrency, then add the small shape-aware kernel patches only after you confirm the verify path is the remaining bottleneck. If you are a PM or founder, treat this as a reminder that \u003Ca href=\"\u002Fnews\u002Fdatabricks-external-model-endpoints-governance-en\">model serving\u003C\u002Fa> performance is often won by changing the algorithmic unit of work, not by chasing another percent from the same kernel. The practical rule is simple: when decode dominates, verify more tokens per pass.\u003C\u002Fp>","EAGLE3 is the main reason Kimi-K2.5-W4A8 decodes faster on AMD MI325X, not kernel tweaks.","rocm.blogs.amd.com","https:\u002F\u002Frocm.blogs.amd.com\u002Fartificial-intelligence\u002Fkimi-k2.5-speculative\u002FREADME.html",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782640973161-00wl.png","research","en","37acb4f1-36aa-4cbd-8c2f-0733c39a074f",[17,18,19,20,21],"EAGLE3","Kimi-K2.5-W4A8","AMD Instinct MI325X","ROCm","speculative decoding",[23,24,25],"EAGLE3 is the main source of the speedup; kernel tuning is a small add-on.","On 8× MI325X, EAGLE3 cuts TPOT from 42.73 ms to 27.79 ms and raises throughput from 672.30 to 872.58 tok\u002Fs.","The approach preserves accuracy and is best used when decode latency, not prefill, is the bottleneck.",0,"2026-06-28T10:02:26.706213+00:00","2026-06-28T10:02:26.691+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":32,"relatedPosts":36},[],{"id":15,"slug":33,"title":34,"language":35},"eagle3-real-speedup-kimi-k25-mi325x-zh","EAGLE3 才是 Kimi-K2.5 在 MI325X 上真正的加速器","zh",[37,43,49,55,61,67],{"id":38,"slug":39,"title":40,"cover_image":41,"image_url":41,"created_at":42,"category":13},"cc337b93-2825-4fcc-a5af-77d41470616c","cuda-toolkit-13-3-fixes-nested-divergence-bug-en","CUDA Toolkit 13.3 fixes a nested-divergence bug","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782676985164-f2uv.png","2026-06-28T20:02:39.771125+00:00",{"id":44,"slug":45,"title":46,"cover_image":47,"image_url":47,"created_at":48,"category":13},"772c0694-0e86-465d-b676-012a2240eaf7","llm-fine-tuning-turns-generic-models-into-domain-tools-en","LLM fine-tuning turns generic models into domain tools","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782569906260-hdga.png","2026-06-27T14:17:57.190952+00:00",{"id":50,"slug":51,"title":52,"cover_image":53,"image_url":53,"created_at":54,"category":13},"25aef6a0-efaa-459c-bca4-77f0d462b792","rust-learners-need-permission-to-clone-first-en","Rust learners need permission to clone first, optimize later","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782552763890-fem3.png","2026-06-27T09:32:21.788692+00:00",{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":13},"567f2a82-494e-493a-9d43-00dfbc8a7bfd","mistral-ocr-4-document-ai-structure-en","Mistral OCR 4 brings structure to document AI","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782468180808-ulcg.png","2026-06-26T10:02:37.910976+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":13},"de74bbd4-e3b6-407a-998b-b38c4170b586","autoregressive-boltzmann-generators-ditch-flows-en","Autoregressive Boltzmann Generators ditch flows","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782455575877-62qe.png","2026-06-26T06:32:30.585573+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":13},"c05899fc-dd62-4fad-a249-9748376c1ef2","river-llm-reinforcement-learning-without-answers-en","RiVER trains LLMs without ground-truth answers","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782454678234-6mk1.png","2026-06-26T06:17:27.491779+00:00",[74,79,84,89,94,99,104,109,114,119],{"id":75,"slug":76,"title":77,"created_at":78},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":80,"slug":81,"title":82,"created_at":83},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":85,"slug":86,"title":87,"created_at":88},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]