[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-turboquant-amd-gpus-kv-cache-latency-en":3,"article-related-turboquant-amd-gpus-kv-cache-latency-en":35,"series-industry-093f7c46-be7c-4b62-be00-73808a61e0a0":88},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":27,"views":31,"created_at":32,"published_at":33,"topic_cluster_id":34},"093f7c46-be7c-4b62-be00-73808a61e0a0","turboquant-amd-gpus-kv-cache-latency-en","TurboQuant on AMD GPUs cuts KV-cache latency","\u003Cp data-speakable=\"summary\">\u003Ca href=\"\u002Ftag\u002Fturboquant\">TurboQuant\u003C\u002Fa> on AMD GPUs lowers KV-cache pressure and speeds up long-context \u003Ca href=\"\u002Ftag\u002Fllm\">LLM\u003C\u002Fa> \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa>.\u003C\u002Fp>\n\u003Cp>TurboQuant is most useful when \u003Ca href=\"\u002Ftag\u002Fkv-cache\">KV cache\u003C\u002Fa>, not compute, limits serving, and this ROCm write-up shows how AMD GPUs can close the gap with optimized kernels. The post reports up to 3.6x end-to-end speedup over the open-source \u003Ca href=\"\u002Ftag\u002Fvllm\">vLLM\u003C\u002Fa> TurboQuant baseline.\u003C\u002Fp>\n\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Item\u003C\u002Fth>\u003Cth>What it changes\u003C\u002Fth>\u003Cth>Reported result\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>TQ4\u002F4\u003C\u002Ftd>\u003Ctd>4-bit K and 4-bit V compression\u003C\u002Ftd>\u003Ctd>Recommended default balance\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Agentic workload test\u003C\u002Ftd>\u003Ctd>100 conversations, 32 concurrency, ~25K prefixes\u003C\u002Ftd>\u003Ctd>TTFT 13.9 s to 0.89 s\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Cache hit rate\u003C\u002Ftd>\u003Ctd>FP8 vs TQ4\u002F4\u003C\u002Ftd>\u003Ctd>5.3% to 67.7%\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>End-to-end speedup\u003C\u002Ftd>\u003Ctd>Optimized ROCm kernels vs open-source baseline\u003C\u002Ftd>\u003Ctd>Up to 3.6x\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\n\u003Ch2>1. Production TurboQuant on ROCm\u003C\u002Fh2>\n\u003Cp>The core story is not just that TurboQuant compresses KV cache. It is that the AMD ROCm implementation makes the algorithm practical for serving, where kernel quality, memory behavior, and latency matter as much as accuracy.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781299067778-3pzd.png\" alt=\"TurboQuant on AMD GPUs cuts KV-cache latency\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\n\u003Cp>The authors describe a version integrated into \u003Ca href=\"https:\u002F\u002Fdocs.vllm.ai\u002F\">vLLM\u003C\u002Fa> and tuned with custom Triton, HIP, and FlyDSL kernels. That matters because the open-source baseline is not a fair target unless the compression path is also competitive at the kernel level.\u003C\u002Fp>\n\u003Cul>\n  \u003Cli>Target runtime: vLLM on AMD Instinct GPUs\u003C\u002Fli>\n  \u003Cli>Optimization stack: Triton, native HIP ISA control, FlyDSL\u003C\u002Fli>\n  \u003Cli>Goal: reduce KV-cache footprint without breaking serving throughput\u003C\u002Fli>\n\u003C\u002Ful>\n\u003Ch2>2. TQ4\u002F4 as the default production setting\u003C\u002Fh2>\n\u003Cp>The post recommends TQ4\u002F4, meaning 4-bit keys and 4-bit values, as the default production choice. That recommendation comes from a tradeoff curve that balances compression, accuracy, and runtime cost better than more aggressive or more complex variants.\u003C\u002Fp>\n\u003Cp>For readers choosing a deployment setting, this is the clearest practical takeaway in the article. The authors also note that keys are more sensitive than values, so the implementation puts rotation and LUT-based quantization on K, while using standard uniform quantization for V.\u003C\u002Fp>\n\u003Cul>\n  \u003Cli>K-side gets rotation plus LUT quantization\u003C\u002Fli>\n  \u003Cli>V-side uses standard uniform quantization\u003C\u002Fli>\n  \u003Cli>2-bit modes are possible, but the overhead is harder to justify\u003C\u002Fli>\n\u003C\u002Ful>\n\u003Ch2>3. Boundary-layer skipping for softmax models\u003C\u002Fh2>\n\u003Cp>One of the simplest accuracy fixes is to skip quantizing the first and last layers for full-attention models. The article says those boundary layers are often more sensitive to KV quantization, and leaving them in full precision can recover meaningful accuracy for a modest loss in compression.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781299085839-ypdl.png\" alt=\"TurboQuant on AMD GPUs cuts KV-cache latency\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\n\u003Cp>This is not applied everywhere. The authors follow the vLLM heuristic of using boundary-layer skipping for softmax attention models, while not carrying that rule over to hybrid attention models such as Qwen3.5.\u003C\u002Fp>\n\u003Ccode>--kv-cache-dtype-skip-layers\n# used for boundary layers on softmax attention models\u003C\u002Fcode>\n\u003Ch2>4. Walsh-Hadamard rotation instead of random rotation\u003C\u002Fh2>\n\u003Cp>The original TurboQuant design allows random rotation, but the ROCm implementation prefers Walsh-Hadamard transform, or WHT. The reason is straightforward: it is friendlier to kernels and it also performs better in the reported experiments.\u003C\u002Fp>\n\u003Cp>That choice shows up in both accuracy and implementation simplicity. The post says WHT spreads energy well, which helps the quantizer, and it avoids the awkwardness of dense random rotation paths in production kernels.\u003C\u002Fp>\n\u003Cul>\n  \u003Cli>Better kernel fit than random rotation\u003C\u002Fli>\n  \u003Cli>Better empirical accuracy in the tested setups\u003C\u002Fli>\n  \u003Cli>Matches the direction taken by TurboQuant+ and llama.cpp work\u003C\u002Fli>\n\u003C\u002Ful>\n\u003Ch2>5. Drop QJL in the 4-bit path\u003C\u002Fh2>\n\u003Cp>The article is unusually direct about QJL: at the 4-bit budget, it adds complexity and runtime overhead without helping accuracy. In the authors’ tests, omitting QJL produced the strongest results among the configurations they compared.\u003C\u002Fp>\n\u003Cp>They also diagnose why some QJL variants fail. A raw Gaussian projection matrix underperforms, while orthogonalized Gaussian and Walsh-Hadamard projections recover much of the gap. Even so, the 4-bit path is best served by skipping QJL altogether.\u003C\u002Fp>\n\u003Cul>\n  \u003Cli>Raw Gaussian QJL performs worst on keys\u003C\u002Fli>\n  \u003Cli>Orthogonal-Gaussian and Walsh-Hadamard recover most of the loss\u003C\u002Fli>\n  \u003Cli>At 4 bits, MSE-only beats every K-side QJL variant in the sweep\u003C\u002Fli>\n\u003C\u002Ful>\n\u003Ch2>What to pick\u003C\u002Fh2>\n\u003Cp>If you are deploying long-context, multi-turn agents, start with TQ4\u002F4, WHT rotation, and boundary-layer skipping for softmax models. That combination gives the best mix of compression and serving behavior in the article’s production setup.\u003C\u002Fp>\n\u003Cp>If your workload is less memory-bound or your accuracy bar is tighter, stay closer to BF16 or FP8 and use the TurboQuant findings as a guide for which parts of the cache path are worth compressing first. The clearest rule here is that KV-cache pressure is where TurboQuant pays off most.\u003C\u002Fp>","TurboQuant on AMD GPUs improves long-context LLM serving with up to 3.6x speedup and far lower KV-cache pressure.","rocm.blogs.amd.com","https:\u002F\u002Frocm.blogs.amd.com\u002Fartificial-intelligence\u002Fturboquant-vllm-agentic\u002FREADME.html",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781299067778-3pzd.png","industry","en","4fae6813-4bb1-459e-9556-1bd8b0b4ca4e",[17,18,19,20,21,22,23,24,25,26],"TurboQuant","AMD GPUs","ROCm","vLLM","KV cache","LLM inference","agentic workloads","HIP","Triton","FlyDSL",[28,29,30],"TurboQuant on AMD GPUs is aimed at KV-cache-bound serving, not general compute-bound inference.","TQ4\u002F4 is the recommended production setting in the post, with WHT rotation and no QJL at 4 bits.","On a 100-conversation agentic test, TTFT dropped from 13.9 s to 0.89 s and cache hit rate rose from 5.3% to 67.7%.",0,"2026-06-12T21:17:26.07+00:00","2026-06-12T21:17:26.063+00:00","cc1bbc9d-156b-47b1-8c38-554dfca04095",{"tags":36,"relatedLang":47,"relatedPosts":51},[37,39,41,43,45],{"name":21,"slug":38},"kv-cache",{"name":20,"slug":40},"vllm",{"name":18,"slug":42},"amd-gpus",{"name":19,"slug":44},"rocm",{"name":17,"slug":46},"turboquant",{"id":15,"slug":48,"title":49,"language":50},"turboquant-amd-gpus-kv-cache-latency-zh","TurboQuant 在 AMD GPU 上把長上下文延遲壓下來","zh",[52,58,64,70,76,82],{"id":53,"slug":54,"title":55,"cover_image":56,"image_url":56,"created_at":57,"category":13},"d96ff33a-47a4-421f-b7d4-ded157b345b6","anthropic-public-record-ai-anxiety-policy-en","Anthropic’s survey turns AI anxiety into policy","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781327893716-5hv3.png","2026-06-13T05:17:42.92009+00:00",{"id":59,"slug":60,"title":61,"cover_image":62,"image_url":62,"created_at":63,"category":13},"07f6818a-6612-4e79-a0b6-7b5014fadafc","chatgpt-grew-from-chatbot-to-platform-en","ChatGPT grew from chatbot to platform","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781325174493-j6tn.png","2026-06-13T04:32:28.006595+00:00",{"id":65,"slug":66,"title":67,"cover_image":68,"image_url":68,"created_at":69,"category":13},"c750890e-4ddf-4e1c-85d5-a5bd4433620f","openai-files-confidential-ipo-after-122b-round-en","OpenAI Files Confidential IPO After $122B Round","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781323367848-n0ns.png","2026-06-13T04:02:24.359675+00:00",{"id":71,"slug":72,"title":73,"cover_image":74,"image_url":74,"created_at":75,"category":13},"b0cb27e2-ca71-40a2-a012-73627f1c995c","government-access-orders-frontier-model-access-en","Government access orders should govern frontier model access","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781319762267-0x3b.png","2026-06-13T03:02:19.503078+00:00",{"id":77,"slug":78,"title":79,"cover_image":80,"image_url":80,"created_at":81,"category":13},"fac6f2b6-6a69-4fef-83c8-45eb5d323004","claude-code-cursor-copilot-2026-ai-agents-en","Claude Code, Cursor, and Copilot set the 2026 bar","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781317069662-0zc1.png","2026-06-13T02:17:22.342047+00:00",{"id":83,"slug":84,"title":85,"cover_image":86,"image_url":86,"created_at":87,"category":13},"34c51881-fb17-4b47-a6d3-be251db39bee","anthropic-claude-design-partner-risk-en","Anthropic’s Claude Design launch exposed partner risk","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781316166402-myx8.png","2026-06-13T02:02:21.284902+00:00",[89,94,99,104,109,114,119,124,129,134],{"id":90,"slug":91,"title":92,"created_at":93},"d35a1bd9-e709-412e-a2df-392df1dc572a","ai-impact-2026-developments-market-en","AI's Impact in 2026: Key Developments and Market Shifts","2026-03-25T16:20:33.205823+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"5ed27921-5fd6-492e-8c59-78393bf37710","trumps-ai-legislative-framework-en","Trump's AI Legislative Framework: What's Inside?","2026-03-25T16:22:20.005325+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"e454a642-f03c-4794-b185-5f651aebbaca","nvidia-gtc-2026-key-highlights-innovations-en","NVIDIA GTC 2026: Key Highlights and Innovations","2026-03-25T16:22:47.882615+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"0ebb5b16-774a-4922-945d-5f2ce1df5a6d","claude-usage-diversifies-learning-curves-en","Claude Usage Diversifies, Learning Curves Emerge","2026-03-25T16:25:50.770376+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"69934e86-2fc5-4280-8223-7b917a48ace8","openclaw-ai-commoditization-concerns-en","OpenClaw's Rise Raises Concerns of AI Model Commoditization","2026-03-25T16:26:30.582047+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"b4b2575b-2ac8-46b2-b90e-ab1d7c060797","google-gemini-ai-rollout-2026-en","Google's Gemini AI Rollout Extended to 2026","2026-03-25T16:28:14.808842+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"6e18bc65-42ae-4ad0-b564-67d7f66b979e","meta-llama4-fabricated-results-scandal-en","Meta's Llama 4 Scandal: Fabricated AI Test Results Unveiled","2026-03-25T16:29:15.482836+00:00",{"id":125,"slug":126,"title":127,"created_at":128},"bf888e9d-08be-4f47-996c-7b24b5ab3500","accenture-mistral-ai-deployment-en","Accenture and Mistral AI Team Up for AI Deployment","2026-03-25T16:31:01.894655+00:00",{"id":130,"slug":131,"title":132,"created_at":133},"5382b536-fad2-49c6-ac85-9eb2bae49f35","mistral-ai-high-stakes-2026-en","Mistral AI: Facing High Stakes in 2026","2026-03-25T16:31:39.941974+00:00",{"id":135,"slug":136,"title":137,"created_at":138},"9da3d2d6-b669-4971-ba1d-17fdb3548ed5","cursors-meteoric-rise-pressures-en","Cursor's Meteoric Rise Faces Industry Pressures","2026-03-25T16:32:21.899217+00:00"]