OraCore Editors · 5 min read

TurboQuant on AMD GPUs cuts KV-cache latency

TurboQuant on AMD GPUs improves long-context LLM serving with up to 3.6x speedup and far lower KV-cache pressure.

TurboQuant on AMD GPUs lowers KV-cache pressure and speeds up long-context LLM inference.

TurboQuant is most useful when the KV cache, not compute, limits serving. This ROCm write-up shows how optimized kernels make it practical on AMD GPUs: the post reports up to 3.6x end-to-end speedup over the open-source vLLM TurboQuant baseline.

| Item | What it changes | Reported result |
|---|---|---|
| TQ4/4 | 4-bit K and 4-bit V compression | Recommended default balance |
| Agentic workload test | 100 conversations, 32 concurrency, ~25K prefixes | TTFT 13.9 s to 0.89 s |
| Cache hit rate | FP8 vs TQ4/4 | 5.3% to 67.7% |
| End-to-end speedup | Optimized ROCm kernels vs open-source baseline | Up to 3.6x |

1. Production TurboQuant on ROCm

The core story is not just that TurboQuant compresses KV cache. It is that the AMD ROCm implementation makes the algorithm practical for serving, where kernel quality, memory behavior, and latency matter as much as accuracy.

The authors describe a version integrated into vLLM and tuned with custom Triton, HIP, and FlyDSL kernels. That matters because a smaller cache only helps serving if the compression path itself is competitive at the kernel level, which is where the post says the open-source baseline falls behind.

  • Target runtime: vLLM on AMD Instinct GPUs
  • Optimization stack: Triton, native HIP ISA control, FlyDSL
  • Goal: reduce KV-cache footprint without breaking serving throughput
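To make the footprint goal concrete, here is a back-of-the-envelope sketch of per-token KV-cache size at different bit widths. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example rather than a figure from the post, and the calculation ignores the scale or codebook metadata that real quantized formats carry.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       k_bits: int, v_bits: int) -> float:
    """Bytes of KV cache stored per token, ignoring scale/metadata overhead."""
    elems_per_side = num_layers * num_kv_heads * head_dim  # K and V each store this many values
    return elems_per_side * (k_bits + v_bits) / 8

# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bf16 = kv_bytes_per_token(**SHAPE, k_bits=16, v_bits=16)
fp8 = kv_bytes_per_token(**SHAPE, k_bits=8, v_bits=8)
tq44 = kv_bytes_per_token(**SHAPE, k_bits=4, v_bits=4)

for name, size in [("BF16", bf16), ("FP8", fp8), ("TQ4/4", tq44)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

Under these assumptions a 4-bit cache fits roughly twice as many tokens as FP8 in the same memory, before metadata. That is consistent in direction with the higher cache hit rate the post reports for TQ4/4.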

2. TQ4/4 as the default production setting

The post recommends TQ4/4, meaning 4-bit keys and 4-bit values, as the default production choice. The recommendation reflects a tradeoff: TQ4/4 balances compression, accuracy, and runtime cost better than more aggressive or more complex variants.

For readers choosing a deployment setting, this is the clearest practical takeaway in the article. The authors also note that keys are more sensitive than values, so the implementation puts rotation and LUT-based quantization on K, while using standard uniform quantization for V.

  • K-side gets rotation plus LUT quantization
  • V-side uses standard uniform quantization
  • 2-bit modes are possible, but the overhead is harder to justify
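The K/V split above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the post's kernels. The function names are hypothetical. The 16-entry lookup table is a stand-in built from Gaussian quantiles, not the codebook the authors use. The per-vector RMS scaling is also an assumption. Keys are assumed to have been rotated already.

```python
from statistics import NormalDist

LEVELS = 16  # 4 bits per stored value

def quantize_uniform(v):
    """V-side: uniform 4-bit quantization with a per-vector offset and step."""
    lo, hi = min(v), max(v)
    scale = (hi - lo) / (LEVELS - 1) or 1.0  # fall back to 1.0 for constant vectors
    codes = [round((x - lo) / scale) for x in v]
    return codes, lo, scale

def dequantize_uniform(codes, lo, scale):
    return [lo + c * scale for c in codes]

# Hypothetical non-uniform table: midpoints of 16 equal-probability Gaussian bins.
LUT = [NormalDist().inv_cdf((i + 0.5) / LEVELS) for i in range(LEVELS)]

def quantize_lut(k_rotated):
    """K-side: normalize by RMS, then store the index of the nearest LUT entry."""
    rms = (sum(x * x for x in k_rotated) / len(k_rotated)) ** 0.5 or 1.0
    codes = [min(range(LEVELS), key=lambda i: abs(LUT[i] - x / rms))
             for x in k_rotated]
    return codes, rms

def dequantize_lut(codes, rms):
    return [LUT[c] * rms for c in codes]

v = [0.0, 0.5, 1.0, -1.0]
v_codes, v_lo, v_scale = quantize_uniform(v)
v_hat = dequantize_uniform(v_codes, v_lo, v_scale)

k = [0.3, -1.2, 0.05, 2.0]
k_codes, k_rms = quantize_lut(k)
k_hat = dequantize_lut(k_codes, k_rms)
```

The uniform path needs only an offset and a step per vector, while the LUT path spends its 16 levels where rotated values are assumed to be dense. A production kernel would also pack two 4-bit codes into each byte, which this sketch omits.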

3. Boundary-layer skipping for softmax models

One of the simplest accuracy fixes is to skip quantizing the first and last layers for full-attention models. The article says those boundary layers are often more sensitive to KV quantization, and leaving them in full precision can recover meaningful accuracy for a modest loss in compression.

This is not applied everywhere. The authors follow the vLLM heuristic of using boundary-layer skipping for softmax attention models, while not carrying that rule over to hybrid attention models such as Qwen3.5.

--kv-cache-dtype-skip-layers # used for boundary layers on softmax attention models
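The selection rule described in this section can be sketched as a small helper. The function name and the boolean switch are hypothetical and only mirror the heuristic as the article states it. The sketch does not reproduce vLLM's actual logic or the value format the flag expects.

```python
def layers_to_quantize(num_layers: int, softmax_attention: bool) -> list[int]:
    """Return the layer indices whose KV cache is quantized.

    For softmax (full-attention) models the first and last layers stay in
    full precision; for hybrid-attention models no layer is skipped.
    """
    skipped = {0, num_layers - 1} if softmax_attention else set()
    return [i for i in range(num_layers) if i not in skipped]

# Hypothetical 32-layer softmax-attention model.
quantized = layers_to_quantize(32, softmax_attention=True)
uncompressed_fraction = 1 - len(quantized) / 32
```

For this hypothetical 32-layer model, 2 of 32 layers (6.25%) stay uncompressed, which illustrates why the compression loss from skipping is modest.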

4. Walsh-Hadamard rotation instead of random rotation

The original TurboQuant design allows random rotation, but the ROCm implementation prefers the Walsh-Hadamard transform (WHT). The reason is straightforward: it is friendlier to kernels, and it also performs better in the reported experiments.

That choice shows up in both accuracy and implementation simplicity. The post says WHT spreads energy well, which helps the quantizer, and it avoids the awkwardness of dense random rotation paths in production kernels.

  • Better kernel fit than random rotation
  • Better empirical accuracy in the tested setups
  • Matches the direction taken by TurboQuant+ and llama.cpp work
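For readers unfamiliar with the transform, here is a minimal pure-Python sketch of an orthonormal fast Walsh-Hadamard transform. It is a textbook butterfly version for illustration, not the post's Triton or HIP kernel, and the function name is hypothetical.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart using adds and subtracts only.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving and self-inverse
    return [v * scale for v in out]

spike = [4.0, 0.0, 0.0, 0.0]
spread = fwht(spike)     # one outlier becomes four equal values
restored = fwht(spread)  # applying the transform again undoes it
```

The single outlier spreads evenly across all coordinates, which is the energy-spreading behavior the post credits with helping the quantizer. The transform needs only n log n additions and no stored matrix, which fits the article's point about kernel friendliness compared with a dense random rotation.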

5. Drop QJL in the 4-bit path

The article is unusually direct about QJL: at the 4-bit budget, it adds complexity and runtime overhead without helping accuracy. In the authors’ tests, omitting QJL produced the strongest results among the configurations they compared.

They also diagnose why some QJL variants fail. A raw Gaussian projection matrix underperforms, while orthogonalized Gaussian and Walsh-Hadamard projections recover much of the gap. Even so, the 4-bit path is best served by skipping QJL altogether.

  • Raw Gaussian QJL performs worst on keys
  • Orthogonal-Gaussian and Walsh-Hadamard recover most of the loss
  • At 4 bits, MSE-only beats every K-side QJL variant in the sweep
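To show what kind of step is being dropped, here is a sketch of a generic 1-bit sign-projection (QJL-style) inner-product estimator using a raw Gaussian projection. It illustrates the general technique, not the post's implementation: the function names, dimensions, and seed are hypothetical, and the row count is deliberately much larger than a real bit budget would allow so the estimate visibly converges.

```python
import math
import random

def qjl_encode(k, proj):
    """Keep one sign bit per projection row, plus the norm of k."""
    signs = [1 if sum(s * x for s, x in zip(row, k)) >= 0 else -1 for row in proj]
    norm = math.sqrt(sum(x * x for x in k))
    return signs, norm

def qjl_inner_product(q, signs, norm, proj):
    """Estimate <q, k> from k's sign bits (unbiased for i.i.d. Gaussian rows)."""
    acc = sum(sgn * sum(s * x for s, x in zip(row, q))
              for row, sgn in zip(proj, signs))
    return math.sqrt(math.pi / 2) * norm * acc / len(proj)

rng = random.Random(0)
d, m = 16, 2048  # hypothetical head dimension and number of projection rows
proj = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
k = [rng.uniform(-1, 1) for _ in range(d)]
q = [rng.uniform(-1, 1) for _ in range(d)]

signs, k_norm = qjl_encode(k, proj)
exact = sum(a * b for a, b in zip(q, k))
estimate = qjl_inner_product(q, signs, k_norm, proj)
```

Every query has to be projected through the same matrix before it can be compared against the stored sign bits. That extra projection and bookkeeping is the kind of complexity the article says is not repaid at the 4-bit budget.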

What to pick

If you are deploying long-context, multi-turn agents, start with TQ4/4, WHT rotation, and boundary-layer skipping for softmax models. That combination gives the best mix of compression and serving behavior in the article’s production setup.

If your workload is less memory-bound or your accuracy bar is tighter, stay closer to BF16 or FP8 and use the TurboQuant findings as a guide for which parts of the cache path are worth compressing first. The clearest rule here is that KV-cache pressure is where TurboQuant pays off most.