TurboQuant on AMD GPUs cuts KV-cache latency

OraCore Editors

[IND] June 13, 2026 | 5 min read | OraCore Editors

TurboQuant on AMD GPUs improves long-context LLM serving with up to 3.6x speedup and far lower KV-cache pressure.

Tags: KV cache, vLLM, TurboQuant

TurboQuant on AMD GPUs lowers KV-cache pressure and speeds up long-context LLM inference.

TurboQuant is most useful when KV cache, not compute, limits serving, and this ROCm write-up shows how AMD GPUs can close the gap with optimized kernels. The post reports up to 3.6x end-to-end speedup over the open-source vLLM TurboQuant baseline.

Item	What it changes	Reported result
TQ4/4	4-bit K and 4-bit V compression	Recommended default balance
Agentic workload test	100 conversations, 32 concurrency, ~25K prefixes	TTFT 13.9 s to 0.89 s
Cache hit rate	FP8 vs TQ4/4	5.3% to 67.7%
End-to-end speedup	Optimized ROCm kernels vs open-source baseline	Up to 3.6x

1. Production TurboQuant on ROCm

The core story is not just that TurboQuant compresses KV cache. It is that the AMD ROCm implementation makes the algorithm practical for serving, where kernel quality, memory behavior, and latency matter as much as accuracy.

The authors describe a version integrated into vLLM and tuned with custom Triton, HIP, and FlyDSL kernels. That matters because comparing against the open-source baseline is only meaningful if the compression path itself is competitive at the kernel level.

Target runtime: vLLM on AMD Instinct GPUs
Optimization stack: Triton, native HIP ISA control, FlyDSL
Goal: reduce KV-cache footprint without breaking serving throughput
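
To make the footprint goal concrete, here is a rough back-of-the-envelope sketch of KV-cache size per token at different bit widths. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not a model from the article, and the calculation ignores scales and other quantization metadata.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, k_bits, v_bits):
    """Bytes of KV cache one token occupies across all layers.

    Ignores per-block scales, zero points, and other metadata, so real
    quantized formats will come out somewhat larger.
    """
    elements_per_side = num_layers * num_kv_heads * head_dim
    total_bits = elements_per_side * (k_bits + v_bits)
    return total_bits / 8


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

bf16 = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 16, 16)
fp8 = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 8, 8)
tq44 = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 4, 4)

for name, size in [("BF16", bf16), ("FP8", fp8), ("TQ4/4", tq44)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

Under these assumptions a 4-bit K and V cache holds roughly twice as many tokens as FP8 in the same memory, which is the kind of headroom that lets more prefixes stay resident.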

2. TQ4/4 as the default production setting

The post recommends TQ4/4, meaning 4-bit keys and 4-bit values, as the default production choice. The recommendation reflects a tradeoff: TQ4/4 balances compression, accuracy, and runtime cost better than more aggressive or more complex variants.

For readers choosing a deployment setting, this is the clearest practical takeaway in the article. The authors also note that keys are more sensitive than values, so the implementation puts rotation and LUT-based quantization on K, while using standard uniform quantization for V.

K-side gets rotation plus LUT quantization
V-side uses standard uniform quantization
2-bit modes are possible, but the overhead is harder to justify
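
The difference between the two sides can be sketched in a few lines. This is a minimal illustration, not the ROCm kernels: the 16-entry codebook below is a made-up table that is denser near zero, the per-vector min/max scaling is one common choice among several, and the rotation that would precede the K-side lookup is omitted.

```python
def quantize_uniform_4bit(values):
    """V-side sketch: map each value to one of 16 evenly spaced levels."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_uniform_4bit(codes, lo, scale):
    """Rebuild approximate values from 4-bit codes."""
    return [lo + c * scale for c in codes]


# Hypothetical 16-entry codebook, denser near zero. Not the article's table.
LUT = [-2.7, -2.0, -1.5, -1.1, -0.8, -0.5, -0.3, -0.1,
       0.1, 0.3, 0.5, 0.8, 1.1, 1.5, 2.0, 2.7]


def quantize_lut_4bit(values, lut=LUT):
    """K-side sketch: store the index of the nearest codebook entry."""
    return [min(range(len(lut)), key=lambda i: abs(lut[i] - v)) for v in values]


def dequantize_lut_4bit(codes, lut=LUT):
    """Look each 4-bit code back up in the codebook."""
    return [lut[c] for c in codes]


v_codes, v_lo, v_scale = quantize_uniform_4bit([-1.0, 0.0, 0.5, 2.0])
print(v_codes, dequantize_uniform_4bit(v_codes, v_lo, v_scale))

k_codes = quantize_lut_4bit([0.12, -2.9, 1.0])
print(k_codes, dequantize_lut_4bit(k_codes))
```

Both paths store 4 bits per element. The lookup table simply spends its 16 levels where rotated key values are expected to cluster, while the uniform path spaces them evenly across the observed range.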

3. Boundary-layer skipping for softmax models

One of the simplest accuracy fixes is to skip quantizing the first and last layers for full-attention models. The article says those boundary layers are often more sensitive to KV quantization, and leaving them in full precision can recover meaningful accuracy for a modest loss in compression.

This is not applied everywhere. The authors follow the vLLM heuristic of using boundary-layer skipping for softmax attention models, while not carrying that rule over to hybrid attention models such as Qwen3.5.

--kv-cache-dtype-skip-layers
# used for boundary layers on softmax attention models
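
The heuristic itself is easy to express. The helper below is a hypothetical sketch of the selection rule only: the function name, the `attention_kind` label, and the `num_boundary` parameter are assumptions for illustration, and it does not show the value format that the flag above accepts.

```python
def boundary_skip_layers(num_layers, attention_kind, num_boundary=1):
    """Return layer indices to keep in full precision.

    attention_kind is a hypothetical label: "softmax" for full-attention
    models; anything else (for example "hybrid") disables skipping.
    """
    if attention_kind != "softmax" or num_layers <= 2 * num_boundary:
        return []
    first = list(range(num_boundary))
    last = list(range(num_layers - num_boundary, num_layers))
    return first + last


print(boundary_skip_layers(32, "softmax"))  # [0, 31]
print(boundary_skip_layers(32, "hybrid"))   # []
```

With 32 layers, skipping two of them leaves 30 quantized, so the compression loss stays small relative to the whole cache.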

4. Walsh-Hadamard rotation instead of random rotation

The original TurboQuant design allows random rotation, but the ROCm implementation prefers the Walsh-Hadamard transform (WHT). The reason is straightforward: WHT maps more cleanly onto GPU kernels, and it also performs better in the reported experiments.

That choice shows up in both accuracy and implementation simplicity. The post says WHT spreads energy well, which helps the quantizer, and it avoids the awkwardness of dense random rotation paths in production kernels.

Better kernel fit than random rotation
Better empirical accuracy in the tested setups
Matches the direction taken by TurboQuant+ and llama.cpp work
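
A small pure-Python sketch shows why WHT suits this role. It is the textbook fast Walsh-Hadamard butterfly with orthonormal scaling, not the article's kernel, and the 8-element input is an arbitrary example.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly stage: combine pairs that are h apart using only adds and subtracts.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]


# A single outlier channel, the hard case for a 4-bit quantizer.
spike = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rotated = fwht(spike)
print(rotated)        # every entry has magnitude 8 / sqrt(8)
print(fwht(rotated))  # applying the transform again recovers the input, up to rounding
```

The transform needs only additions and subtractions in log2(n) stages, with no stored rotation matrix, and it is its own inverse. A dense random rotation would instead need an n-by-n matrix multiply per vector.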

5. Drop QJL in the 4-bit path

The article is unusually direct about QJL: at the 4-bit budget, it adds complexity and runtime overhead without helping accuracy. In the authors’ tests, omitting QJL produced the strongest results among the configurations they compared.

They also diagnose why some QJL variants fail. A raw Gaussian projection matrix underperforms, while orthogonalized Gaussian and Walsh-Hadamard projections recover much of the gap. Even so, the 4-bit path is best served by skipping QJL altogether.

Raw Gaussian QJL performs worst on keys
Orthogonal-Gaussian and Walsh-Hadamard recover most of the loss
At 4 bits, MSE-only beats every K-side QJL variant in the sweep
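
For readers unfamiliar with QJL, the sketch below shows the basic idea behind what is being dropped: keep one sign bit per random projection and use those bits to estimate an inner product. It uses a raw Gaussian projection, the variant the article reports as weakest, and applies it directly to a key rather than to a quantization residual. The dimensions and seed are arbitrary assumptions.

```python
import math
import random


def qjl_encode(key, proj):
    """Keep one sign bit per projection row, plus the key's norm."""
    signs = [1 if sum(p * k for p, k in zip(row, key)) >= 0 else -1 for row in proj]
    norm = math.sqrt(sum(k * k for k in key))
    return signs, norm


def qjl_inner_product(query, signs, norm, proj):
    """Estimate <query, key> from the sign bits, assuming Gaussian rows."""
    m = len(proj)
    acc = sum(s * sum(p * q for p, q in zip(row, query)) for s, row in zip(signs, proj))
    return math.sqrt(math.pi / 2) * norm * acc / m


rng = random.Random(0)
dim, rows = 8, 2048  # hypothetical sizes, far more rows than a real cache would store
proj = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]
key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]

exact = sum(q * k for q, k in zip(query, key))
signs, norm = qjl_encode(key, proj)
print(exact, qjl_inner_product(query, signs, norm, proj))
```

Even this toy version needs a projection matrix, an extra pass over it at encode time, and another at attention time, which illustrates the complexity and runtime overhead the authors decided was not worth paying at 4 bits.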

What to pick

If you are deploying long-context, multi-turn agents, start with TQ4/4, WHT rotation, no QJL, and boundary-layer skipping for softmax models. That combination gives the best mix of compression and serving behavior in the article’s production setup.

If your workload is less memory-bound or your accuracy bar is tighter, stay closer to BF16 or FP8 and use the TurboQuant findings as a guide for which parts of the cache path are worth compressing first. The clearest rule is that TurboQuant pays off most where KV-cache pressure, not compute, is the bottleneck.
