UltraQuant: 4-bit KV caching for long agents

OraCore Editors

[RSCH] June 25, 2026 · 9 min read · OraCore Editors

UltraQuant shows 4-bit KV caching can speed long, multi-turn agent serving while keeping more context resident.

Research org: Advanced Micro Devices + UCLA + Purdue University
Core data: 3.47× P50 TTFT improvement in late rounds
Breakthrough: FP4 KV tensors with UE8M0 scales on CDNA4 scaled-MFMA

Long-context agents are a memory problem as much as a model problem. When a session stretches across many turns, the KV cache grows, fills HBM, and starts pushing useful context out of device memory. This paper looks at what happens when you compress that cache to 4 bits without breaking serving performance.

The practical question is not just “can we shrink KV state?” It is “can we shrink it in a way that still works for agentic workloads, where the same prefix is reused over and over and concurrency matters?” UltraQuant argues that the answer is yes, but only if you treat quality, cache residency, and kernel efficiency as one system.

What problem this paper is trying to fix

Modern LLM agents do more than chat. They browse, inspect repositories, invoke tools, and carry a long working memory across many rounds. That means the KV cache can become a first-order consumer of high-bandwidth memory, especially as context windows move toward one million tokens and beyond.

The paper frames this as a serving bottleneck. If the cache is too large, the system evicts useful prefixes and has to re-prefill them later. If compression introduces too much overhead, the model may save memory but lose latency. So the real target is not raw compression alone; it is preserving resident context while keeping TTFT and TPOT under control.

UltraQuant is positioned against two anchors in the paper: TurboQuant-style 4-bit quantization as a quality anchor, and vLLM FP8 KV caching as a deployment anchor. FP8 already gives roughly 2× compression over a 16-bit cache with near-lossless quality and native hardware support, so a 4-bit scheme has to justify itself beyond just smaller tensors.
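To make the memory argument concrete, here is a minimal sizing sketch. The model shape, the 32-element scale group, and the helper names are assumptions chosen for illustration; none of them come from the paper.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_element,
                       group_size=None, scale_bits=0):
    """Bytes of KV cache that one token occupies across all layers.

    The factor 2 covers keys and values. If group_size is given, one scale
    of scale_bits is stored per group of elements.
    """
    elements = 2 * num_layers * num_kv_heads * head_dim
    bits = elements * bits_per_element
    if group_size:
        bits += (elements // group_size) * scale_bits
    return bits / 8


def tokens_resident(budget_gib, bytes_per_token):
    """How many tokens of KV state fit in a memory budget given in GiB."""
    return int(budget_gib * 2**30 // bytes_per_token)


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bf16 = kv_bytes_per_token(**shape, bits_per_element=16)
fp8 = kv_bytes_per_token(**shape, bits_per_element=8)
# 4-bit elements plus one 8-bit scale per 32 elements (assumed group size).
fp4 = kv_bytes_per_token(**shape, bits_per_element=4,
                         group_size=32, scale_bits=8)

print(bf16, fp8, fp4)                  # 131072.0 65536.0 34816.0
print(round(fp8 / fp4, 2))             # 1.88
print(tokens_resident(40, fp8), tokens_resident(40, fp4))
```

Under these assumptions the 4-bit cache holds a little under twice as many tokens as FP8 in the same budget, because the per-group scales take back part of the saving.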

How the method works in plain English

The paper studies two related paths. The first is Ultra-TurboQuant, or Ultra-TQ, which keeps the TurboQuant representation but improves the implementation. The second is UltraQuant, which goes a step further and replaces the codebook with a hardware-native FP4 approximation path.

Ultra-TQ starts from the same basic idea as TurboQuant: rotate the KV vectors so outliers are spread across channels, making the distribution easier to quantize. The paper uses a Walsh–Hadamard rotation, removes QJL, and treats keys and values asymmetrically because the two behave differently under quantization.
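The effect of the rotation can be seen with a small, self-contained sketch. This is a plain orthonormal fast Walsh–Hadamard transform written for illustration; it is not the paper's kernel, and the input vector is made up.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform.

    len(vec) must be a power of two. Applying it twice returns the input.
    """
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


# Made-up vector: one channel carries a large outlier.
x = [0.1] * 16
x[3] = 8.0
y = fwht(x)

peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in y)
print(peak_before, round(peak_after, 3))   # 8.0 2.375
```

The rotation preserves the vector's norm but spreads the outlier's energy over all 16 channels, so a quantizer with a fixed number of levels no longer has to reserve its range for a single extreme value.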

One practical tweak the paper emphasizes is calibrated centroids. Instead of relying only on theoretical Lloyd–Max centroids, it refits the 16-entry table on captured activations. The paper says this is cheap: a single forward pass over about 20 sampled vectors per rotated layer, and in the real implementation it is only applied to the 10% of layers with higher per-element quantization MSE.
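Refitting a small table on observed data is essentially a one-dimensional Lloyd (k-means) procedure. The sketch below illustrates that idea on synthetic samples. The sample distribution, the evenly spaced starting table, and the function names are assumptions; the paper's actual calibration procedure may differ.

```python
import random


def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def mse(samples, centroids):
    """Mean squared error when each sample snaps to its nearest centroid."""
    return sum((s - centroids[nearest(s, centroids)]) ** 2
               for s in samples) / len(samples)


def refit_centroids(samples, centroids, iterations=10):
    """Lloyd iterations: assign samples to their nearest centroid, then move
    each centroid to the mean of its samples (empty buckets stay put)."""
    centroids = list(centroids)
    for _ in range(iterations):
        buckets = [[] for _ in centroids]
        for s in samples:
            buckets[nearest(s, centroids)].append(s)
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids


rng = random.Random(0)
# Synthetic stand-in for captured activations, heavier-tailed than a Gaussian.
samples = [rng.gauss(0, 1) * (3.0 if rng.random() < 0.1 else 1.0)
           for _ in range(2000)]
# Stand-in starting table: 16 evenly spaced levels, not the Lloyd-Max table.
initial = [-3.0 + 6.0 * i / 15 for i in range(16)]
calibrated = refit_centroids(samples, initial)

print(mse(samples, calibrated) <= mse(samples, initial))   # True
```

Each Lloyd iteration can only keep the error the same or lower it on the samples it was fitted to, which is why a refit table never does worse than its starting point on the calibration data.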

UltraQuant then swaps the codebook path for an FP4 micro-tensor approach. The abstract says this uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support on CDNA4. In other words, the design tries to make dequantization disappear into the matrix core instead of paying extra software lookup costs.
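As a rough software model of that format, the sketch below quantizes one group of values to 4-bit E2M1 magnitudes with a shared power-of-two scale, which is what a UE8M0 scale (exponent only, no mantissa) can represent. The scale-selection rule, the group length, and the function names are assumptions for illustration; on the hardware path described in the article this work happens inside the matrix instruction rather than in Python.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is a separate bit).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_group(values):
    """Return (scale_exponent, codes) for one group.

    The shared scale is 2**scale_exponent. Each code is (sign, level_index).
    Assumption: the scale is the smallest power of two that avoids clipping.
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [(1, 0)] * len(values)
    exponent = math.ceil(math.log2(amax / E2M1_LEVELS[-1]))
    scale = 2.0 ** exponent
    codes = []
    for v in values:
        mag = abs(v) / scale
        idx = min(range(len(E2M1_LEVELS)),
                  key=lambda i: abs(mag - E2M1_LEVELS[i]))
        codes.append((-1 if v < 0 else 1, idx))
    return exponent, codes


def dequantize_group(exponent, codes):
    """Rebuild approximate values from the shared exponent and the codes."""
    scale = 2.0 ** exponent
    return [sign * E2M1_LEVELS[idx] * scale for sign, idx in codes]


# Made-up group of eight values; real groups are typically larger.
group = [0.02, -0.5, 0.13, 0.9, -0.07, 0.31, 0.0, -1.2]
exponent, codes = quantize_group(group)
restored = dequantize_group(exponent, codes)
worst = max(abs(a - b) for a, b in zip(group, restored))
print(exponent, worst <= 2.0 ** exponent)   # -2 True
```

Because the scale is a pure power of two, applying it is an exponent adjustment rather than a multiply by an arbitrary float, which is part of why such formats map well onto native scaled matrix instructions.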

That matters because codebook quantization can be accurate but still awkward for serving. The paper explicitly calls out lookup and irregular-access overhead as a weakness of codebook-based methods. UltraQuant’s bet is that a hardware-native format is easier to run at speed than a more exact but less convenient representation.

What the paper actually shows

The evaluation is built around an agentic workload rather than a single-turn benchmark. The paper uses vLLM’s native multi-turn benchmark with ShareGPT conversation data, serves 32 concurrent chat sessions, and reports P50 TTFT and P50 TPOT. That setup is meant to simulate cache pressure in long-running sessions where prefixes get reused across turns.

On a long-context, multi-turn agentic workload, UltraQuant reports a 3.47× improvement in P50 TTFT in the late rounds, where the cache is most pressured. Across all rounds, the TTFT improvement is 2.3×, and output throughput rises by 1.63× over the FP8 KV baseline.

The paper also notes an important nuance: UltraQuant is not uniformly faster in every phase. In the warm rounds, the reported P50 TTFT ratio is 0.86× relative to FP8 KV. A ratio below 1 means FP8 is faster there. The advantage shows up later, when long per-client prefixes exceed the effective resident-cache capacity of FP8 and UltraQuant keeps more prefixes on device.
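For readers who want to reproduce this kind of ratio on their own traces, here is a minimal sketch. It assumes the ratio is defined as baseline P50 divided by candidate P50, so values above 1 favor the candidate; that convention and every latency number below are illustrative assumptions, not measurements from the paper.

```python
from statistics import median


def p50_speedup(baseline_latencies, candidate_latencies):
    """Baseline P50 divided by candidate P50; above 1 favors the candidate."""
    return median(baseline_latencies) / median(candidate_latencies)


# Hypothetical TTFT samples in milliseconds, for illustration only.
warm_fp8, warm_fp4 = [100, 120, 140], [140, 150, 160]
late_fp8, late_fp4 = [900, 1000, 1100], [240, 250, 260]

print(p50_speedup(warm_fp8, warm_fp4))   # 0.8, the baseline is faster
print(p50_speedup(late_fp8, late_fp4))   # 4.0, the candidate is faster
```

Reporting the ratio per phase, rather than a single average, is what exposes the crossover between warm and cache-pressured rounds.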

That makes the result more interesting than a simple compression story. The improvement is attributed to cache residency: prefixes stay on device instead of being evicted and re-prefilled. The per-round breakdown shows UltraQuant keeping both TTFT and TPOT low across all six conversation rounds while FP8 degrades as context accumulates.
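A toy capacity model shows why the gap opens only in later rounds. The session count and round count follow the setup described above; the tokens added per round and the cache capacities are hypothetical, and the model ignores eviction policy, scale overhead, and prefix sharing.

```python
def first_round_over_capacity(sessions, tokens_per_round, rounds,
                              capacity_tokens):
    """Return the first round (1-based) whose accumulated context no longer
    fits in the cache, or None if every round fits."""
    for r in range(1, rounds + 1):
        if sessions * tokens_per_round * r > capacity_tokens:
            return r
    return None


SESSIONS, ROUNDS = 32, 6
TOKENS_PER_ROUND = 8_000           # hypothetical growth per session per round
FP8_CAPACITY = 800_000             # hypothetical resident capacity in tokens
FP4_CAPACITY = 2 * FP8_CAPACITY    # idealized: half the bytes per token

fp8_spill = first_round_over_capacity(SESSIONS, TOKENS_PER_ROUND, ROUNDS,
                                      FP8_CAPACITY)
fp4_spill = first_round_over_capacity(SESSIONS, TOKENS_PER_ROUND, ROUNDS,
                                      FP4_CAPACITY)
print(fp8_spill, fp4_spill)   # 4 None
```

In this toy setting both caches hold everything for the first three rounds, so neither has a residency advantage. From round four the smaller-capacity cache must evict and later re-prefill, while the larger one still fits all six rounds.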

The paper also reports an accuracy-related result for calibrated centroids. Compared with an apples-to-apples fakequant control, refitting the codebook lowers per-element K quantization MSE by 10.3%, from 1.32×10^-4 to 1.18×10^-4. The excerpt provided does not include a full task-accuracy table, so this MSE result is the clearest concrete quality number available here.

What developers should take away

If you are building or serving context-heavy agents, this paper is a reminder that KV-cache design is part of the product, not just a backend detail. A 4-bit scheme only helps if it survives the full path from quantization to kernel execution to multi-turn serving behavior.

The strongest engineering lesson is that compression format and deployment format are not the same thing. TurboQuant-style codebooks may be a good algorithmic anchor, but if they force expensive lookup or dequantization, the serving system may give back the gains. UltraQuant’s approach is to align the representation with the hardware path instead of treating them separately.

Another useful takeaway is that the best metric is workload-specific. This paper does not present a generic “one benchmark to rule them all” claim. It focuses on long-context, concurrent, multi-turn agent sessions, which is exactly where cache residency and latency interact in messy ways. If your workload is mostly short prompts, the tradeoff may look different.

Limitations and open questions

The abstract and notes provided do not include a broad set of benchmark numbers beyond the agentic serving results and the quantization MSE example. That means readers should be careful about extrapolating the gains to other models, other workloads, or other hardware.

The paper is also explicit that its gains are tied to AMD Instinct GPUs and CDNA4 native support. UltraQuant’s FP4 path depends on hardware features such as scaled-MFMA, so the implementation story may not transfer cleanly to other accelerators.

There is also a tradeoff visible in the results themselves: the warm rounds do not show the same benefit as the cache-pressured late rounds. That suggests the win comes from reducing eviction and keeping more context resident, not from a universal per-token speedup.

For developers, the main question is whether your serving stack is already bottlenecked by resident KV capacity. If it is, UltraQuant points to a path where 4-bit caching is not just a memory optimization but a latency optimization for real agent workloads. If not, the payoff may be smaller than the headline numbers suggest.

Bottom line

UltraQuant is a serving-first take on 4-bit KV caching for long-running agents. It combines quantization choices, cache layout, and AMD GPU kernel support to make compressed KV state practical in a multi-turn setting.

The paper’s core message is simple: for context-heavy agents, the right 4-bit cache can improve both memory residency and end-to-end serving speed, but only when the representation and the hardware path are designed together.

4-bit KV caching is only useful if it helps resident context, not just tensor size.
UltraQuant’s strongest gains show up in late, cache-pressured rounds.
The paper’s main contribution is a hardware-aware path from quantization to serving.
