TurboQuant cuts KV cache memory 6x in Google tests
Google Research says TurboQuant compresses KV caches by more than 4x, cutting memory by at least 6x with no quality loss on long-context tests.

Google Research’s TurboQuant is a 2025 online vector-quantization method for compressing KV caches and embeddings, built to shrink high-dimensional vectors without breaking their structure. In tests on long-context LLM workloads, the team said it matched a full-precision baseline while delivering more than 4x compression.
| Item | Value |
|---|---|
| Proposal year | 2025 |
| KV-cache memory reduction | At least 6x |
| Attention-logit speedup on H100 | Up to 8x |
| Compression in long-context tests | More than 4x |
| KV-cache quality threshold | 3.5 bits per channel |
| Benchmark context length | 4,000 to 104,000 tokens |

What changed
TurboQuant was proposed by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni in the paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate.” The method targets three places where vector storage gets expensive: LLM inference, key-value cache compression, and nearest-neighbor search.

The algorithm comes in two variants. TurboQuant_mse minimizes mean squared error, while TurboQuant_prod targets unbiased inner-product estimates. Both apply a random rotation followed by per-coordinate scalar quantization; the prod variant then applies a one-bit Quantized Johnson–Lindenstrauss (QJL) transform to the residual to correct the error the scalar stage leaves behind.
- TurboQuant_mse stores each rotated coordinate as an index into a scalar codebook.
- TurboQuant_prod additionally stores a sign sketch of the residual plus the residual norm.
- The paper reports distortion shrinking as bit width grows, with example MSE values near 0.36, 0.117, 0.03, and 0.009 at 1, 2, 3, and 4 bits respectively.
- Google Research said it tested TurboQuant on LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval.
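
The rotate-then-quantize recipe and the sign-sketch correction can be illustrated with a minimal NumPy sketch. It is not Google's implementation: the codebooks here are the textbook Lloyd-Max levels for a standard Gaussian scaled by 1/sqrt(d), a dense QR-based rotation stands in for whatever fast transform a production kernel would use, and every function name (`random_rotation`, `mse_quantize`, `prod_quantize`, and so on) is hypothetical. In this sketch `bits` is the width of the scalar stage only; the sign sketch costs one extra bit per coordinate.

```python
import numpy as np

# Lloyd-Max reconstruction levels for a standard Gaussian.
# Assumption of this sketch: they stand in for the paper's codebooks.
GAUSSIAN_LEVELS = {
    1: np.array([-0.7979, 0.7979]),
    2: np.array([-1.5104, -0.4528, 0.4528, 1.5104]),
}

def random_rotation(d, rng):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix makes the draw uniform (Haar)

def mse_quantize(x, rotation, bits):
    """Rotate the normalized vector, then snap each coordinate to a level."""
    levels = GAUSSIAN_LEVELS[bits] / np.sqrt(x.shape[0])
    norm = np.linalg.norm(x)
    y = rotation @ (x / norm)
    codes = np.abs(y[:, None] - levels[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), norm

def mse_dequantize(codes, norm, rotation, bits):
    """Look up the levels, undo the rotation, restore the norm."""
    levels = GAUSSIAN_LEVELS[bits] / np.sqrt(codes.shape[0])
    return norm * (rotation.T @ levels[codes])

def prod_quantize(x, rotation, sketch, bits):
    """Scalar stage plus a one-bit sign sketch of the residual and its norm."""
    codes, norm = mse_quantize(x, rotation, bits)
    residual = x - mse_dequantize(codes, norm, rotation, bits)
    signs = np.sign(sketch @ residual)
    return codes, norm, signs, np.linalg.norm(residual)

def prod_inner_product(query, codes, norm, signs, res_norm, rotation, sketch, bits):
    """Estimate <query, x>: scalar-stage term plus sign-sketch correction."""
    x_mse = mse_dequantize(codes, norm, rotation, bits)
    m = sketch.shape[0]
    correction = np.sqrt(np.pi / 2) / m * res_norm * ((sketch @ query) @ signs)
    return query @ x_mse + correction

# Illustrative usage with made-up data.
rng = np.random.default_rng(0)
d = 128
x = rng.laplace(size=d)              # deliberately non-Gaussian input
q = rng.standard_normal(d)
rot = random_rotation(d, rng)
S = rng.standard_normal((d, d))      # Gaussian sketch matrix for the sign step
packed = prod_quantize(x, rot, S, bits=2)
print("exact:", q @ x)
print("estimate:", prod_inner_product(q, *packed, rot, S, bits=2))
```

Neither stage looks at the data beforehand: the rotation, sketch matrix, and codebook are all fixed up front. Averaged over many vectors, the relative squared error of the scalar stage in this sketch should land in the neighborhood of the 1-bit and 2-bit figures quoted above. Averaging the inner-product estimate over many independent sketch matrices should approach the exact value, which is the unbiasedness property the prod variant targets.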
Why it matters
For developers running LLMs, KV cache size is often the memory bottleneck. Google says TurboQuant cut that footprint by at least 6x and sped up attention-logit computation by up to 8x on Nvidia H100 GPUs compared with unquantized 32-bit keys.
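
A back-of-the-envelope calculation shows how bits per channel translate into cache size. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128) and a 16-bit baseline, neither of which comes from Google's report, and it ignores per-vector overhead such as stored norms. The 104,000-token context and the 3.5 bits per channel figure are taken from the table above.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_channel):
    """Approximate KV-cache size in bytes, ignoring per-vector metadata."""
    channels = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return channels * bits_per_channel / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(tokens=104_000, layers=32, kv_heads=8, head_dim=128)

baseline = kv_cache_bytes(bits_per_channel=16, **shape)    # assumed 16-bit baseline
compressed = kv_cache_bytes(bits_per_channel=3.5, **shape)

print(f"baseline:   {baseline / 2**30:.2f} GiB")
print(f"compressed: {compressed / 2**30:.2f} GiB")
print(f"ratio:      {baseline / compressed:.2f}x")
```

Under these assumptions the ratio is simply 16 / 3.5, about 4.6x, regardless of model shape. A different baseline precision or bit budget changes the figure proportionally.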

The bigger point is that TurboQuant is online and data-oblivious, so it avoids the offline calibration and codebook training many older quantization schemes need. That makes it easier to slot into serving stacks for long-context chat, retrieval, and vector search.
The open question is how much of Google’s result holds across different models, workloads, and hardware. The method looks strong on paper and in Google’s own tests, but real-world adoption will depend on implementation cost and whether the memory savings show up outside benchmark runs.