TurboQuant cuts KV cache memory 6x in Google tests

OraCore Editors

[RSCH] June 8, 2026 · 3 min read

Google Research says TurboQuant compresses KV caches by more than 4x in long-context tests with no measured quality loss, and cuts KV-cache memory by at least 6x.

TurboQuant is Google Research’s 2025 vector-quantization method for compressing KV caches and embeddings.

TurboQuant is an online method: it compresses high-dimensional vectors as they arrive while preserving their geometry, meaning distances and inner products. In tests on long-context LLM workloads, the team said it matched a full-precision baseline while compressing the KV cache by more than 4x.

Metric	Value
Proposal year	2025
KV-cache memory reduction	At least 6x
Attention-logit speedup on H100	Up to 8x
Compression in long-context tests	More than 4x
KV-cache quality threshold	3.5 bits per channel
Benchmark context length	4,000 to 104,000 tokens

What changed

TurboQuant was proposed by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni in the paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate.” The method targets three places where vector storage gets expensive: LLM inference, key-value cache compression, and nearest-neighbor search.

The algorithm comes in two modes. TurboQuant_mse minimizes mean squared error, while TurboQuant_prod targets unbiased inner-product estimates. Both apply a random rotation followed by scalar quantization; the prod variant adds a one-bit Quantized Johnson–Lindenstrauss (QJL) step to correct the residual error.
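The rotate-then-quantize recipe behind the MSE mode can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, inputs are assumed to be unit-norm, and the codebook is fitted with Lloyd's algorithm on standard Gaussian samples as a stand-in for the paper's own codebook construction, which this article does not describe.

```python
import numpy as np


def random_rotation(d, rng):
    """Draw a random d x d orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix so the draw is uniform over orthogonal matrices


def fit_codebook(bits, rng, n_samples=100_000, iters=50):
    """Assumption: fit 2**bits scalar levels to N(0, 1) samples with Lloyd's algorithm."""
    samples = rng.standard_normal(n_samples)
    levels = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2   # nearest-level decision boundaries
        cell = np.searchsorted(edges, samples)   # cell index of every sample
        levels = np.array([samples[cell == k].mean() for k in range(2**bits)])
    return levels


def quantize(x, rotation, levels):
    """Rotate unit-norm rows of x, then store one codebook index per coordinate."""
    d = x.shape[-1]
    y = (x @ rotation.T) * np.sqrt(d)            # rescaled coordinates are roughly N(0, 1)
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, y).astype(np.uint8)


def dequantize(codes, rotation, levels):
    """Look up the levels, undo the scaling, and rotate back."""
    d = codes.shape[-1]
    return (levels[codes] / np.sqrt(d)) @ rotation


def measured_mse(bits, d=128, n=256, seed=0):
    """Average squared reconstruction error over n random unit vectors."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    rotation = random_rotation(d, rng)
    levels = fit_codebook(bits, rng)
    x_hat = dequantize(quantize(x, rotation, levels), rotation, levels)
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))


for b in (1, 2, 3, 4):
    print(b, round(measured_mse(b), 3))
```

Note that the codebook here depends only on the bit width, never on the data being compressed. With unit-norm inputs, the 1-bit error in this sketch should land close to 1 - 2/π, or roughly 0.36, and the error should fall as bits are added.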

TurboQuant_mse stores each rotated coordinate with a scalar codebook.
TurboQuant_prod adds a sign sketch plus the residual norm.
The paper reports distortion shrinking with bit width, with example MSE values near 0.36, 0.117, 0.03, and 0.009 at 1 to 4 bits.
Google Research said it tested TurboQuant on LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval.
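The sign sketch and residual norm used by the prod mode can be illustrated with a one-bit Quantized Johnson–Lindenstrauss estimator. The following is a minimal sketch under stated assumptions: the function names, the Gaussian projection, and the sketch size `m` are illustrative choices, and the residual is an arbitrary vector standing in for whatever error the scalar quantizer leaves behind.

```python
import numpy as np


def qjl_encode(residual, projection):
    """Keep only the signs of the projected residual, plus its Euclidean norm."""
    signs = np.sign(projection @ residual).astype(np.int8)
    return signs, float(np.linalg.norm(residual))


def qjl_inner_product(query, signs, norm, projection):
    """Estimate <query, residual> from the sign bits and the stored norm.

    For a Gaussian row s, E[(s . q) * sign(s . r)] = sqrt(2 / pi) * <q, r> / ||r||,
    so rescaling by sqrt(pi / 2) * ||r|| and averaging over rows is unbiased.
    """
    m = projection.shape[0]
    return float(np.sqrt(np.pi / 2) * norm / m * ((projection @ query) @ signs))


rng = np.random.default_rng(0)
d, m = 64, 20_000  # m is a hypothetical sketch size, kept large so the demo is tight
projection = rng.standard_normal((m, d))

residual = rng.standard_normal(d)
residual *= 0.3 / np.linalg.norm(residual)   # pretend the quantizer left a small error
query = rng.standard_normal(d)
query /= np.linalg.norm(query)

signs, norm = qjl_encode(residual, projection)
estimate = qjl_inner_product(query, signs, norm, projection)
exact = float(query @ residual)
print(f"exact={exact:.4f} estimate={estimate:.4f}")
```

The query is projected at full precision while the stored side keeps one sign bit per projected coordinate and a single norm, which is what makes the correction cheap to store. A smaller `m` saves more memory at the cost of a noisier estimate.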

Why it matters

For developers running LLMs, KV cache size is often the memory bottleneck. Google says TurboQuant cut that footprint by at least 6x and improved attention-logit computation by up to 8x on Nvidia H100 GPUs compared with unquantized 32-bit keys.
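To see how bits per channel translate into memory, a back-of-the-envelope calculation helps. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128) and a 16-bit baseline cache, neither of which comes from Google's report, and it ignores any per-vector metadata such as stored norms.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_channel):
    """Bytes needed for keys and values across all layers, ignoring metadata overhead."""
    channels = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return channels * bits_per_channel / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(tokens=104_000, layers=32, kv_heads=8, head_dim=128)

baseline = kv_cache_bytes(bits_per_channel=16, **shape)    # assumed 16-bit cache
compressed = kv_cache_bytes(bits_per_channel=3.5, **shape)

print(f"baseline:   {baseline / 2**30:.2f} GiB")
print(f"compressed: {compressed / 2**30:.2f} GiB")
print(f"ratio:      {baseline / compressed:.2f}x")
```

Under these assumptions the ratio is simply 16 / 3.5, or about 4.6x, whatever the model shape. A different baseline precision or extra metadata would change it.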

The bigger point is that TurboQuant is online and data-oblivious, so it avoids the offline calibration and codebook training many older quantization schemes need. That makes it easier to slot into serving stacks for long-context chat, retrieval, and vector search.

The open question is how much of Google’s result holds across different models, workloads, and hardware. The method looks strong on paper and in Google’s own tests, but real-world adoption will depend on implementation cost and whether the memory savings show up outside benchmark runs.
