OraCore Editors

TurboQuant cuts KV cache memory 6x in Google tests

Google Research says TurboQuant compresses KV caches by more than 4x with no measured loss on long-context tests, and reports at least 6x less KV-cache memory.

TurboQuant is Google Research’s 2025 vector-quantization method for compressing KV caches and embeddings.

Google Research’s TurboQuant is a 2025 online vector-quantization method built to shrink high-dimensional vectors while preserving the distances and inner products that downstream computation relies on. In long-context LLM tests, the team said it matched a full-precision baseline while delivering more than 4x compression.

Item                                 Value
Proposal year                        2025
KV-cache memory reduction            At least 6x
Attention-logit speedup on H100      Up to 8x
Compression in long-context tests    More than 4x
KV-cache quality threshold           3.5 bits per channel
Benchmark context length             4,000 to 104,000 tokens

What changed

TurboQuant was proposed by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni in the paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate.” The method targets three places where vector storage gets expensive: LLM inference, key-value cache compression, and nearest-neighbor search.

The algorithm comes in two modes. TurboQuant_mse minimizes mean squared error, while TurboQuant_prod targets unbiased inner-product estimates. Both apply a random rotation followed by per-coordinate scalar quantization; the prod variant then applies a one-bit Quantized Johnson–Lindenstrauss (QJL) step to the residual to correct the remaining error.
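The rotate-then-quantize idea behind the MSE-oriented mode can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the uniform codebook, the 3/sqrt(d) clipping range, and every function name here are placeholders chosen for the example, so the printed errors will not match the paper's reported values.

```python
import math
import random


def random_rotation(d, rng):
    """Random orthogonal d x d matrix: Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:  # remove components along rows already accepted
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def make_codebook(bits, d):
    """ASSUMPTION: uniform grid of 2**bits centers, a stand-in codebook.

    Coordinates of a rotated unit vector have standard deviation of about
    1/sqrt(d), so the grid covers roughly three standard deviations.
    """
    levels = 2 ** bits
    c = 3.0 / math.sqrt(d)
    step = 2.0 * c / levels
    return [-c + step * (i + 0.5) for i in range(levels)]


def quantize(x, rot, codebook):
    """Rotate x, then store the nearest codebook index per coordinate."""
    y = matvec(rot, x)
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - v))
            for v in y]


def dequantize(codes, rot, codebook):
    """Look up centers, then undo the rotation with the transpose."""
    y_hat = [codebook[i] for i in codes]
    rot_t = [list(col) for col in zip(*rot)]
    return matvec(rot_t, y_hat)


def squared_error(x, bits, seed=0):
    """Squared reconstruction error for one vector at a given bit width."""
    rot = random_rotation(len(x), random.Random(seed))
    cb = make_codebook(bits, len(x))
    x_hat = dequantize(quantize(x, rot, cb), rot, cb)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


# Demo on one random unit vector in 32 dimensions.
demo_rng = random.Random(1)
raw = [demo_rng.gauss(0.0, 1.0) for _ in range(32)]
raw_norm = math.sqrt(sum(a * a for a in raw))
x = [a / raw_norm for a in raw]
for bits in (1, 2, 3, 4):
    print(bits, round(squared_error(x, bits), 4))
```

Note that the rotation and the codebook depend only on the seed and the dimension, never on the data, so each vector can be quantized on arrival with no calibration pass.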

  • TurboQuant_mse stores each rotated coordinate as an index into a scalar codebook.
  • TurboQuant_prod additionally stores a sign sketch of the residual plus the residual norm.
  • The paper reports distortion shrinking with bit width, with example MSE values near 0.36, 0.117, 0.03, and 0.009 at 1, 2, 3, and 4 bits respectively, roughly a 3x to 4x drop per added bit.
  • Google Research said it tested TurboQuant on LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval.
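The "sign sketch plus residual norm" idea can be illustrated with a one-bit Gaussian sign sketch, the standard construction behind Quantized Johnson–Lindenstrauss estimators. The sketch below is an assumption-laden illustration rather than the paper's code: the function names, the seed-based regeneration of projections, and the projection count m are all hypothetical choices, and m is set very large here only so that the estimate's convergence is easy to see.

```python
import math
import random


def sign_sketch(r, m, seed):
    """Return (signs, norm): one sign bit per Gaussian projection of r,
    plus the Euclidean norm of r."""
    rng = random.Random(seed)
    norm = math.sqrt(sum(a * a for a in r))
    signs = []
    for _ in range(m):
        g = [rng.gauss(0.0, 1.0) for _ in range(len(r))]
        signs.append(1 if sum(a * b for a, b in zip(g, r)) >= 0 else -1)
    return signs, norm


def estimate_inner_product(q, signs, norm, seed):
    """Estimate <q, r> from the sketch of r.

    The same projections are regenerated from the seed, so q must have
    the same length as r. The query q itself stays unquantized.
    """
    rng = random.Random(seed)
    total = 0.0
    for s in signs:
        g = [rng.gauss(0.0, 1.0) for _ in range(len(q))]
        total += s * sum(a * b for a, b in zip(g, q))
    # For Gaussian g and unit u, E[sign(g.u) * (g.q)] = sqrt(2/pi) * <q, u>,
    # so the sqrt(pi/2) factor makes the estimate unbiased.
    return math.sqrt(math.pi / 2.0) * norm * total / len(signs)


# Demo: r plays the role of a residual vector, q is a query.
r = [0.3, -0.2, 0.5, 0.1, -0.4, 0.2, 0.0, 0.6]
q = [0.1, 0.4, -0.3, 0.2, 0.5, -0.1, 0.3, 0.2]
exact = sum(a * b for a, b in zip(q, r))
signs, norm = sign_sketch(r, m=10_000, seed=7)
approx = estimate_inner_product(q, signs, norm, seed=7)
print("exact:", round(exact, 4), "estimate:", round(approx, 4))
```

Only the sign bits and a single norm are stored per vector, which is why this kind of correction adds about one bit per projection.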

Why it matters

For developers running LLMs, KV cache size is often the memory bottleneck. Google says TurboQuant cut that footprint by at least 6x and improved attention-logit computation by up to 8x on Nvidia H100 GPUs compared with unquantized 32-bit keys.
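To make the memory figures concrete, the arithmetic below estimates KV-cache size at the 104,000-token upper end of the benchmark range, comparing 16-bit storage with 3.5 bits per channel. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration and does not come from Google's tests; the estimate also ignores any per-vector overhead such as stored norms.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value):
    """Approximate KV-cache size in bytes for one sequence."""
    values = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys and values
    return values * bits_per_value / 8


# ASSUMPTION: hypothetical model shape, not taken from the article.
cfg = dict(layers=32, kv_heads=8, head_dim=128, tokens=104_000)

fp16 = kv_cache_bytes(bits_per_value=16, **cfg)
q35 = kv_cache_bytes(bits_per_value=3.5, **cfg)

print(f"16-bit:  {fp16 / 2**30:.2f} GiB")
print(f"3.5-bit: {q35 / 2**30:.2f} GiB")
print(f"ratio:   {fp16 / q35:.2f}x")
```

The ratio depends only on the bit widths: 16 / 3.5 is about 4.57, and 32 / 3.5 is about 9.14. How those map onto the headline figures depends on the baseline precision and storage overhead, which the article does not spell out.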

The bigger point is that TurboQuant is online and data-oblivious, so it avoids the offline calibration and codebook training many older quantization schemes need. That makes it easier to slot into serving stacks for long-context chat, retrieval, and vector search.

The open question is how much of Google’s result holds across different models, workloads, and hardware. The method looks strong on paper and in Google’s own tests, but real-world adoption will depend on implementation cost and whether the memory savings show up outside benchmark runs.