Tag
vector quantization
Vector quantization compresses high-dimensional embeddings into compact codes, reducing memory and bandwidth in LLM KV caches, vector search, and inference pipelines. Recent work such as TurboQuant focuses on online, accelerator-friendly schemes that balance MSE, inner-product distortion, and throughput.
3 articles

Research/Jun 8
TurboQuant cuts KV cache memory 6x in Google tests
Google Research says TurboQuant compresses KV caches by more than 4x, cutting memory use by up to 6x with no accuracy loss on long-context tests.

Research/Apr 29
TurboQuant brings near-optimal online vector quantization
TurboQuant is an online, accelerator-friendly vector quantizer that targets near-optimal MSE and inner-product distortion.

Research/Apr 3
Google's TurboQuant cuts LLM memory costs
Google says TurboQuant uses QJL and PolarQuant to shrink memory use through vector quantization and speed up LLM inference by up to 8x.