[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-turboquant-cuts-kv-cache-memory-6x-google-tests-en":3,"article-related-turboquant-cuts-kv-cache-memory-6x-google-tests-en":30,"series-research-9f0c9505-6d75-411c-ba46-2382e8f295a5":83},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"9f0c9505-6d75-411c-ba46-2382e8f295a5","turboquant-cuts-kv-cache-memory-6x-google-tests-en","TurboQuant cuts KV cache memory 6x in Google tests","\u003Cp data-speakable=\"summary\">\u003Ca href=\"\u002Ftag\u002Fturboquant\">TurboQuant\u003C\u002Fa> is Google Research’s 2025 vector-quantization method for compressing KV caches and embeddings.\u003C\u002Fp>\u003Cp>Google Research’s \u003Ca href=\"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTurboQuant\" target=\"_blank\" rel=\"noopener\">TurboQuant\u003C\u002Fa> is a 2025 online vector-quantization method built to shrink high-dimensional vectors without breaking their structure. 
In tests on long-context LLM workloads, the team said it matched a full-precision baseline while delivering more than 4x compression.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Item\u003C\u002Fth>\u003Cth>Value\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Proposal year\u003C\u002Ftd>\u003Ctd>2025\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>KV-cache memory reduction\u003C\u002Ftd>\u003Ctd>At least 6x\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Attention-logit speedup on H100\u003C\u002Ftd>\u003Ctd>Up to 8x\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Compression in long-context tests\u003C\u002Ftd>\u003Ctd>More than 4x\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>KV-cache quality threshold\u003C\u002Ftd>\u003Ctd>3.5 bits per channel\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Benchmark context length\u003C\u002Ftd>\u003Ctd>4,000 to 104,000 tokens\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>What changed\u003C\u002Fh2>\u003Cp>TurboQuant was proposed by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni in the paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate.” The method targets three places where vector storage gets expensive: LLM \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa>, key-value cache compression, and nearest-neighbor search.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780906679116-fqdo.png\" alt=\"TurboQuant cuts KV cache memory 6x in Google tests\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The algorithm comes in two modes. TurboQuant mse optimizes mean squared error, while TurboQuant prod aims at unbiased inner-product estimates. 
Both versions use a random rotation, then scalar quantization; the prod variant adds a one-bit Quantized Johnson–Lindenstrauss step to correct the residual error.\u003C\u002Fp>\u003Cul>\u003Cli>TurboQuant mse stores each rotated coordinate with a scalar codebook.\u003C\u002Fli>\u003Cli>TurboQuant prod adds a sign sketch plus the residual norm.\u003C\u002Fli>\u003Cli>The paper reports distortion shrinking with bit width, with example MSE values near 0.36, 0.117, 0.03, and 0.009 at 1 to 4 bits.\u003C\u002Fli>\u003Cli>Google Research said it tested TurboQuant on LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Why it matters\u003C\u002Fh2>\u003Cp>For developers running LLMs, \u003Ca href=\"\u002Ftag\u002Fkv-cache\">KV cache\u003C\u002Fa> size is often the memory bottleneck. Google says TurboQuant cut that footprint by at least 6x and improved attention-logit computation by up to 8x on \u003Ca href=\"\u002Ftag\u002Fnvidia\">Nvidia\u003C\u002Fa> H100 GPUs compared with unquantized 32-bit keys.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780906686210-41g2.png\" alt=\"TurboQuant cuts KV cache memory 6x in Google tests\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The bigger point is that TurboQuant is online and data-oblivious, so it avoids the offline calibration and codebook training many older quantization schemes need. That makes it easier to slot into serving stacks for long-context chat, retrieval, and vector search.\u003C\u002Fp>\u003Cp>The open question is how much of Google’s result holds across different models, workloads, and hardware. 
The method looks strong on paper and in Google’s own tests, but real-world adoption will depend on implementation cost and whether the memory savings show up outside \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> runs.\u003C\u002Fp>","Google Research says TurboQuant cuts KV-cache memory by at least 6x and matches a full-precision baseline at more than 4x compression on long-context tests.","en.wikipedia.org","https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTurboQuant",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780906679116-fqdo.png","research","en","6f25a29c-cbb8-4f53-9af7-1656b394333a",[17,18,19,20,21],"TurboQuant","KV cache","vector quantization","Google Research","LLM inference",[23,24,25],"Google’s TurboQuant compresses KV caches and embeddings with online vector quantization.","The paper reports more than 4x compression while matching a full-precision baseline on long-context tests.","Google says the method can cut KV-cache memory by at least 6x and speed up attention-logit computation by up to 8x on H100 GPUs.",0,"2026-06-08T08:17:22.276769+00:00","2026-06-08T08:17:22.268+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":42,"relatedPosts":46},[32,34,36,38,40],{"name":20,"slug":33},"google-research",{"name":18,"slug":35},"kv-cache",{"name":21,"slug":37},"llm-inference",{"name":17,"slug":39},"turboquant",{"name":19,"slug":41},"vector-quantization",{"id":15,"slug":43,"title":44,"language":45},"turboquant-cuts-kv-cache-memory-6x-google-tests-zh","TurboQuant 在 Google 測試中省下 6x KV 快取","zh",[47,53,59,65,71,77],{"id":48,"slug":49,"title":50,"cover_image":51,"image_url":51,"created_at":52,"category":13},"1d84a671-4772-43ea-af56-3d447893a94c","memdreamer-long-video-understanding-memory-retrieval-en","MemDreamer tackles long-video 
overload","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780902190707-ajbq.png","2026-06-08T07:02:32.833899+00:00",{"id":54,"slug":55,"title":56,"cover_image":57,"image_url":57,"created_at":58,"category":13},"0984f351-871a-41a6-8093-c8b600fb3555","agentopia-10-year-agent-society-simulation-en","Agentopia simulates 10 years of agent society","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780901285014-6rbt.png","2026-06-08T06:47:32.43537+00:00",{"id":60,"slug":61,"title":62,"cover_image":63,"image_url":63,"created_at":64,"category":13},"c89012a2-8d2a-4abc-8325-2a6249828718","llms-stumble-counterintuitive-probability-en","LLMs stumble on counterintuitive probability","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780900377596-25f1.png","2026-06-08T06:32:29.37299+00:00",{"id":66,"slug":67,"title":68,"cover_image":69,"image_url":69,"created_at":70,"category":13},"e17d7e2f-2b15-493b-9bed-fe95abc7a20d","bento-webassembly-memory-compartments-en","Bento turns WebAssembly memory into compartments","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780811290637-auhc.png","2026-06-07T05:47:46.129275+00:00",{"id":72,"slug":73,"title":74,"cover_image":75,"image_url":75,"created_at":76,"category":13},"99349700-bdd6-4a02-9354-17ff20598452","bis-stablecoin-usable-buffers-regulation-en","BIS turns stablecoin rules into usable 
buffers","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780737504361-by41.png","2026-06-06T09:17:56.826856+00:00",{"id":78,"slug":79,"title":80,"cover_image":81,"image_url":81,"created_at":82,"category":13},"5cf69bca-6c4c-46e0-a4b7-b0a59835c548","prevent-catastrophic-forgetting-llm-fine-tuning-en","How to Prevent Catastrophic Forgetting in LLM Fine-Tuning","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780730282480-iwp2.png","2026-06-06T07:17:32.623791+00:00",[84,89,94,99,104,109,114,119,124,129],{"id":85,"slug":86,"title":87,"created_at":88},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving 
Styles","2026-03-28T14:54:26.148181+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":125,"slug":126,"title":127,"created_at":128},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":130,"slug":131,"title":132,"created_at":133},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]