TurboQuant

TurboQuant targets the KV-cache bottleneck in LLM inference, using low-bit and vector quantization to reduce memory pressure and server cost. Articles under this tag also cover QJL, PolarQuant, benchmark fairness, and citation disputes.

25 articles

Industry News/Jun 29

TurboQuant cuts LLM memory use without retraining

5 ways TurboQuant shrinks KV cache memory and speeds LLM inference, with near-lossless results around 3–4 bits on retrieval benchmarks.

Industry News/Jun 25

AtomicBot’s llama.cpp fork boosts throughput on two fronts

4 ways AtomicBot’s llama.cpp fork speeds up Gemma 4 and Qwen 3.6, with matrix-bench gains of 30–50% on the right setup.

Research/Jun 19

TurboQuant does not hurt search quality at equal byte budgets

TurboQuant cuts vector memory by about 20× without meaningful search-quality loss when compared at equal bytes.

Industry News/Jun 15

TurboVec cuts 10M-vector RAM to 4GB

TurboVec compresses 10M vectors from 31GB to 4GB and removes training from vector search.

Industry News/Jun 13

TurboQuant on AMD GPUs cuts KV-cache latency

TurboQuant on AMD GPUs improves long-context LLM serving with up to 3.6x speedup and far lower KV-cache pressure.

Industry News/Jun 12

TurboQuant makes long-context AI much cheaper

4 ways TurboQuant’s 100x KV cache cut could lower long-context AI costs, ease GPU needs, and change model serving.

Research/Jun 8

TurboQuant cuts KV cache memory 6x in Google tests

Google Research says TurboQuant compresses KV caches by over 4x, with up to 6x less memory and no loss on long-context tests.

Blockchain & Web3/Jun 4

Tether’s TurboQuant cuts AI memory use 5x

Tether released TurboQuant in QVAC SDK 0.12.0, claiming up to 5x lower AI memory use for local sessions on laptops and phones.

Tools & Apps/Jun 4

Why Tether Is Right to Push Local AI Memory Into Everyday Devices

Tether’s TurboQuant matters because it makes long-context AI practical on local devices, not just in data centers.

Industry News/May 31

5 TurboQuant lessons for vector search teams

5 takeaways on Qdrant TurboQuant: how rotation changes compression, where recall holds up, and when safer quantizers fit better.

Industry News/May 28

Memory Stocks Face a New AI Reality Check

Memory chip stocks are soaring on AI demand, but investors warn the cycle can turn fast if supply rises or model efficiency improves.

AI Agent/May 27

Why Verkor’s TurboQuant silicon IP matters more than the headline says

Verkor’s TurboQuant accelerator is a real step for LLM inference, but the bigger story is how quickly algorithm ideas are becoming silicon IP.

Tools & Apps/May 23

Why llama.cpp should treat TurboQuant as the new default path

TurboQuant is the right direction for llama.cpp because asymmetric KV compression cuts memory without breaking compatibility.

Tools & Apps/May 20

TurboQuant turns vLLM KV cache into 3-bit storage

I break down TurboQuant’s vLLM cache compression and give you a copy-ready setup for 3-bit KV cache and fallback paths.

Tools & Apps/May 20

Why KV-cache compression will decide edge AI inference

TurboQuant-style KV-cache compression is the real bottleneck-breaker for edge AI inference.

Industry News/May 20

5 KV cache takeaways for llama.cpp users

5 takeaways from TurboQuant: under-3-bit KV cache compression, memory savings, and the tradeoffs llama.cpp users should watch.

Research/May 15

TurboQuant and the SEO Shift for Small Sites

TurboQuant is a rumored Google search system that could widen the pool of pages ranked, giving smaller sites a better shot.

Research/May 15

TurboQuant vs FP8: vLLM’s first broad test

vLLM found FP8 KV-cache quantization beats TurboQuant on speed, while TurboQuant’s strongest variants hurt accuracy.

Research/May 6

Why TurboQuant changes the KV cache debate

TurboQuant makes KV cache compression a theoretical win, not just an engineering trick.

Research/Apr 29

TurboQuant, EDEN, and the citation fight

TurboQuant’s KV-cache quantization claims are under fire: EDEN authors say the paper reuses older ideas, uses weaker scales, and relies on shaky benchmarks.

Research/Apr 3

TurboQuant cuts memory use 6x without accuracy loss

Google Research’s TurboQuant claims 6x less memory and 8x faster inference with no accuracy loss, jolting AI inference economics.

Research/Apr 3

TurboQuant Explained: Why Google’s New Paper Matters

Google’s TurboQuant paper targets KV cache bottlenecks with lower-bit quantization, aiming to cut LLM memory use and inference costs.

Research/Apr 3

Google's TurboQuant Cuts LLM Memory Costs

Google says TurboQuant uses QJL and PolarQuant to shrink vector-quantization memory and speed up LLM inference by up to 8x.

Tools & Apps/Apr 3

TurboQuant, Fast Cold Starts, and Rust on GPUs

TurboQuant cuts KV cache use 4.6x, GPU state restoration slashes cold starts, and Rust is moving deeper into CUDA work.

Research/Apr 2

TurboQuant Won’t Fix the Memory Crunch

Google’s TurboQuant can cut KV-cache memory use 6x, but longer contexts may keep DRAM and NAND demand climbing.