Tag
LLM inference
LLM inference covers the runtime side of large models: latency, throughput, memory footprint, and how KV cache, quantization, and accelerator-friendly kernels shape deployment. It matters because these choices determine whether a model is practical on GPUs, servers, or edge devices.
16 articles

TurboQuant cuts LLM memory use without retraining
5 ways TurboQuant shrinks KV cache memory and speeds LLM inference, with near-lossless results around 3–4 bits on retrieval benchmarks.

OpenAI’s Jalapeño chip points to faster LLM inference
1 chip, 1 partnership, and 1 new compute platform aimed at making LLM inference faster, more reliable, and more available.

V100 raw GGUF vs prepacked weight cache
A comparison of raw GGUF Q4_K kernels and prepacked weight caches for decode inference on the V100.

TurboQuant makes long-context AI much cheaper
4 ways TurboQuant’s 100x KV cache cut could lower long-context AI costs, ease GPU needs, and change model serving.

BentoML turns model serving into Python APIs
I break down BentoML’s serving model and give you a copy-ready template for OpenAI-compatible model APIs.

TurboQuant cuts KV cache memory 6x in Google tests
Google Research says TurboQuant compresses KV caches by more than 4x, and by up to 6x, with no loss on long-context tests.

Tensormesh raises $20M to cut LLM memory waste
Tensormesh raised $20 million from Nvidia, AMD and CoreWeave to reduce LLM reprocessing with KV caching.

Why Verkor’s TurboQuant silicon IP matters more than the headline says
Verkor’s TurboQuant accelerator is a real step for LLM inference, but the bigger story is how quickly algorithm ideas are becoming silicon IP.

MARLIN tackles greener LLM inference in datacenters
MARLIN uses multi-agent game-theoretic RL to make cloud LLM inference more sustainable.

Taming Black-Box LLM Inference Scheduling
A scheduling approach for black-box LLM inference that uses predicted output lengths to reduce queueing friction at scale.

SAGA makes AI agent GPU scheduling workflow-aware
SAGA argues GPU schedulers should treat an agent’s chained LLM calls as one workflow, not isolated requests.

SpecKV tunes speculative decoding on the fly
SpecKV adapts speculative decoding’s token budget per step, using draft-model signals to beat a fixed gamma across compression settings.

TurboQuant brings near-optimal online vector quantization
TurboQuant is an online, accelerator-friendly vector quantizer that targets near-optimal MSE and inner-product distortion.

TurboQuant, EDEN, and the citation fight
TurboQuant’s KV-cache quantization claims are under fire: EDEN authors say the paper reuses older ideas, uses weaker scales, and relies on shaky benchmarks.

TurboQuant Explained: Why Google’s New Paper Matters
Google’s TurboQuant paper targets KV cache bottlenecks with lower-bit quantization, aiming to cut LLM memory use and inference costs.

Google's TurboQuant Cuts LLM Memory Costs
Google says TurboQuant uses QJL and PolarQuant to shrink vector-quantization memory and speed up LLM inference by up to 8x.