Tag

LLM inference

LLM inference covers the runtime side of large models: latency, throughput, memory footprint, and how KV cache, quantization, and accelerator-friendly kernels shape deployment. It matters because these choices determine whether a model is practical on GPUs, servers, or edge devices.
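As a rough illustration of why KV cache size and quantization bit-width dominate the memory footprint, the sketch below estimates cache size from model shape. The function name and the model dimensions are hypothetical, picked only to make the arithmetic easy to follow, and the estimate ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bits_per_value=16):
    """Approximate KV cache size in bytes (no quantization metadata)."""
    # The factor of 2 accounts for storing both keys and values.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
int4 = kv_cache_bytes(**shape, bits_per_value=4)
print(f"16-bit: {fp16 / 2**30:.2f} GiB, 4-bit: {int4 / 2**30:.2f} GiB")
# prints "16-bit: 4.00 GiB, 4-bit: 1.00 GiB"
```

Under these assumptions the cache grows linearly with context length and batch size, which is why lower-bit KV cache formats matter most for long-context serving.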

16 articles

Industry News/Jun 29

TurboQuant cuts LLM memory use without retraining

5 ways TurboQuant shrinks KV cache memory and speeds LLM inference, with near-lossless results around 3–4 bits on retrieval benchmarks.

Industry News/Jun 28

OpenAI’s Jalapeño chip points to faster LLM inference

1 chip, 1 partnership, and 1 new compute platform aimed at making LLM inference faster, more reliable, and more available.

Industry News/Jun 14

V100 raw GGUF vs prepacked weight cache

A comparison of raw GGUF Q4_K kernels and prepacked weight caches for decode inference on the V100.

Industry News/Jun 12

TurboQuant makes long-context AI much cheaper

4 ways TurboQuant’s 100x KV cache cut could lower long-context AI costs, ease GPU needs, and change model serving.

Tools & Apps/Jun 10

BentoML turns model serving into Python APIs

I break down BentoML’s serving model and give you a copy-ready template for OpenAI-compatible model APIs.

Research/Jun 8

TurboQuant cuts KV cache memory 6x in Google tests

Google Research says TurboQuant compresses KV caches by over 4x, with up to 6x less memory and no loss on long-context tests.

Industry News/May 29

Tensormesh raises $20M to cut LLM memory waste

Tensormesh raised $20 million from Nvidia, AMD and CoreWeave to reduce LLM reprocessing with KV caching.

AI Agent/May 27

Why Verkor’s TurboQuant silicon IP matters more than the headline says

Verkor’s TurboQuant accelerator is a real step for LLM inference, but the bigger story is how quickly algorithm ideas are becoming silicon IP.

Research/May 18

MARLIN tackles greener LLM inference in datacenters

MARLIN uses multi-agent game-theoretic RL to make cloud LLM inference more sustainable.

Research/May 14

Taming Black-Box LLM Inference Scheduling

A scheduling approach for black-box LLM inference that uses predicted output lengths to reduce queueing friction at scale.

Research/May 12

SAGA makes AI agent GPU scheduling workflow-aware

SAGA argues GPU schedulers should treat an agent’s chained LLM calls as one workflow, not isolated requests.

Research/May 5

SpecKV tunes speculative decoding on the fly

SpecKV adapts speculative decoding’s token budget per step, using draft-model signals to beat fixed gamma across compression settings.

Research/Apr 29

TurboQuant brings near-optimal online vector quantization

TurboQuant is an online, accelerator-friendly vector quantizer that targets near-optimal MSE and inner-product distortion.

Research/Apr 29

TurboQuant, EDEN, and the citation fight

TurboQuant’s KV-cache quantization claims are under fire: EDEN authors say the paper reuses older ideas, weaker scales, and shaky benchmarks.

Research/Apr 3

TurboQuant Explained: Why Google’s New Paper Matters

Google’s TurboQuant paper targets KV cache bottlenecks with lower-bit quantization, aiming to cut LLM memory use and inference costs.

Research/Apr 3

Google’s TurboQuant Cuts LLM Memory Costs

Google says TurboQuant combines QJL and PolarQuant to cut memory use through vector quantization and speed up LLM inference by up to 8x.