Tag

vLLM

vLLM is a high-throughput inference engine for large language models, built around PagedAttention, KV cache management, and continuous batching. It matters for chat services, RAG pipelines, batch generation, and multi-model GPU deployment.

14 articles

7 open-source AI projects developers need in 2026
Tools & Apps/Jun 28

Seven open-source AI projects are replacing paid APIs, from local inference to browser agents, and they’re already pulling huge GitHub numbers.

vLLM, SGLang, vMLX: better local LLM runtimes
Tools & Apps/Jun 25

Ollama and llama.cpp are the easy starting points, but vLLM, SGLang, vMLX, MLC-LLM, and ExLlamaV3 are better suited to serious local AI workflows.

UltraQuant: 4-bit KV caching for long agents
Research/Jun 25

UltraQuant shows 4-bit KV caching can speed long, multi-turn agent serving while keeping more context resident.

llama.cpp vs vLLM: Choosing the right local LLM engine
Industry News/Jun 22

llama.cpp and vLLM are local LLM inference engines for different hardware and traffic patterns.

Deploy MiniMax M3 with vLLM OpenAI API
Tools & Apps/Jun 20

Run MiniMax M3 locally with vLLM and expose an OpenAI-compatible API.

Red Hat AI turns telco AI into a stack
Industry News/Jun 20

Mavenir and Red Hat show how telcos can package AI with MLOps, vLLM inference, and AgentOps on Kubernetes.

Self-host MiniMax M3 on GPU cloud
Model Releases/Jun 18

MiniMax M3 brings 229.9B MoE weights, 1M context, and multimodal output, but it needs serious GPU memory to run.

Open-source AI software is winning on infrastructure, not hype
Tools & Apps/Jun 17

Open-source AI software is winning because it now powers the core infrastructure for building, serving, and shipping models.

TurboQuant on AMD GPUs cuts KV-cache latency
Industry News/Jun 13

TurboQuant on AMD GPUs improves long-context LLM serving with up to 3.6x speedup and far lower KV-cache pressure.

TurboQuant turns vLLM KV cache into 3-bit storage
Tools & Apps/May 20

I break down TurboQuant’s vLLM cache compression and give you a copy-ready setup for 3-bit KV cache and fallback paths.

MiniMax M2 opens up cheap agentic coding
Model Releases/May 18

MiniMax open-sourced M2, a model for agents and code that costs $0.30 per million input tokens and is free for a limited time.

TurboQuant vs FP8: vLLM’s first broad test
Research/May 15

vLLM found FP8 KV-cache quantization beats TurboQuant on speed, while TurboQuant’s strongest variants hurt accuracy.

Gemma 4 assistant models get faster draft tokens
Tools & Apps/May 9

Gemma 4 E2B and E4B assistant models use centroid masking to cut lm_head work about 45x with little quality loss.

Awesome Open Source AI: the best projects list
Tools & Apps/Apr 12

This GitHub list, which has 2,486 stars, curates battle-tested open-source AI tools, models, and infra, from PyTorch to vLLM.