Tag
vLLM
vLLM is a high-throughput inference engine for large language models, built around PagedAttention, KV cache management, and continuous batching. It matters for chat services, RAG pipelines, batch generation, and multi-model GPU deployment.
14 articles

7 open-source AI projects developers need in 2026
Seven open-source AI projects, from local inference to browser agents, are replacing paid APIs, and they’re already pulling huge GitHub numbers.

vLLM, SGLang, vMLX: better local LLM runtimes
Ollama and llama.cpp are the easy starting points, but vLLM, SGLang, vMLX, MLC-LLM, and ExLlamaV3 fit serious local AI workflows.

UltraQuant: 4-bit KV caching for long agents
UltraQuant shows 4-bit KV caching can speed long, multi-turn agent serving while keeping more context resident.

llama.cpp vs vLLM: Choosing the right local LLM engine
llama.cpp and vLLM are both local LLM inference engines, but they target different hardware and traffic patterns.

Deploy MiniMax M3 with vLLM OpenAI API
Run MiniMax M3 locally with vLLM and expose an OpenAI-compatible API.

Red Hat AI turns telco AI into a stack
Mavenir and Red Hat show how telcos can package AI with MLOps, vLLM inference, and AgentOps on Kubernetes.

Self-host MiniMax M3 on GPU cloud
MiniMax M3 brings 229.9B MoE weights, 1M context, and multimodal output, but it needs serious GPU memory to run.

Open-source AI software is winning on infrastructure, not hype
Open-source AI software is winning because it now powers the core infrastructure for building, serving, and shipping models.

TurboQuant on AMD GPUs cuts KV-cache latency
TurboQuant on AMD GPUs improves long-context LLM serving with up to 3.6x speedup and far lower KV-cache pressure.

TurboQuant turns vLLM KV cache into 3-bit storage
I break down TurboQuant’s vLLM cache compression and give you a copy-ready setup for 3-bit KV cache and fallback paths.

MiniMax M2 opens up cheap agentic coding
MiniMax open-sourced M2, a model for agents and code that costs $0.30 per million input tokens and is free for a limited time.

TurboQuant vs FP8: vLLM’s first broad test
vLLM found FP8 KV-cache quantization beats TurboQuant on speed, while TurboQuant’s strongest variants hurt accuracy.

Gemma 4 assistant models get faster draft tokens
Gemma 4 E2B and E4B assistant models use centroid masking to cut lm_head work about 45x with little quality loss.

Awesome Open Source AI: the best projects list
This GitHub list curates battle-tested open-source AI tools, models, and infra, from PyTorch to vLLM, with 2,486 stars.