[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-vllm-sglang-vmlx-local-llm-runtimes-en":3,"article-related-vllm-sglang-vmlx-local-llm-runtimes-en":31,"series-tools-6b6d7ea7-7e46-49ca-9e01-ce4e55eab086":76},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":23,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"6b6d7ea7-7e46-49ca-9e01-ce4e55eab086","vllm-sglang-vmlx-local-llm-runtimes-en","vLLM, SGLang, vMLX: better local LLM runtimes","\u003Cp data-speakable=\"summary\">\u003Ca href=\"\u002Ftag\u002Fvllm\">vLLM\u003C\u002Fa>, SGLang, vMLX, MLC-LLM, and ExLlamaV3 target serious local LLM workflows beyond Ollama and llama.cpp.\u003C\u002Fp>\u003Cp>Most people start local \u003Ca href=\"\u002Ftag\u002Fllms\">LLMs\u003C\u002Fa> with \u003Ca href=\"https:\u002F\u002Follama.com\" target=\"_blank\" rel=\"noopener\">Ollama\u003C\u002Fa> or \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\" target=\"_blank\" rel=\"noopener\">llama.cpp\u003C\u002Fa>, and that still makes sense. But as soon as a model becomes part of a real workflow, the runtime matters as much as the model, especially for serving, batching, cache behavior, and hardware-specific acceleration.\u003C\u002Fp>\u003Ch2>What changed\u003C\u002Fh2>\u003Cp>The article argues that the local AI stack has split into specialized tools for different jobs. Instead of one default runtime, developers now choose based on whether they need an \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa> server, a Mac-native app layer, browser or mobile deployment, or better performance on consumer GPUs.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782397979639-yxxb.png\" alt=\"vLLM, SGLang, vMLX: better local LLM runtimes\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Here are the main alternatives highlighted:\u003C\u002Fp>\u003Cul>\u003Cli>\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\" target=\"_blank\" rel=\"noopener\">vLLM\u003C\u002Fa> for high-throughput serving, OpenAI-compatible APIs, continuous batching, and PagedAttention.\u003C\u002Fli>\u003Cli>\u003Ca href=\"https:\u002F\u002Fdocs.sglang.ai\" target=\"_blank\" rel=\"noopener\">SGLang\u003C\u002Fa> for structured generation, repeated prompt patterns, tool use, and cache reuse.\u003C\u002Fli>\u003Cli>\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx\" target=\"_blank\" rel=\"noopener\">MLX\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-lm\" target=\"_blank\" rel=\"noopener\">MLX-LM\u003C\u002Fa>, plus \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fowkin\u002Fvmlx\" target=\"_blank\" rel=\"noopener\">vMLX\u003C\u002Fa>, for Apple Silicon workflows.\u003C\u002Fli>\u003Cli>\u003Ca href=\"https:\u002F\u002Fmlc.ai\u002Fmlc-llm\" target=\"_blank\" rel=\"noopener\">MLC-LLM\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmlc-ai\u002Fweb-llm\" target=\"_blank\" rel=\"noopener\">WebLLM\u003C\u002Fa> for browsers, phones, tablets, and embedded targets.\u003C\u002Fli>\u003Cli>\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fturboderp\u002Fexllamav3\" target=\"_blank\" rel=\"noopener\">ExLlamaV3\u003C\u002Fa> for consumer GPU inference, with TabbyAPI for OpenAI-style serving.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>vLLM is positioned as the first step up when a local model needs to act like infrastructure. Its batching and cache management are aimed at multiple apps or agents hitting the same endpoint, not just one person chatting in a terminal.\u003C\u002Fp>\u003Cp>SGLang goes after similar workloads but with more emphasis on structured output. The article notes support for RadixAttention, prefill-decode disaggregation, speculative decoding, tensor and expert parallelism, and multi-LoRA batching, all aimed at repeated prompts and schema-driven responses.\u003C\u002Fp>\u003Ch2>Why it matters\u003C\u002Fh2>\u003Cp>For developers, the shift is practical: once a model backs tools, agents, RAG experiments, or multiple clients, the choice of runtime can change latency, VRAM use, and output reliability. A local \u003Ca href=\"\u002Ftag\u002Fllm\">LLM\u003C\u002Fa> that only answers prompts is easy to run; a local LLM that must serve APIs and return valid JSON is a different problem.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782397984743-h8l5.png\" alt=\"vLLM, SGLang, vMLX: better local LLM runtimes\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The market effect is also clear. Local AI is no longer one-size-fits-all, and the stack is fragmenting around hardware and deployment target. Mac users get native paths, \u003Ca href=\"\u002Ftag\u002Fnvidia\">Nvidia\u003C\u002Fa> users get optimized serving, AMD gets its own tooling, and consumer GPUs get runtimes tuned to fit memory limits instead of enterprise assumptions.\u003C\u002Fp>\u003Cp>The takeaway is simple: Ollama and llama.cpp are still the easy defaults, but serious local AI work now starts with a question about the runtime, not just the model.\u003C\u002Fp>","Ollama and llama.cpp are the easy starts, but vLLM, SGLang, vMLX, MLC-LLM, and ExLlamaV3 fit serious local AI workflows.","www.xda-developers.com","https:\u002F\u002Fwww.xda-developers.com\u002Fmost-people-ollama-llama-cpp-local-llms-tool-serious\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782397979639-yxxb.png","tools","en","6e7ee199-4aa2-4954-9891-4b18b04555b9",[17,18,19,20,21,22],"local LLMs","vLLM","SGLang","vMLX","MLC-LLM","ExLlamaV3",[24,25,26],"Ollama and llama.cpp remain the easiest entry points for local LLMs.","vLLM and SGLang fit agentic, multi-client, and structured-output workloads.","Mac, browser, mobile, and consumer-GPU use cases now have specialized runtimes.",0,"2026-06-25T14:32:28.375358+00:00","2026-06-25T14:32:28.364+00:00","a7343b93-37cc-4634-a2bc-707f6275bdb6",{"tags":32,"relatedLang":35,"relatedPosts":39},[33],{"name":18,"slug":34},"vllm",{"id":15,"slug":36,"title":37,"language":38},"vllm-sglang-vmlx-local-llm-runtimes-zh","vLLM、SGLang、vMLX：本地 LLM 新選擇","zh",[40,46,52,58,64,70],{"id":41,"slug":42,"title":43,"cover_image":44,"image_url":44,"created_at":45,"category":13},"96319aac-17d5-44af-bf31-54e890c13a55","cinevva-web-game-engine-guide-stack-en","Cinevva’s web-game engine guide turns picks into a stack","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782412408146-j3pa.png","2026-06-25T18:33:04.288326+00:00",{"id":47,"slug":48,"title":49,"cover_image":50,"image_url":50,"created_at":51,"category":13},"abc842c2-f94c-4d35-8409-132d5d48f535","cursors-continue-buy-turns-copilot-into-platform-en","Cursor’s Continue buy turns Copilot into a platform","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782411493879-bv99.png","2026-06-25T18:17:50.861849+00:00",{"id":53,"slug":54,"title":55,"cover_image":56,"image_url":56,"created_at":57,"category":13},"a899f8a7-1b6c-4bee-8e84-fb690ff2a070","update-rust-packages-ubuntu-releases-en","Update Rust packages for Ubuntu releases","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782410579036-pvm2.png","2026-06-25T18:02:36.582105+00:00",{"id":59,"slug":60,"title":61,"cover_image":62,"image_url":62,"created_at":63,"category":13},"2856a92c-81ee-42bb-9d2c-e9542c3cd27b","prompt-versioning-belongs-in-production-en","Prompt versioning belongs in production, not in docs","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782406069659-7nf8.png","2026-06-25T16:47:23.996479+00:00",{"id":65,"slug":66,"title":67,"cover_image":68,"image_url":68,"created_at":69,"category":13},"661b3b6f-39d1-4af6-b669-e81c174a62cd","best-paper-lists-turn-conference-noise-into-taste-en","Best-paper lists turn conference noise into taste","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782392615746-dxxf.png","2026-06-25T13:03:02.691816+00:00",{"id":71,"slug":72,"title":73,"cover_image":74,"image_url":74,"created_at":75,"category":13},"c8bffcbc-639b-4629-8e10-3695042d80e3","sora-chart-loan-timing-choice-en","SORA chart turns loan timing into a clean choice","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782381797082-kg89.png","2026-06-25T10:02:50.585995+00:00",[77,82,87,92,97,102,107,112,117,122],{"id":78,"slug":79,"title":80,"created_at":81},"8008f1a9-7a00-4bad-88c9-3eedc9c6b4b1","surepath-ai-mcp-policy-controls-en","SurePath AI's New MCP Policy Controls Enhance AI Security","2026-03-26T01:26:52.222015+00:00",{"id":83,"slug":84,"title":85,"created_at":86},"27e39a8f-b65d-4f7b-a875-859e2b210156","mcp-standard-ai-tools-2026-en","MCP Standard in 2026: Integrating AI Tools","2026-03-26T01:27:43.127519+00:00",{"id":88,"slug":89,"title":90,"created_at":91},"165f9a19-c92d-46ba-b3f0-7125f662921d","rag-2026-transforming-enterprise-ai-en","How RAG in 2026 is Transforming Enterprise AI","2026-03-26T01:28:11.485236+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"6a2a8e6e-b956-49d8-be12-cc47bdc132b2","mastering-ai-prompts-2026-guide-en","Mastering AI Prompts: A 2026 Guide for Developers","2026-03-26T01:29:07.835148+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"3ab2c67e-4664-4c67-a013-687a2f605814","garry-tan-open-sources-claude-code-toolkit-en","Garry Tan Open-Sources a Claude Code Toolkit","2026-03-26T08:26:20.245934+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"66a7cbf8-7e76-41d4-9bbf-eaca9761bf69","github-ai-projects-to-watch-in-2026-en","20 GitHub AI Projects to Watch in 2026","2026-03-26T08:28:09.752027+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"9f332fda-eace-448a-a292-2283951eee71","practical-github-guide-learning-ml-2026-en","A Practical GitHub Guide to Learning ML in 2026","2026-03-27T01:16:50.125678+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"1b1f637d-0f4d-42bd-974b-07b53829144d","aiml-2026-student-ai-ml-lab-repo-review-en","AIML-2026 Is a Bare-Bones Student Lab Repo","2026-03-27T01:21:51.661231+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"6d1bf3f6-e191-4d30-b55b-8a0722fa6afe","ai-trending-github-repos-and-research-feeds-en","AI Trending Tracks Repos and Research Feeds","2026-03-27T01:31:35.709532+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"010539a1-4c3a-4bd3-937a-26616422ee0d","awesome-ai-for-science-research-tools-map-en","Awesome AI for Science Is Becoming a Real Research Map","2026-03-27T01:46:50.89513+00:00"]