[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-vllm-sglang-vmlx-local-llm-runtimes-zh":3,"article-related-vllm-sglang-vmlx-local-llm-runtimes-zh":31,"series-tools-6e7ee199-4aa2-4954-9891-4b18b04555b9":76},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":23,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"6e7ee199-4aa2-4954-9891-4b18b04555b9","vllm-sglang-vmlx-local-llm-runtimes-zh","vLLM、SGLang、vMLX：本地 LLM 新選擇","\u003Cp data-speakable=\"summary\">vLLM、SGLang、vMLX、MLC-LLM 和 ExLlamaV3，正在把本地 LLM \u003Ca href=\"\u002Fnews\u002Fopenai-sora-hardware-enterprise-video-zh\">工作流\u003C\u002Fa>從單機聊天，推向可部署、可批次處理的服務形態。\u003C\u002Fp>\u003Cp>多數人接觸本地\u003Ca href=\"\u002Fnews\u002Fcodex-third-party-model-integration-guide-zh\">模型\u003C\u002Fa>，第一站仍是 \u003Ca href=\"https:\u002F\u002Follama.com\" target=\"_blank\" rel=\"noopener\">Ollama\u003C\u002Fa> 或 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\" target=\"_blank\" rel=\"noopener\">llama.cpp\u003C\u002Fa>。這兩個工具依然好用，但一旦模型要接 API、跑多個客戶端、或支援 \u003Ca href=\"\u002Ftag\u002Fagent\">agent\u003C\u002Fa>，runtime 就不再只是背景角色。\u003C\u002Fp>\u003Cp>這篇整理的重點很直接：本地 \u003Ca href=\"\u002Ftag\u002Fai-工具\">AI 工具\u003C\u002Fa>鏈已經分工。不同 runtime 各自對準伺服、結構化輸出、\u003Ca href=\"\u002Ftag\u002Fapple\">Apple\u003C\u002Fa> Silicon、瀏覽器端、手機端，或消費級 GPU，沒有單一解法能吃下所有場景。\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>項目\u003C\u002Fth>\u003Cth>數值\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>主要 runtime\u003C\u002Ftd>\u003Ctd>5 個\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Apple Silicon 路線\u003C\u002Ftd>\u003Ctd>MLX、MLX-LM、vMLX\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>瀏覽器／行動裝置路線\u003C\u002Ftd>\u003Ctd>MLC-LLM、WebLLM\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>消費級 GPU 路線\u003C\u002Ftd>\u003Ctd>ExLlamaV3\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>伺服取向路線\u003C\u002Ftd>\u003Ctd>vLLM、SGLang\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>發生了什麼\u003C\u002Fh2>\u003Cp>文章把本地 LLM 生態切成幾條明確路線。\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\" target=\"_blank\" rel=\"noopener\">vLLM\u003C\u002Fa> 主打高吞吐伺服、\u003Ca href=\"\u002Ftag\u002Fopenai\">OpenAI\u003C\u002Fa> 相容 API、continuous batching 和 PagedAttention，適合多個應用同時打同一個端點。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782397982700-oxni.png\" alt=\"vLLM、SGLang、vMLX：本地 LLM 新選擇\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>\u003Ca href=\"https:\u002F\u002Fdocs.sglang.ai\" target=\"_blank\" rel=\"noopener\">SGLang\u003C\u002Fa> 則更偏向結構化生成。它把 RadixAttention、prefill-decode disaggregation、speculative decoding、tensor 與 expert parallelism、multi-LoRA batching 放在一起，目標是重複提示詞、schema 驅動輸出與工具調用。\u003C\u002Fp>\u003Cp>如果工作負載更靠近 Apple 生態，\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx\" target=\"_blank\" rel=\"noopener\">MLX\u003C\u002Fa>、\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-lm\" target=\"_blank\" rel=\"noopener\">MLX-LM\u003C\u002Fa> 和 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fowkin\u002Fvmlx\" target=\"_blank\" rel=\"noopener\">vMLX\u003C\u002Fa> 會更合適。它們的重點不是把\u003Ca href=\"\u002Fnews\u002Fminimax-m3-open-weight-frontier-models-matter-zh\">模型\u003C\u002Fa>塞進通用伺服器，而是貼著 Apple Silicon 的記憶體與算力特性來跑。\u003C\u002Fp>\u003Cp>另一端是跨裝置部署。\u003Ca href=\"https:\u002F\u002Fmlc.ai\u002Fmlc-llm\" target=\"_blank\" rel=\"noopener\">MLC-LLM\u003C\u002Fa> 與 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmlc-ai\u002Fweb-llm\" target=\"_blank\" rel=\"noopener\">WebLLM\u003C\u002Fa> 面向瀏覽器、手機、平板與嵌入式設備；\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fturboderp\u002Fexllamav3\" target=\"_blank\" rel=\"noopener\">ExLlamaV3\u003C\u002Fa> 則鎖定消費級 GPU，並可搭配 TabbyAPI 做成 OpenAI 風格服務。\u003C\u002Fp>\u003Cul>\u003Cli>vLLM：伺服與批次吞吐優先。\u003C\u002Fli>\u003Cli>SGLang：結構化輸出與重複工作流優先。\u003C\u002Fli>\u003Cli>MLX／vMLX：Apple Silicon 原生路線。\u003C\u002Fli>\u003Cli>MLC-LLM／WebLLM／ExLlamaV3：端側與消費級硬體優先。\u003C\u002Fli>\u003C\u002Ful>\u003Cp>這代表選型邏輯已經變了。以前問的是「哪個模型能跑」，現在更常問「哪個 runtime 能把這個模型穩定地接到產品裡」。\u003C\u002Fp>\u003Ch2>為什麼重要\u003C\u002Fh2>\u003Cp>對開發者來說，runtime 會直接影響延遲、VRAM 佔用、吞吐量與輸出穩定性。當本地模型只是聊天玩具時，Ollama 或 llama.cpp 很夠用；但一旦要服務多個用戶、處理 RAG、或回傳可驗證的 JSON，runtime 的差異就會被放大。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782397995190-bz3b.png\" alt=\"vLLM、SGLang、vMLX：本地 LLM 新選擇\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>這也解釋了為何本地 AI 正在分裂成多條硬體路線。\u003Ca href=\"\u002Ftag\u002Fnvidia\">Nvidia\u003C\u002Fa> 用戶可以追求高吞吐伺服，Mac 用戶有原生路徑，AMD 與消費級 GPU 則需要更貼近記憶體限制的工具；同一個模型，在不同 runtime 上，體驗可能差很多。\u003C\u002Fp>\u003Cp>對產業來說，這不是單純的工具更新，而是部署思維的變化。模型本身越來越像可替換零件，真正拉開差距的，反而是 batching、cache、parallelism，以及 runtime 對硬體的理解。\u003C\u002Fp>\u003Cp>結論很硬：本地 LLM 已經不是「裝哪個模型」的問題，而是「你要把它放在哪種 runtime 上」的問題。\u003C\u002Fp>","本地 LLM 工具鏈開始分流。vLLM、SGLang、vMLX、MLC-LLM 與 ExLlamaV3，正把重點從「能跑」推向「怎麼跑得更快、更穩、更貼近硬體」。","www.xda-developers.com","https:\u002F\u002Fwww.xda-developers.com\u002Fmost-people-ollama-llama-cpp-local-llms-tool-serious\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782397982700-oxni.png","tools","zh","6b6d7ea7-7e46-49ca-9e01-ce4e55eab086",[17,18,19,20,21,22],"vLLM","SGLang","MLX","MLC-LLM","ExLlamaV3","本地 LLM",[24,25,26],"vLLM 與 SGLang 把本地模型推向伺服與結構化工作流。","Apple Silicon、瀏覽器、手機與消費級 GPU 各有對應 runtime。","選型重點從模型本身，轉向吞吐、延遲與硬體匹配。",0,"2026-06-25T14:32:27.846267+00:00","2026-06-25T14:32:27.83+00:00","c3c88dd2-a940-438a-b359-0e5a24562273",{"tags":32,"relatedLang":35,"relatedPosts":39},[33],{"name":17,"slug":34},"vllm",{"id":15,"slug":36,"title":37,"language":38},"vllm-sglang-vmlx-local-llm-runtimes-en","vLLM, SGLang, vMLX: better local LLM runtimes","en",[40,46,52,58,64,70],{"id":41,"slug":42,"title":43,"cover_image":44,"image_url":44,"created_at":45,"category":13},"14235e8e-b195-41fa-994b-11bea9e16942","prompt-versioning-belongs-in-production-zh","提示詞版本控管應該進生產環境，不該只放文件裡","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782406071974-d2w4.png","2026-06-25T16:47:23.368451+00:00",{"id":47,"slug":48,"title":49,"cover_image":50,"image_url":50,"created_at":51,"category":13},"02bf30a9-fa24-4adc-952b-a5d1cb4bd080","best-paper-lists-turn-conference-noise-into-taste-zh","Best-paper 清單把噪音變成品味","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782392612502-p51d.png","2026-06-25T13:03:02.033956+00:00",{"id":53,"slug":54,"title":55,"cover_image":56,"image_url":56,"created_at":57,"category":13},"f95cff7f-49e3-43af-86f7-7371f9d754cb","sora-chart-loan-timing-choice-zh","SORA 圖表把貸款時機變選擇","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782381796941-ji38.png","2026-06-25T10:02:50.036986+00:00",{"id":59,"slug":60,"title":61,"cover_image":62,"image_url":62,"created_at":63,"category":13},"07c518b2-227f-40d6-9990-04018ef74448","cccl-runtime-makes-cuda-safer-by-making-state-explicit-zh","CCCL Runtime 不是包裝層，是把 CUDA 隱性狀態改成顯性契約","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782364674604-o7eb.png","2026-06-25T05:17:25.530308+00:00",{"id":65,"slug":66,"title":67,"cover_image":68,"image_url":68,"created_at":69,"category":13},"4c48f0a8-e999-4d0c-8ab6-c710f14d6675","35-nvidia-ai-supercomputers-turn-europe-into-a-lab-zh","35台NVIDIA超算把歐洲變實驗室","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782363801851-zr5v.png","2026-06-25T05:02:57.878612+00:00",{"id":71,"slug":72,"title":73,"cover_image":74,"image_url":74,"created_at":75,"category":13},"e60761a1-aaab-4bde-9c2b-03450ba9056c","devin-ai-review-2026-benchmarks-pricing-tests-zh","Devin AI 測試與採購判讀指南","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782362875481-0ddh.png","2026-06-25T04:47:27.097641+00:00",[77,82,87,92,97,102,107,112,117,122],{"id":78,"slug":79,"title":80,"created_at":81},"855cd52f-6fab-46cc-a7c1-42195e8a0de4","surepath-real-time-mcp-policy-controls-zh","SurePath 推出即時 MCP 政策控管","2026-03-26T07:57:40.77233+00:00",{"id":83,"slug":84,"title":85,"created_at":86},"9b19ab54-edef-4dbd-9ce4-a51e4bae4ebb","mcp-in-2026-the-ai-tool-layer-teams-use-zh","2026 年 MCP：團隊真的在用的 AI 工具層","2026-03-26T08:01:46.589694+00:00",{"id":88,"slug":89,"title":90,"created_at":91},"af9c46c3-7a28-410b-9f04-32b3de30a68c","prompting-in-2026-what-actually-works-zh","2026 提示工程，真正有用的是什麼","2026-03-26T08:08:12.453028+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"05553086-6ed0-4758-81fd-6cab24b575e0","garry-tan-open-sources-claude-code-toolkit-zh","Garry Tan 開源 Claude Code 工具包","2026-03-26T08:26:20.068737+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"042a73a2-18a2-433d-9e8f-9802b9559aac","github-ai-projects-to-watch-in-2026-zh","2026 必看 20 個 GitHub AI 專案","2026-03-26T08:28:09.619964+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"a5f94120-ac0d-4483-9a8b-63590071ac6a","claude-code-vs-cursor-2026-zh","Claude Code 與 Cursor 深度對比：202…","2026-03-26T13:27:14.279193+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"0975afa1-e0c7-4130-a20d-d890eaed995e","practical-github-guide-learning-ml-2026-zh","2026 機器學習入門 GitHub 實用指南","2026-03-27T01:16:49.712576+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"bfdb467a-290f-4a80-b3a9-6f081afb6dff","aiml-2026-student-ai-ml-lab-repo-review-zh","AIML-2026：像課綱的學生實驗 Repo","2026-03-27T01:21:51.467798+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"80cabc3e-09fc-4ff5-8f07-b8d68f5ae545","ai-trending-github-repos-and-research-feeds-zh","AI Trending：把 AI 資源收成一張表","2026-03-27T01:31:35.262183+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"3ce6e6e2-bac5-463e-9f8d-45caabcc61f7","awesome-ai-for-science-research-tools-map-zh","AI 科研工具清單，開始像地圖了","2026-03-27T01:46:50.521945+00:00"]