[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-deploy-minimax-m3-with-vllm-openai-api-en":3,"article-related-deploy-minimax-m3-with-vllm-openai-api-en":31,"series-tools-77c071b4-4373-449e-b812-2577d9644514":80},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":23,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"77c071b4-4373-449e-b812-2577d9644514","deploy-minimax-m3-with-vllm-openai-api-en","Deploy MiniMax M3 with vLLM OpenAI API","\u003Cp data-speakable=\"summary\">Run MiniMax M3 locally with \u003Ca href=\"\u002Ftag\u002Fvllm\">vLLM\u003C\u002Fa> and expose an \u003Ca href=\"\u002Ftag\u002Fopenai\">OpenAI\u003C\u002Fa>-compatible API.\u003C\u002Fp>\u003Cp>This guide is for developers who want to serve \u003Ca href=\"https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F2049845285195605901\">MiniMax M3\u003C\u002Fa> with \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\">vLLM\u003C\u002Fa> and keep the interface OpenAI-compatible. 
By the end, you will have a running model server, tool-calling and reasoning parsers enabled, and a quick way to verify that requests reach the endpoint.\u003C\u002Fp>\u003Cp>You will also know which runtime pieces matter most: GPU access, Hugging Face cache mounting, tensor parallelism, and the exact flags used by the MiniMax M3 recipe in vLLM.\u003C\u002Fp>\u003Ch2>Before you start\u003C\u002Fh2>\u003Cul>\u003Cli>Docker installed, version 24+.\u003C\u002Fli>\u003Cli>NVIDIA GPUs with CUDA-capable drivers and the NVIDIA Container Toolkit installed.\u003C\u002Fli>\u003Cli>8 GPUs to run the sample command as written, because it sets \u003Ccode>--tensor-parallel-size 8\u003C\u002Fcode>; on a smaller machine, lower that value to match the GPUs you have.\u003C\u002Fli>\u003Cli>Hugging Face account and access to the \u003Ccode>MiniMaxAI\u002FMiniMax-M3-MXFP8\u003C\u002Fcode> model.\u003C\u002Fli>\u003Cli>Hugging Face token configured locally with \u003Ccode>huggingface-cli login\u003C\u002Fcode> or an equivalent secret mount.\u003C\u002Fli>\u003Cli>Linux host where you can run containers with \u003Ccode>--privileged\u003C\u002Fcode> and \u003Ccode>--ipc=host\u003C\u002Fcode>, as the run command requires.\u003C\u002Fli>\u003Cli>Enough disk space for model weights and cache, ideally 100 GB+ free.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Step 1: Pull the vLLM OpenAI image\u003C\u002Fh2>\u003Cp>Your first outcome is a ready-to-run container image that already includes the OpenAI-compatible server entrypoint used by the MiniMax M3 recipe.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781954275829-y5gk.png\" alt=\"Deploy MiniMax M3 with vLLM OpenAI API\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cpre>\u003Ccode>docker pull vllm\u002Fvllm-openai:minimax-m3\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>After the pull completes, you should see \u003Ca href=\"\u002Ftag\u002Fdocker\">Docker\u003C\u002Fa> report the image as downloaded locally. 
If you run \u003Ccode>docker images\u003C\u002Fcode>, you should see \u003Ccode>vllm\u002Fvllm-openai\u003C\u002Fcode> with the \u003Ccode>minimax-m3\u003C\u002Fcode> tag.\u003C\u002Fp>\u003Ch2>Step 2: Mount the Hugging Face cache\u003C\u002Fh2>\u003Cp>Your next outcome is persistent model caching, so the weights do not download again every time you restart the server.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781954272727-870t.png\" alt=\"Deploy MiniMax M3 with vLLM OpenAI API\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cpre>\u003Ccode>mkdir -p ~\u002F.cache\u002Fhuggingface\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>Then make sure your Hugging Face credentials are available to the runtime. A common path is to log in once on the host and mount the cache into the container, as shown in the final run command. You should be able to list files under \u003Ccode>~\u002F.cache\u002Fhuggingface\u003C\u002Fcode> and see \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> and model cache directories after the first download.\u003C\u002Fp>\u003Ch2>Step 3: Start the MiniMax M3 server\u003C\u002Fh2>\u003Cp>Your main outcome is a live API server on port 8000 that loads MiniMax M3 with the recipe settings from the source guide.\u003C\u002Fp>\u003Cpre>\u003Ccode>docker run --gpus all --privileged --ipc=host -p 8000:8000 \\\n  -v ~\u002F.cache\u002Fhuggingface:\u002Froot\u002F.cache\u002Fhuggingface \\\n  vllm\u002Fvllm-openai:minimax-m3 MiniMaxAI\u002FMiniMax-M3-MXFP8 \\\n  --block-size 128 \\\n  --tensor-parallel-size 8 \\\n  --tool-call-parser minimax_m3 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser minimax_m3\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>When the container starts correctly, you should see vLLM logs that mention model loading, tokenizer setup, and the OpenAI-compatible server 
binding to \u003Ccode>0.0.0.0:8000\u003C\u002Fcode>. If the model is downloading, expect extra progress output before the server becomes ready.\u003C\u002Fp>\u003Ch2>Step 4: Verify the OpenAI-compatible endpoint\u003C\u002Fh2>\u003Cp>Your outcome here is proof that the server is reachable and responding to API calls, not just running in the background.\u003C\u002Fp>\u003Cpre>\u003Ccode>curl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fmodels\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>You should see a JSON response that lists the loaded model. If that request returns model metadata, the server is healthy and the OpenAI-style route is working.\u003C\u002Fp>\u003Ch2>Step 5: Confirm tool calling and reasoning parsers\u003C\u002Fh2>\u003Cp>Your final outcome is a server configured for agentic workflows, with MiniMax M3-specific tool-call and reasoning parsing enabled.\u003C\u002Fp>\u003Cpre>\u003Ccode>curl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H 'Content-Type: application\u002Fjson' \\\n  -d '{\n    \"model\": \"MiniMaxAI\u002FMiniMax-M3-MXFP8\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"List two tools you would use to inspect a repo.\"}],\n    \"max_tokens\": 64\n  }'\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>You should see a chat-completions response rather than an error. This request only confirms that the chat route works with the parser flags enabled; to exercise the tool-call parser itself, include a \u003Ccode>tools\u003C\u002Fcode> array in the request body. 
If you later connect an \u003Ca href=\"\u002Ftag\u002Fagent\">agent\u003C\u002Fa> framework, this is the endpoint you will point it at.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Metric\u003C\u002Fth>\u003Cth>Before\u002FBaseline\u003C\u002Fth>\u003Cth>After\u002FResult\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>API compatibility\u003C\u002Ftd>\u003Ctd>No local endpoint\u003C\u002Ftd>\u003Ctd>OpenAI-compatible server on port 8000\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Tool-calling support\u003C\u002Ftd>\u003Ctd>Disabled\u003C\u002Ftd>\u003Ctd>\u003Ccode>--enable-auto-tool-choice\u003C\u002Fcode> and \u003Ccode>--tool-call-parser minimax_m3\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Reasoning parsing\u003C\u002Ftd>\u003Ctd>Disabled\u003C\u002Ftd>\u003Ctd>\u003Ccode>--reasoning-parser minimax_m3\u003C\u002Fcode> enabled\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Parallelism\u003C\u002Ftd>\u003Ctd>Single-device default\u003C\u002Ftd>\u003Ctd>\u003Ccode>--tensor-parallel-size 8\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>Common mistakes\u003C\u002Fh2>\u003Cul>\u003Cli>Using the wrong model name. Fix: keep \u003Ccode>MiniMaxAI\u002FMiniMax-M3-MXFP8\u003C\u002Fcode> exactly as shown in the recipe unless the vLLM docs say otherwise.\u003C\u002Fli>\u003Cli>Forgetting GPU support in Docker. Fix: install the NVIDIA Container Toolkit and rerun with \u003Ccode>--gpus all\u003C\u002Fcode>.\u003C\u002Fli>\u003Cli>Setting tensor parallelism higher than available GPUs. 
Fix: match \u003Ccode>--tensor-parallel-size\u003C\u002Fcode> to the number of visible GPUs, or reduce it for a smaller machine.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What's next\u003C\u002Fh2>\u003Cp>Once the server is stable, the next step is to connect an agent framework or client SDK to \u003Ccode>http:\u002F\u002Flocalhost:8000\u002Fv1\u003C\u002Fcode>, then tune context length, batching, and GPU memory settings using the \u003Ca href=\"https:\u002F\u002Frecipes.vllm.ai\u002FMiniMaxAI\u002FMiniMax-M3?variant=mxfp8\">vLLM recipe\u003C\u002Fa> and the MiniMax M3 source notes.\u003C\u002Fp>","Run MiniMax M3 locally with vLLM and expose an OpenAI-compatible API.","zhuanlan.zhihu.com","https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F2049845285195605901",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781954275829-y5gk.png","tools","en","7beaabe3-5421-4e2b-a42a-d1a7b669be12",[17,18,19,20,21,22],"MiniMax M3","vLLM","OpenAI-compatible API","Docker","tool calling","reasoning parser",[24,25,26],"MiniMax M3 can be served locally through vLLM with an OpenAI-compatible API.","The recipe depends on GPU-enabled Docker, Hugging Face cache mounting, and the MXFP8 model.","Tool calling and reasoning are enabled with MiniMax-specific parser flags.",0,"2026-06-20T11:17:30.525369+00:00","2026-06-20T11:17:30.509+00:00","e09db9fe-b9af-480c-af01-5f9a94d39123",{"tags":32,"relatedLang":39,"relatedPosts":43},[33,35,37],{"name":21,"slug":34},"tool-calling",{"name":18,"slug":36},"vllm",{"name":20,"slug":38},"docker",{"id":15,"slug":40,"title":41,"language":42},"deploy-minimax-m3-with-vllm-openai-api-zh","用 vLLM 部署 MiniMax M3 並開啟 OpenAI 
API","zh",[44,50,56,62,68,74],{"id":45,"slug":46,"title":47,"cover_image":48,"image_url":48,"created_at":49,"category":13},"3ef4d3d4-b628-498c-acb7-34131b1a60cd","fde-role-sales-engineering-playbook-en","The FDE role ties pre-sales and engineering together","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781966000604-7muv.png","2026-06-20T14:32:51.011973+00:00",{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":13},"12a7a9d9-3333-4e6b-9ab3-dc56f9ebf037","namastack-turns-outbox-pain-into-reliable-events-en","Namastack turns outbox pain into reliable events","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781949793158-bqj9.png","2026-06-20T10:02:50.381051+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":13},"dd6fa1a4-f821-4b06-b0f9-f48cada0bfb7","claude-design-assets-to-design-system-en","Claude Design turns assets into a team design system","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781946192909-hr83.png","2026-06-20T09:02:47.22907+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":13},"2b98a581-b1d7-4c76-9d84-70c46ba38213","vs-code-turns-folder-into-workspace-en","VS Code turns a folder into a workspace","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781939001917-7k7a.png","2026-06-20T07:02:53.076917+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":13},"25ea27d2-abec-486a-899f-b4dd5602f2cd","midjourney-medical-turns-scans-into-spa-en","Midjourney Medical turns scans into a 
spa","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781909286439-yujm.png","2026-06-19T22:47:41.119572+00:00",{"id":75,"slug":76,"title":77,"cover_image":78,"image_url":78,"created_at":79,"category":13},"b101c255-aaf5-4fb1-a6b3-b82bef35778f","three-multimodal-models-work-in-claude-code-en","Three multimodal models now work in Claude Code","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781892162781-se19.png","2026-06-19T18:02:16.12379+00:00",[81,86,91,96,101,106,111,116,121,126],{"id":82,"slug":83,"title":84,"created_at":85},"8008f1a9-7a00-4bad-88c9-3eedc9c6b4b1","surepath-ai-mcp-policy-controls-en","SurePath AI's New MCP Policy Controls Enhance AI Security","2026-03-26T01:26:52.222015+00:00",{"id":87,"slug":88,"title":89,"created_at":90},"27e39a8f-b65d-4f7b-a875-859e2b210156","mcp-standard-ai-tools-2026-en","MCP Standard in 2026: Integrating AI Tools","2026-03-26T01:27:43.127519+00:00",{"id":92,"slug":93,"title":94,"created_at":95},"165f9a19-c92d-46ba-b3f0-7125f662921d","rag-2026-transforming-enterprise-ai-en","How RAG in 2026 is Transforming Enterprise AI","2026-03-26T01:28:11.485236+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"6a2a8e6e-b956-49d8-be12-cc47bdc132b2","mastering-ai-prompts-2026-guide-en","Mastering AI Prompts: A 2026 Guide for Developers","2026-03-26T01:29:07.835148+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"3ab2c67e-4664-4c67-a013-687a2f605814","garry-tan-open-sources-claude-code-toolkit-en","Garry Tan Open-Sources a Claude Code Toolkit","2026-03-26T08:26:20.245934+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"66a7cbf8-7e76-41d4-9bbf-eaca9761bf69","github-ai-projects-to-watch-in-2026-en","20 GitHub AI Projects to Watch in 
2026","2026-03-26T08:28:09.752027+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"9f332fda-eace-448a-a292-2283951eee71","practical-github-guide-learning-ml-2026-en","A Practical GitHub Guide to Learning ML in 2026","2026-03-27T01:16:50.125678+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"1b1f637d-0f4d-42bd-974b-07b53829144d","aiml-2026-student-ai-ml-lab-repo-review-en","AIML-2026 Is a Bare-Bones Student Lab Repo","2026-03-27T01:21:51.661231+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"6d1bf3f6-e191-4d30-b55b-8a0722fa6afe","ai-trending-github-repos-and-research-feeds-en","AI Trending Tracks Repos and Research Feeds","2026-03-27T01:31:35.709532+00:00",{"id":127,"slug":128,"title":129,"created_at":130},"010539a1-4c3a-4bd3-937a-26616422ee0d","awesome-ai-for-science-research-tools-map-en","Awesome AI for Science Is Becoming a Real Research Map","2026-03-27T01:46:50.89513+00:00"]