[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-omlx-045-dev1-glm52-minimax-m3-speedups-en":3,"article-related-omlx-045-dev1-glm52-minimax-m3-speedups-en":31,"series-model-release-b4840252-4311-4c44-9814-4a3d1666302f":74},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":23,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"b4840252-4311-4c44-9814-4a3d1666302f","omlx-045-dev1-glm52-minimax-m3-speedups-en","oMLX 0.4.5.dev1 speeds up GLM-5.2 and MiniMax M3","\u003Cp data-speakable=\"summary\">oMLX 0.4.5.dev1 adds faster GLM-5.2 and MiniMax M3 \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa>, plus cache and \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> fixes.\u003C\u002Fp>\u003Cp>\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\" target=\"_blank\" rel=\"noopener\">oMLX\u003C\u002Fa> 0.4.5.dev1 is a pre-release packed with performance work, and the numbers are hard to miss. On a \u003Ca href=\"https:\u002F\u002Fwww.apple.com\u002Fmac-studio\u002F\" target=\"_blank\" rel=\"noopener\">Mac Studio\u003C\u002Fa> with an M3 Ultra and 512 GB of unified memory, the project reports prefill gains as high as 98.9% for GLM-5.2-oQ4 at 32k context, while MiniMax-M3-oQ3 nearly doubles prefill throughput at 64k context.\u003C\u002Fp>\u003Cp>The release also fixes cache handling after hybrid cache restore and chunked prefill insertion, and it corrects benchmark loading so VLM MTP paths do not get forced through LM-only loading. That matters because these are the kind of bugs that quietly distort performance data and make real workloads behave differently from benchmark runs.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Model\u003C\u002Fth>\u003Cth>Context\u003C\u002Fth>\u003Cth>Baseline PP\u003C\u002Fth>\u003Cth>oMLX 0.4.5 PP\u003C\u002Fth>\u003Cth>Change\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>GLM-5.2-oQ4\u003C\u002Ftd>\u003Ctd>32k\u003C\u002Ftd>\u003Ctd>87.7 tok\u002Fs\u003C\u002Ftd>\u003Ctd>174.4 tok\u002Fs\u003C\u002Ftd>\u003Ctd>+98.9%\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>GLM-5.2-oQ4\u003C\u002Ftd>\u003Ctd>16k\u003C\u002Ftd>\u003Ctd>128.1 tok\u002Fs\u003C\u002Ftd>\u003Ctd>178.9 tok\u002Fs\u003C\u002Ftd>\u003Ctd>+39.7%\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>MiniMax-M3-oQ3\u003C\u002Ftd>\u003Ctd>64k\u003C\u002Ftd>\u003Ctd>158.8 tok\u002Fs\u003C\u002Ftd>\u003Ctd>307.7 tok\u002Fs\u003C\u002Ftd>\u003Ctd>+93.8%\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>MiniMax-M3-oQ3\u003C\u002Ftd>\u003Ctd>32k\u003C\u002Ftd>\u003Ctd>228.1 tok\u002Fs\u003C\u002Ftd>\u003Ctd>327.1 tok\u002Fs\u003C\u002Ftd>\u003Ctd>+43.4%\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>Custom kernels are doing the heavy lifting\u003C\u002Fh2>\u003Cp>The biggest story in this release is custom kernel work for two model families: GLM-5.2 and MiniMax M3. oMLX now includes native GLM MoE DSA and Sparse MLA kernels, plus MiniMax M3 sparse-attention acceleration and adaptive long-prefill sizing. In plain English, the project is spending less time doing generic work and more time using code paths tuned for the models it is actually serving.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782709371396-mn9r.png\" alt=\"oMLX 0.4.5.dev1 speeds up GLM-5.2 and MiniMax M3\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That kind of optimization shows up most clearly in prefill, the stage where a model digests the prompt before it starts generating tokens. Prefill gets expensive fast as context grows, so a 32k prompt can be a much better stress test than a short chat. On GLM-5.2-oQ4, oMLX jumps from 87.7 tok\u002Fs to 174.4 tok\u002Fs at 32k context. MiniMax-M3-oQ3 moves from 158.8 tok\u002Fs to 307.7 tok\u002Fs at 64k context.\u003C\u002Fp>\u003Cul>\u003Cli>GLM-5.2-oQ4 prefill at 32k: 87.7 tok\u002Fs to 174.4 tok\u002Fs\u003C\u002Fli>\u003Cli>MiniMax-M3-oQ3 prefill at 64k: 158.8 tok\u002Fs to 307.7 tok\u002Fs\u003C\u002Fli>\u003Cli>GLM-5.2-oQ4 prefill at 16k: 128.1 tok\u002Fs to 178.9 tok\u002Fs\u003C\u002Fli>\u003Cli>MiniMax-M3-oQ3 prefill at 32k: 228.1 tok\u002Fs to 327.1 tok\u002Fs\u003C\u002Fli>\u003C\u002Ful>\u003Cp>These numbers matter because they point to a pattern: the longer the context, the more the new kernels pay off. That is exactly where local inference stacks tend to hurt, especially on memory-rich \u003Ca href=\"\u002Ftag\u002Fapple\">Apple\u003C\u002Fa> Silicon machines that are asked to chew through long prompts, retrieval traces, or multi-turn \u003Ca href=\"\u002Ftag\u002Fagent\">agent\u003C\u002Fa> sessions.\u003C\u002Fp>\u003Ch2>Profiles and presets make the models easier to expose\u003C\u002Fh2>\u003Cp>oMLX 0.4.5.dev1 also adds API-visible model profiles and refreshed global presets. The release notes say profiles can be exposed in \u003Ccode>\u002Fv1\u002Fmodels\u003C\u002Fcode> and served through the same loaded engine, which should make \u003Ca href=\"\u002Ftag\u002Fopenai\">OpenAI\u003C\u002Fa>-compatible clients happier when they inspect what is actually available. The built-in presets now include \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\" target=\"_blank\" rel=\"noopener\">MiniMax-M3\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\" target=\"_blank\" rel=\"noopener\">GLM-5.2\u003C\u002Fa>.\u003C\u002Fp>\u003Cp>That may sound like a small API polish item, but it solves a real integration problem. If a serving layer loads one engine and exposes another set of names, client apps can misread capabilities or route requests the wrong way. For teams building local AI tools, especially ones that sit behind a standard \u003Ca href=\"https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fapi-reference\u002Fmodels\" target=\"_blank\" rel=\"noopener\">OpenAI-compatible models endpoint\u003C\u002Fa>, cleaner model metadata reduces guesswork.\u003C\u002Fp>\u003Cblockquote>“The point of APIs is to hide the mess,” said \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=4N9KqWjX2jQ\" target=\"_blank\" rel=\"noopener\">John Ousterhout\u003C\u002Fa> in his widely cited talks on software design. “If you expose the right interface, the rest becomes easier.”\u003C\u002Fblockquote>\u003Cp>That quote fits this release well. oMLX is not just chasing raw speed; it is making the models easier to identify, route, and serve without special cases in every client.\u003C\u002Fp>\u003Ch2>The fixes are about trust, not cosmetics\u003C\u002Fh2>\u003Cp>The bug list is long, but the most important items are the ones that protect correctness under load. The release fixes head_dim=256 long-context prefill OOM by routing eligible work through the tiled SDPA256 path. It also fixes false VLM preflight rejections by counting actual image tokens instead of charging every image at the max-pixels ceiling.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782709370913-1j6p.png\" alt=\"oMLX 0.4.5.dev1 speeds up GLM-5.2 and MiniMax M3\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Those are the kinds of bugs that can make a benchmark look broken or make a real app fail for reasons that are hard to diagnose. The release also patches VLM teardown memory reclaim, SSD cache limit enforcement across model switches, unsafe in-flight model unload races, and MiniMax M3 long-generation cache materialization. In other words, the project is tightening the bolts around the same areas that usually fail first when people push local serving hard.\u003C\u002Fp>\u003Cul>\u003Cli>head_dim=256 prefill OOM fixed with tiled SDPA256 routing\u003C\u002Fli>\u003Cli>False VLM preflight rejections fixed with actual image token counting\u003C\u002Fli>\u003Cli>SSD cache limits now hold across model switches and nested cache serialization\u003C\u002Fli>\u003Cli>MiniMax M3 long-generation cache materialization was improved\u003C\u002Fli>\u003C\u002Ful>\u003Cp>There are also smaller but still useful fixes: Gemma 4 tool-call parsing, Cohere2 streamed tool arguments, \u002Fv1\u002Fresponses reasoning output, MCP stdio cwd handling, CLI bootstrap loading, and several macOS UI issues such as clipped chat buttons and stale auto-start state. These are the details that make a project feel lived in rather than demo-only.\u003C\u002Fp>\u003Ch2>What this release says about the project\u003C\u002Fh2>\u003Cp>oMLX is clearly aiming at two audiences at once. One is the hobbyist or power user running local models on Apple hardware. The other is the developer building an app or agent stack that needs predictable model metadata, cache behavior, and benchmark numbers. This release spends real effort on both sides.\u003C\u002Fp>\u003Cp>The performance snapshot is the loudest evidence. At 1k context, GLM-5.2-oQ4 barely changes on prefill, from 186.8 tok\u002Fs to 187.7 tok\u002Fs, but by 32k context it nearly doubles. MiniMax-M3-oQ3 shows the same shape, with modest gains at short context and much larger gains as the prompt gets longer. That is a strong hint that the new kernels are targeted at the exact workloads that hurt most in practice.\u003C\u002Fp>\u003Cp>If you are tracking local inference on macOS, the takeaway is simple: this release is less about a shiny new feature and more about making long-context serving materially faster and less fragile. The next question is whether these gains hold up across more models and whether the same kernel strategy can be extended without turning the codebase into a pile of special cases.\u003C\u002Fp>\u003Cp>For now, oMLX 0.4.5.dev1 gives Apple Silicon users a concrete reason to care about prefill, cache correctness, and model metadata, because those are the pieces that decide whether a local AI stack feels fast in a demo or dependable in production.\u003C\u002Fp>","oMLX 0.4.5.dev1 adds custom kernels for GLM-5.2 and MiniMax M3, plus cache fixes and better model profile exposure.","github.com","https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\u002Freleases",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782709371396-mn9r.png","model-release","en","88d353ca-468b-4774-922d-ef0cbc2edd68",[17,18,19,20,21,22],"oMLX","GLM-5.2","MiniMax M3","Apple Silicon","prefill speed","cache fixes",[24,25,26],"oMLX 0.4.5.dev1 focuses on faster long-context prefill for GLM-5.2 and MiniMax M3.","The release adds API-visible model profiles and refreshed presets for cleaner client integration.","Several cache, VLM, and benchmark-loading bugs were fixed to improve correctness under load.",0,"2026-06-29T05:02:28.770698+00:00","2026-06-29T05:02:28.762+00:00","8a720a1b-e905-4cc6-8607-4887b319116e",{"tags":32,"relatedLang":33,"relatedPosts":37},[],{"id":15,"slug":34,"title":35,"language":36},"omlx-045-dev1-glm52-minimax-m3-speedups-zh","oMLX 0.4.5.dev1 讓長上下文更快","zh",[38,44,50,56,62,68],{"id":39,"slug":40,"title":41,"cover_image":42,"image_url":42,"created_at":43,"category":13},"666962b5-ce8c-430c-9d07-8cdfd44ffd09","llama-legends-380-season-3-heroes-raids-en","Llama Legends 3.8.0 adds Season 3 heroes and raids","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782711179242-ednu.png","2026-06-29T05:32:33.398141+00:00",{"id":45,"slug":46,"title":47,"cover_image":48,"image_url":48,"created_at":49,"category":13},"1fe27411-ad64-4717-85c9-89b5c350253c","grok-45-private-beta-tesla-spacex-en","Grok 4.5 enters private beta at Tesla and SpaceX","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782687764199-vjto.png","2026-06-28T23:02:23.343104+00:00",{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":13},"35368bfc-0dbe-45dc-b422-87b1bd350ac0","google-openrl-llm-fine-tuning-kubernetes-en","Google OpenRL brings RL fine-tuning to Kubernetes","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782572578249-jlty.png","2026-06-27T15:02:27.543012+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":13},"8fe33efd-3a68-4fe3-935f-f0f5d3f058fc","diffusiongemma-runs-fast-on-nvidia-rtx-dgx-en","DiffusionGemma runs fast on NVIDIA RTX and DGX","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782570781225-7xo9.png","2026-06-27T14:32:34.997765+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":13},"ce53e9e6-c310-4434-9971-4f4f3a274577","glm-52-beats-gpt-55-coding-benchmarks-en","GLM-5.2 beats GPT-5.5 on coding tests","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782564469790-2zyi.png","2026-06-27T12:47:27.758841+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":13},"730a2199-d009-4a27-8f00-8e9ea6a4b02e","openai-gpt-56-rollout-us-request-en","OpenAI narrows GPT-5.6 rollout after U.S. request","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782555472898-iuil.png","2026-06-27T10:17:28.937624+00:00",[75,80,85,90,95,100,105,110,115,120],{"id":76,"slug":77,"title":78,"created_at":79},"d4cffde7-9b50-4cc7-bb68-8bc9e3b15477","nvidia-rubin-ai-supercomputer-en","NVIDIA Unveils Rubin: A Leap in AI Supercomputing","2026-03-25T16:24:35.155565+00:00",{"id":81,"slug":82,"title":83,"created_at":84},"eab919b9-fbac-4048-89fc-afad6749ccef","google-gemini-ai-innovations-2026-en","Google's AI Leap with Gemini Innovations in 2026","2026-03-25T16:27:18.841838+00:00",{"id":86,"slug":87,"title":88,"created_at":89},"5f5cfc67-3384-4816-a8f6-19e44d90113d","gap-google-gemini-ai-checkout-en","Gap Teams Up with Google Gemini for AI-Driven Checkout","2026-03-25T16:27:46.483272+00:00",{"id":91,"slug":92,"title":93,"created_at":94},"f6d04567-47f6-49ec-804c-52e61ab91225","ai-model-release-wave-march-2026-en","Navigating the AI Model Release Wave of March 2026","2026-03-25T16:28:45.409716+00:00",{"id":96,"slug":97,"title":98,"created_at":99},"895c150c-569e-4fdf-939d-dade785c990e","small-language-models-transform-ai-en","Small Language Models: Llama 3.2 and Phi-3 Transform AI","2026-03-25T16:30:26.688313+00:00",{"id":101,"slug":102,"title":103,"created_at":104},"38eb1d26-d961-4fd3-ae12-9c4089680f5f","midjourney-v8-alpha-features-pricing-en","Midjourney V8 Alpha: A Deep Dive into Its Features and Pricing","2026-03-26T01:25:36.387587+00:00",{"id":106,"slug":107,"title":108,"created_at":109},"bf36bb9e-3444-4fb8-ab19-0df6bc9d8271","rag-2026-indispensable-ai-bridge-en","RAG in 2026: The Indispensable AI Bridge","2026-03-26T01:28:34.472046+00:00",{"id":111,"slug":112,"title":113,"created_at":114},"60881d6d-2310-44ef-b1fb-7f98e9dd2f0e","xiaomi-mimo-trio-agents-robots-voice-en","Xiaomi’s MiMo trio targets agents, robots, and voice","2026-03-28T03:05:08.899895+00:00",{"id":116,"slug":117,"title":118,"created_at":119},"f063d8d1-41d1-4de4-8ebc-6c40511b9369","xiaomi-mimo-v2-pro-1t-moe-agents-en","Xiaomi MiMo-V2-Pro: 1T MoE Model for Agents","2026-03-28T03:06:19.238032+00:00",{"id":121,"slug":122,"title":123,"created_at":124},"a1379e9a-6785-4ff5-9b0a-8cff55f8264f","cursor-composer-2-started-from-kimi-en","Cursor’s Composer 2 started from Kimi","2026-03-28T03:11:59.132398+00:00"]