oMLX 0.4.5.dev1 speeds up GLM-5.2 and MiniMax M3

OraCore Editors

[MODEL] June 29, 2026 | 7 min read | OraCore Editors

oMLX 0.4.5.dev1 adds custom kernels for GLM-5.2 and MiniMax M3, plus cache fixes and better model profile exposure.

oMLX 0.4.5.dev1 is a pre-release packed with performance work, and the numbers are hard to miss. On a Mac Studio with an M3 Ultra and 512 GB of unified memory, the project reports prefill gains as high as 98.9% for GLM-5.2-oQ4 at 32k context, while MiniMax-M3-oQ3 nearly doubles prefill throughput at 64k context.

The release also fixes cache handling after hybrid cache restore and chunked prefill insertion, and it corrects benchmark loading so VLM MTP paths are no longer forced through LM-only loading. That matters because these are the kinds of bugs that quietly distort performance data and make real workloads behave differently from benchmark runs.

Model	Context	Baseline prefill	oMLX 0.4.5.dev1 prefill	Change
GLM-5.2-oQ4	32k	87.7 tok/s	174.4 tok/s	+98.9%
GLM-5.2-oQ4	16k	128.1 tok/s	178.9 tok/s	+39.7%
MiniMax-M3-oQ3	64k	158.8 tok/s	307.7 tok/s	+93.8%
MiniMax-M3-oQ3	32k	228.1 tok/s	327.1 tok/s	+43.4%

Custom kernels are doing the heavy lifting

The biggest story in this release is custom kernel work for two model families: GLM-5.2 and MiniMax M3. oMLX now includes native GLM MoE DSA and Sparse MLA kernels, plus MiniMax M3 sparse-attention acceleration and adaptive long-prefill sizing. In plain English, the project is spending less time doing generic work and more time using code paths tuned for the models it is actually serving.

That kind of optimization shows up most clearly in prefill, the stage where a model processes the prompt before it starts generating tokens. Prefill cost grows quickly with context length, so a 32k prompt is a much better stress test than a short chat. On GLM-5.2-oQ4, prefill throughput rises from 87.7 tok/s to 174.4 tok/s at 32k context. On MiniMax-M3-oQ3, it rises from 158.8 tok/s to 307.7 tok/s at 64k context.

GLM-5.2-oQ4 prefill at 32k: 87.7 tok/s to 174.4 tok/s
MiniMax-M3-oQ3 prefill at 64k: 158.8 tok/s to 307.7 tok/s
GLM-5.2-oQ4 prefill at 16k: 128.1 tok/s to 178.9 tok/s
MiniMax-M3-oQ3 prefill at 32k: 228.1 tok/s to 327.1 tok/s

These numbers matter because they point to a pattern: the longer the context, the more the new kernels pay off. That is exactly where local inference stacks tend to hurt, especially on memory-rich Apple Silicon machines that are asked to chew through long prompts, retrieval traces, or multi-turn agent sessions.
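The pattern can be checked directly from the reported figures. The sketch below recomputes the percentage gains and turns throughput into an estimated prefill time. It assumes that "32k" means 32 × 1024 tokens, which the release notes do not state, and the function names are illustrative only.

```python
# Reported prefill throughput (tok/s) from the table in this article.
# Each row: (model, context in thousands of tokens, baseline, oMLX 0.4.5.dev1)
REPORTED = [
    ("GLM-5.2-oQ4", 16, 128.1, 178.9),
    ("GLM-5.2-oQ4", 32, 87.7, 174.4),
    ("MiniMax-M3-oQ3", 32, 228.1, 327.1),
    ("MiniMax-M3-oQ3", 64, 158.8, 307.7),
]


def percent_change(baseline: float, new: float) -> float:
    """Relative throughput gain, in percent."""
    return (new - baseline) / baseline * 100.0


def prefill_seconds(context_k: int, tok_per_s: float) -> float:
    """Estimated time to process a full prompt.

    Assumption: a context of N "k" tokens is N * 1024 tokens.
    """
    return context_k * 1024 / tok_per_s


if __name__ == "__main__":
    for model, ctx, base, new in REPORTED:
        gain = percent_change(base, new)
        before = prefill_seconds(ctx, base)
        after = prefill_seconds(ctx, new)
        print(f"{model} @ {ctx}k: +{gain:.1f}%, "
              f"about {before:.0f} s -> {after:.0f} s of prefill")
```

Under that assumption, a full 32k prompt on GLM-5.2-oQ4 drops from roughly six minutes of prefill to roughly three, which is the kind of difference a user notices before the first token arrives.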

Profiles and presets make the models easier to expose

oMLX 0.4.5.dev1 also adds API-visible model profiles and refreshed global presets. The release notes say profiles can be exposed in /v1/models and served through the same loaded engine, which should give OpenAI-compatible clients an accurate list when they inspect what is available. The built-in presets now include MiniMax-M3 and GLM-5.2.

That may sound like a small API polish item, but it solves a real integration problem. If a serving layer loads one engine and exposes another set of names, client apps can misread capabilities or route requests the wrong way. For teams building local AI tools, especially ones that sit behind a standard OpenAI-compatible models endpoint, cleaner model metadata reduces guesswork.
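As a rough illustration of what a client sees, here is a minimal sketch that reads an OpenAI-style model list. It assumes the server returns the usual `{"object": "list", "data": [{"id": ...}]}` shape; the server address in the comment and the profile-style id in the sample payload are invented for the example and are not taken from the release notes.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Return the `id` of every entry in an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Fetch GET {base_url}/v1/models and return the advertised ids."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_model_ids(json.load(response))


# Hypothetical response body. The second id is an invented profile name,
# used only to show a profile appearing next to its base model.
SAMPLE_RESPONSE = {
    "object": "list",
    "data": [
        {"id": "GLM-5.2-oQ4", "object": "model"},
        {"id": "GLM-5.2-oQ4-long-context", "object": "model"},
        {"id": "MiniMax-M3-oQ3", "object": "model"},
    ],
}

if __name__ == "__main__":
    print(extract_model_ids(SAMPLE_RESPONSE))
    # Against a running server (the address is an assumption):
    # print(list_models("http://localhost:8000"))
```

A client that routes on these ids only works if every listed name can actually be requested, which is the problem the profile exposure is meant to address.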

“The point of APIs is to hide the mess,” said John Ousterhout in his widely cited talks on software design. “If you expose the right interface, the rest becomes easier.”

That quote fits this release well. oMLX is not just chasing raw speed; it is making the models easier to identify, route, and serve without special cases in every client.

The fixes are about trust, not cosmetics

The bug list is long, but the most important items are the ones that protect correctness under load. The release fixes head_dim=256 long-context prefill OOM by routing eligible work through the tiled SDPA256 path. It also fixes false VLM preflight rejections by counting actual image tokens instead of charging every image at the max-pixels ceiling.

Those are the kinds of bugs that can make a benchmark look broken or make a real app fail for reasons that are hard to diagnose. The release also patches VLM teardown memory reclaim, SSD cache limit enforcement across model switches, unsafe in-flight model unload races, and MiniMax M3 long-generation cache materialization. In other words, the project is tightening the bolts around the same areas that usually fail first when people push local serving hard.

head_dim=256 prefill OOM fixed with tiled SDPA256 routing
False VLM preflight rejections fixed with actual image token counting
SSD cache limits now hold across model switches and nested cache serialization
MiniMax M3 long-generation cache materialization was improved
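To see why charging every image at the ceiling causes false rejections, consider the sketch below. It is not oMLX's implementation: the patch geometry, the pixel ceiling, and all function names are hypothetical values chosen to make the arithmetic concrete.

```python
import math

# All constants are hypothetical, chosen only for illustration.
PIXELS_PER_TOKEN_SIDE = 28       # e.g. 14 px patches merged 2x2
MAX_PIXELS = 1280 * 28 * 28      # assumed per-image pixel ceiling


def ceiling_tokens(max_pixels: int = MAX_PIXELS) -> int:
    """Tokens charged if an image is assumed to fill the pixel ceiling."""
    return max_pixels // (PIXELS_PER_TOKEN_SIDE ** 2)


def actual_tokens(width: int, height: int, max_pixels: int = MAX_PIXELS) -> int:
    """Tokens for the real image size, downscaling only if it is too large."""
    pixels = width * height
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
        width, height = int(width * scale), int(height * scale)
    tokens = (math.ceil(width / PIXELS_PER_TOKEN_SIDE)
              * math.ceil(height / PIXELS_PER_TOKEN_SIDE))
    # Rounding up per side can overshoot slightly, so cap at the ceiling.
    return min(tokens, ceiling_tokens(max_pixels))


def preflight_ok(text_tokens: int, image_sizes: list[tuple[int, int]],
                 context_limit: int, use_ceiling: bool) -> bool:
    """Return True if the request fits within the context limit."""
    if use_ceiling:
        image_total = len(image_sizes) * ceiling_tokens()
    else:
        image_total = sum(actual_tokens(w, h) for w, h in image_sizes)
    return text_tokens + image_total <= context_limit


if __name__ == "__main__":
    images = [(448, 448)] * 4  # four small images
    print(preflight_ok(2000, images, 4096, use_ceiling=True))
    print(preflight_ok(2000, images, 4096, use_ceiling=False))
```

With these assumed numbers, four small images cost 256 tokens each rather than 1,280, so a request that fits comfortably would still be rejected by a ceiling-based check.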

There are also smaller but still useful fixes: Gemma 4 tool-call parsing, Cohere2 streamed tool arguments, /v1/responses reasoning output, MCP stdio cwd handling, CLI bootstrap loading, and several macOS UI issues such as clipped chat buttons and stale auto-start state. These are the details that make a project feel lived in rather than demo-only.

What this release says about the project

oMLX is clearly aiming at two audiences at once. One is the hobbyist or power user running local models on Apple hardware. The other is the developer building an app or agent stack that needs predictable model metadata, cache behavior, and benchmark numbers. This release spends real effort on both sides.

The performance snapshot is the loudest evidence. At 1k context, GLM-5.2-oQ4 barely changes on prefill, from 186.8 tok/s to 187.7 tok/s, but by 32k context it nearly doubles. MiniMax-M3-oQ3 shows the same shape, with modest gains at short context and much larger gains as the prompt gets longer. That is a strong hint that the new kernels are targeted at the exact workloads that hurt most in practice.

If you are tracking local inference on macOS, the takeaway is simple: this release is less about a shiny new feature and more about making long-context serving materially faster and less fragile. The next question is whether these gains hold up across more models and whether the same kernel strategy can be extended without turning the codebase into a pile of special cases.

For now, oMLX 0.4.5.dev1 gives Apple Silicon users a concrete reason to care about prefill, cache correctness, and model metadata, because those are the pieces that decide whether a local AI stack feels fast in a demo or dependable in production.
