[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-omlx-045-dev1-glm52-minimax-m3-speedups-zh":3,"article-related-omlx-045-dev1-glm52-minimax-m3-speedups-zh":35,"series-model-release-88d353ca-468b-4774-922d-ef0cbc2edd68":78},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":26,"views":31,"created_at":32,"published_at":33,"topic_cluster_id":34},"88d353ca-468b-4774-922d-ef0cbc2edd68","omlx-045-dev1-glm52-minimax-m3-speedups-zh","oMLX 0.4.5.dev1 讓長上下文更快","\u003Cp data-speakable=\"summary\">oMLX 0.4.5.dev1 針對 GLM-5.2 和 MiniMax M3 加速長上下文推論，還\u003Ca href=\"\u002Fnews\u002Fcuda-toolkit-13-3-fixes-nested-divergence-bug-zh\">修掉\u003C\u002Fa> cache 與 \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> 載入問題。\u003C\u002Fp>\u003Cp>\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\" target=\"_blank\" rel=\"noopener\">oMLX\u003C\u002Fa> 這版很直接。它不是加一個花俏功能，而是把推論速度拉上來。官方數據顯示，在 \u003Ca href=\"https:\u002F\u002Fwww.apple.com\u002Fmac-studio\u002F\" target=\"_blank\" rel=\"noopener\">Mac Studio\u003C\u002Fa> 的 M3 Ultra、512 GB unified memory 上，GLM-5.2-oQ4 在 32k context 的 prefill 從 87.7 tok\u002Fs 拉到 174.4 tok\u002Fs。這種數字，不是小修小補。\u003C\u002Fp>\u003Cp>MiniMax-M3-oQ3 也很兇。64k context 的 prefill 從 158.8 tok\u002Fs 提到 307.7 tok\u002Fs。講白了，就是長 prompt 越長，這版越有感。對本機跑 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\" target=\"_blank\" rel=\"noopener\">LLM\u003C\u002Fa> 的人來說，這比多一個設定選單實際太多。\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Model\u003C\u002Fth>\u003Cth>Context\u003C\u002Fth>\u003Cth>Baseline PP\u003C\u002Fth>\u003Cth>oMLX 0.4.5.dev1 PP\u003C\u002Fth>\u003Cth>Change\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>GLM-5.2-oQ4\u003C\u002Ftd>\u003Ctd>32k\u003C\u002Ftd>\u003Ctd>87.7 tok\u002Fs\u003C\u002Ftd>\u003Ctd>174.4 tok\u002Fs\u003C\u002Ftd>\u003Ctd>+98.9%\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>GLM-5.2-oQ4\u003C\u002Ftd>\u003Ctd>16k\u003C\u002Ftd>\u003Ctd>128.1 tok\u002Fs\u003C\u002Ftd>\u003Ctd>178.9 tok\u002Fs\u003C\u002Ftd>\u003Ctd>+39.7%\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>MiniMax-M3-oQ3\u003C\u002Ftd>\u003Ctd>64k\u003C\u002Ftd>\u003Ctd>158.8 tok\u002Fs\u003C\u002Ftd>\u003Ctd>307.7 tok\u002Fs\u003C\u002Ftd>\u003Ctd>+93.8%\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>MiniMax-M3-oQ3\u003C\u002Ftd>\u003Ctd>32k\u003C\u002Ftd>\u003Ctd>228.1 tok\u002Fs\u003C\u002Ftd>\u003Ctd>327.1 tok\u002Fs\u003C\u002Ftd>\u003Ctd>+43.4%\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>這版的重點，就是自訂 kernels\u003C\u002Fh2>\u003Cp>這次最有料的地方，是 GLM-5.2 和 MiniMax M3 的自訂 kernel。oMLX 加進了 GLM MoE DSA、Sparse MLA，還有 MiniMax M3 的 sparse-attention 加速與 adaptive long-prefill sizing。意思很簡單。它不再只跑通用路線，而是直接針對模型內部結構下刀。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782709372375-25nm.png\" alt=\"oMLX 0.4.5.dev1 讓長上下文更快\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>這種做法很務實。因為在推論世界裡，真正拖慢速度的，常常不是算力不夠，而是路徑太通用。尤其是 prefill。你只要把 context 拉到 32k 或 64k，任何小浪費都會被放大。\u003C\u002Fp>\u003Cp>官方數據也很誠實。GLM-5.2-oQ4 在 32k context 從 87.7 tok\u002Fs 跳到 174.4 tok\u002Fs。MiniMax-M3-oQ3 在 64k context 從 158.8 tok\u002Fs 到 307.7 tok\u002Fs。這不是「有變快」，而是接近翻倍。\u003C\u002Fp>\u003Cul>\u003Cli>GLM-5.2-oQ4：32k prefill 提升 98.9%\u003C\u002Fli>\u003Cli>MiniMax-M3-oQ3：64k prefill 提升 93.8%\u003C\u002Fli>\u003Cli>GLM-5.2-oQ4：16k prefill 提升 39.7%\u003C\u002Fli>\u003Cli>MiniMax-M3-oQ3：32k prefill 提升 43.4%\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>cache 和 benchmark 修正，才是能不能信的關鍵\u003C\u002Fh2>\u003Cp>很多人只看速度。老實說，這很危險。因為 benchmark 如果載入錯了，數字再漂亮也沒用。oMLX 0.4.5.dev1 修了 hybrid cache restore 之後的 cache 處理，也修了 chunked prefill insertion 的問題。這代表結果比較不會被 cache 狀態搞歪。\u003C\u002Fp>\u003Cp>它還修掉 benchmark loading 的路徑問題。原本 VLM MTP 可能會被硬塞進 LM-only loading。這種錯誤很陰。你表面上看到的是「能跑」，實際上跑的路徑根本不對。對做測試的人來說，這比當機更煩。\u003C\u002Fp>\u003Cp>還有幾個修正也很實用。像是 head_dim=256 的長上下文 prefill OOM，現在會走 tiled SDPA256 path。VLM 的 preflight 也改成算實際 image tokens，不再一律用 max-pixels ceiling。這些都不是裝飾品，是避免你在真實工作負載裡踩雷。\u003C\u002Fp>\u003Cblockquote>“The point of APIs is to hide the mess,” said \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=4N9KqWjX2jQ\" target=\"_blank\" rel=\"noopener\">John Ousterhout\u003C\u002Fa>. “If you expose the right interface, the rest becomes easier.”\u003C\u002Fblockquote>\u003Cp>這句話放在這版很貼切。oMLX 不是只追求快。它也在整理介面，讓\u003Ca href=\"\u002Fnews\u002Fmistral-ocr-4-citation-ready-structured-output-zh\">資料\u003C\u002Fa>、cache、benchmark 的邏輯比較一致。這才是能拿來做工具的底子。\u003C\u002Fp>\u003Ch2>模型 profiles 和 preset，讓整合少掉很多鳥事\u003C\u002Fh2>\u003Cp>這版還加了 API 可見的 model profiles。它可以出現在 \u003Ccode>\u002Fv1\u002Fmodels\u003C\u002Fcode>，也能透過同一個 loaded engine 對外提供。對 \u003Ca href=\"\u002Ftag\u002Fopenai\">OpenAI\u003C\u002Fa>-compatible client 來說，這種資訊很重要。因為 client 會先看你到底載了什麼，再決定\u003Ca href=\"\u002Fnews\u002Fanthropic-965b-valuation-ai-stocks-exposure-zh\">怎麼\u003C\u002Fa>送請求。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782709368421-7hjj.png\" alt=\"oMLX 0.4.5.dev1 讓長上下文更快\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>如果 serving layer 跟實際 engine 名稱對不起來，問題就會很煩。前端或 \u003Ca href=\"\u002Ftag\u002Fagent\">agent\u003C\u002Fa> 可能以為某個 model 可用，結果後端根本不是那個 profile。這種錯不會立刻爆炸，但會讓整套系統很難 debug。\u003C\u002Fp>\u003Cp>oMLX 也更新了 global presets，包含 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\" target=\"_blank\" rel=\"noopener\">MiniMax-M3\u003C\u002Fa> 和 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\" target=\"_blank\" rel=\"noopener\">GLM-5.2\u003C\u002Fa>。對開發者來說，這代表少寫一些手動配置，也少猜一些參數。\u003C\u002Fp>\u003Cul>\u003Cli>profiles 可透過 \u003Ccode>\u002Fv1\u002Fmodels\u003C\u002Fcode> 暴露\u003C\u002Fli>\u003Cli>同一個 engine 可提供 profile 資訊\u003C\u002Fli>\u003Cli>新增 MiniMax-M3 與 GLM-5.2 預設值\u003C\u002Fli>\u003Cli>更適合 OpenAI-compatible client 串接\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>跟其他本機推論方案比，這版很偏務實\u003C\u002Fh2>\u003Cp>如果你有碰過 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\" target=\"_blank\" rel=\"noopener\">llama.cpp\u003C\u002Fa>、\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx\" target=\"_blank\" rel=\"noopener\">MLX\u003C\u002Fa>，或是各種包裝過的本機 serving 工具，你會知道一件事。速度只是第一關。真正麻煩的是長上下文、cache、一致性，還有 API metadata。\u003C\u002Fp>\u003Cp>oMLX 這版的方向很明確。它不是去搶「誰的短 prompt 最快」。它是在長 context 上做文章。這很合理，因為現在很多應用都不是單輪聊天，而是文件摘要、RAG、agent trace、程式碼分析。這些場景一拉長，差距就出來了。\u003C\u002Fp>\u003Cp>從數據看，短 context 的提升沒那麼戲劇化。像 GLM-5.2-oQ4 在 1k context，prefill 只從 186.8 tok\u002Fs 到 187.7 tok\u002Fs。可是一拉到 32k，就直接接近翻倍。這代表 kernel 的價值不是平均分散，而是集中在高壓場景。\u003C\u002Fp>\u003Cul>\u003Cli>1k context 的提升小，32k 和 64k 才是主戰場\u003C\u002Fli>\u003Cli>長上下文更接近真實 RAG 與 agent 工作負載\u003C\u002Fli>\u003Cli>API metadata 修正，對整合比純速度更重要\u003C\u002Fli>\u003Cli>Apple Silicon 本機推論更吃 cache 與 memory 行為\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>這版透露出一個很清楚的方向\u003C\u002Fh2>\u003Cp>oMLX 的路線不是做一個什麼都能包的通用殼。它比較像是在 \u003Ca href=\"\u002Ftag\u002Fapple\">Apple\u003C\u002Fa> Silicon 上，認真把幾個熱門模型跑順。這種選擇很現實，也很對開發者胃口。因為你真的不會想在本機 serving 裡，一直跟 cache 和 metadata 打架。\u003C\u002Fp>\u003Cp>如果你現在就在用 Mac 跑模型，這版值得試。特別是你手上有長 prompt、長文件，或是多輪 agent 流程。它修的不只是速度，還有那些會讓 benchmark 跟真實工作負載脫節的地方。\u003C\u002Fp>\u003Cp>我自己的判斷很簡單。接下來如果 oMLX 能把這種 kernel 策略再擴到更多模型，Apple Silicon 的本機推論體驗會更像「能拿來幹活」，而不是「只能 demo」。你如果在做 local \u003Ca href=\"\u002Ftag\u002Fai-工具\">AI 工具\u003C\u002Fa>，這版很適合先跑一次基準測試，再決定要不要升。\u003C\u002Fp>\u003Cp>下一步很明確：拿你自己的 16k、32k、64k prompt 直接測。不要只看短測資。長上下文才會告訴你，這版到底有沒有真的幫上忙。\u003C\u002Fp>","oMLX 0.4.5.dev1 為 GLM-5.2 和 MiniMax M3 加入自訂 kernel，長上下文 prefill 明顯加速，也修掉 cache 與 benchmark 載入問題。","github.com","https:\u002F\u002Fgithub.com\u002Fjundot\u002Fomlx\u002Freleases",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782709372375-25nm.png","model-release","zh","b4840252-4311-4c44-9814-4a3d1666302f",[17,18,19,20,21,22,23,24,25],"oMLX","GLM-5.2","MiniMax M3","Apple Silicon","本機推論","prefill","cache","benchmark","LLM",[27,28,29,30],"oMLX 0.4.5.dev1 針對 GLM-5.2 與 MiniMax M3 加入自訂 kernels，長上下文 prefill 最有感。","官方數據顯示，32k 到 64k context 的 prefill 最高接近翻倍。","這版也修了 cache、benchmark loading、VLM preflight 等問題，讓數字更可信。","對做本機 AI 工具的人來說，這版比單純的性能宣傳更實用。",1,"2026-06-29T05:02:28.341041+00:00","2026-06-29T05:02:28.331+00:00","0ccb5d2e-69f1-4354-a3e0-cb370221cd95",{"tags":36,"relatedLang":37,"relatedPosts":41},[],{"id":15,"slug":38,"title":39,"language":40},"omlx-045-dev1-glm52-minimax-m3-speedups-en","oMLX 0.4.5.dev1 speeds up GLM-5.2 and MiniMax M3","en",[42,48,54,60,66,72],{"id":43,"slug":44,"title":45,"cover_image":46,"image_url":46,"created_at":47,"category":13},"edf8e66b-c717-4cc1-b15a-96839bb7bbcf","llama-legends-380-season-3-heroes-raids-zh","Llama Legends 3.8.0 推出 Season 3 英雄與突襲","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782711179415-qurv.png","2026-06-29T05:32:32.733919+00:00",{"id":49,"slug":50,"title":51,"cover_image":52,"image_url":52,"created_at":53,"category":13},"e6ae84b6-4e55-4ab2-a1cf-4a08e23cbc77","grok-45-private-beta-tesla-spacex-zh","Grok 4.5 先進 Tesla 和 SpaceX 內測","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782687769532-te5b.png","2026-06-28T23:02:22.915901+00:00",{"id":55,"slug":56,"title":57,"cover_image":58,"image_url":58,"created_at":59,"category":13},"186b266a-5b45-4bd4-85a4-5fa62fcc50dc","google-openrl-llm-fine-tuning-kubernetes-zh","Google OpenRL 把 RL 細調搬上 Kubernetes","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782572576166-gzxw.png","2026-06-27T15:02:27.036919+00:00",{"id":61,"slug":62,"title":63,"cover_image":64,"image_url":64,"created_at":65,"category":13},"9258a3d6-b70c-493d-84b9-c791df86f495","diffusiongemma-runs-fast-on-nvidia-rtx-dgx-zh","DiffusionGemma 在 RTX 與 DGX 跑很快","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782570778712-u643.png","2026-06-27T14:32:34.436232+00:00",{"id":67,"slug":68,"title":69,"cover_image":70,"image_url":70,"created_at":71,"category":13},"1f01e408-91a8-4d9b-839d-57e751bd646f","glm-52-beats-gpt-55-coding-benchmarks-zh","GLM-5.2 用更低成本打贏 GPT-5.5","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782564470376-xtcx.png","2026-06-27T12:47:27.330349+00:00",{"id":73,"slug":74,"title":75,"cover_image":76,"image_url":76,"created_at":77,"category":13},"611bdb86-e048-42b1-8bc5-c1adbd7fdcd9","openai-gpt-56-rollout-us-request-zh","OpenAI 收緊 GPT-5.6 上線節奏","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782555471713-w9pw.png","2026-06-27T10:17:28.515168+00:00",[79,84,89,94,99,104,109,114,119,124],{"id":80,"slug":81,"title":82,"created_at":83},"58b64033-7eb6-49b9-9aab-01cf8ae1b2f2","nvidia-rubin-six-chips-one-ai-supercomputer-zh","NVIDIA Rubin 把六顆晶片塞進 AI 機櫃","2026-03-26T07:18:45.861277+00:00",{"id":85,"slug":86,"title":87,"created_at":88},"0dcc2c61-c2a6-480d-adb8-dd225fc68914","march-2026-ai-model-news-what-mattered-zh","2026 年 3 月 AI 模型新聞重點","2026-03-26T07:32:08.386348+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"214ab08b-5ce5-4b5c-8b72-47619d8675dd","why-small-models-are-winning-on-device-ai-zh","小模型為何吃下裝置端 AI","2026-03-26T07:36:30.488966+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"785624b2-0355-4b82-adc3-de5e45eecd88","midjourney-v8-faster-images-higher-costs-zh","Midjourney V8 變快了，也變貴了","2026-03-26T07:52:03.562971+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"cda76b92-d209-4134-86c1-a60f5bc7b128","xiaomi-mimo-trio-agents-robots-voice-zh","小米 MiMo 三模型瞄準代理、機器人與語音","2026-03-28T03:05:08.779489+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"9e1044b4-946d-47fe-9e2a-c2ee032e1164","xiaomi-mimo-v2-pro-1t-moe-agents-zh","小米 MiMo-V2-Pro 登場：1T MoE 模型","2026-03-28T03:06:19.002353+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"c4b6186f-bd84-4598-997e-c6e31d543c0d","cursor-composer-2-agentic-coding-model-zh","Cursor Composer 2 走向代理式寫碼","2026-03-28T03:13:06.422716+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"e112e76f-ec3b-408f-810e-e93ae21a888a","apple-siri-gemini-distilled-models-zh","Apple Siri 牽手 Gemini 的真相","2026-03-29T04:52:57.886544+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"c679b51f-194a-463b-87fc-7695256ff752","mimo-v2-pro-vs-omni-vs-flash-2026-zh","MiMo V2 Pro、Omni、Flash 怎麼選","2026-04-02T01:18:43.576128+00:00",{"id":125,"slug":126,"title":127,"created_at":128},"3b988fd7-6749-4f01-ba25-c0ad7486dc31","z-ai-glm-5v-turbo-design2code-claude-zh","GLM-5V-Turbo 在 Design2Code 贏了…","2026-04-02T04:03:36.31741+00:00"]