[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-atomicbot-llama-cpp-fork-throughput-gains-en":3,"article-related-atomicbot-llama-cpp-fork-throughput-gains-en":35,"series-industry-cc87056f-b2e8-4ef0-966c-bf82ccffbb54":84},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":26,"views":31,"created_at":32,"published_at":33,"topic_cluster_id":34},"cc87056f-b2e8-4ef0-966c-bf82ccffbb54","atomicbot-llama-cpp-fork-throughput-gains-en","AtomicBot’s llama.cpp fork boosts throughput on two fronts","\u003Cp data-speakable=\"summary\">This llama.cpp fork speeds up Gemma 4 and Qwen 3.6 with \u003Ca href=\"\u002Ftag\u002Fturboquant\">TurboQuant\u003C\u002Fa>, MTP, and NextN.\u003C\u002Fp>\n\u003Cp>AtomicBot-ai’s \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FAtomicBot-ai\u002Fatomic-llama-cpp-turboquant\">atomic-llama-cpp-turboquant\u003C\u002Fa> fork is built around one clear promise: more tokens per second without changing your whole serving stack. 
The repo’s own matrix bench reports roughly 30-50% higher short-prompt throughput for Gemma 4 MTP, and the TurboQuant path claims about 4.3× KV compression.\u003C\u002Fp>\n\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Item\u003C\u002Fth>\u003Cth>Best fit\u003C\u002Fth>\u003Cth>Reported gain\u003C\u002Fth>\u003Cth>Key constraint\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Gemma 4 MTP\u003C\u002Ftd>\u003Ctd>Bandwidth-bound Gemma 4 targets\u003C\u002Ftd>\u003Ctd>~30-50% short-prompt throughput\u003C\u002Ftd>\u003Ctd>Uses an assistant head\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Qwen 3.6 NextN\u003C\u002Ftd>\u003Ctd>Qwen 3.6 dense and MoE models\u003C\u002Ftd>\u003Ctd>~24-36% on 35B-A3B, ~5-7% on 27B dense\u003C\u002Ftd>\u003Ctd>Needs combined *_MTP.gguf\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>TurboQuant KV\u003C\u002Ftd>\u003Ctd>Memory-heavy serving\u003C\u002Ftd>\u003Ctd>~4.3× KV compression\u003C\u002Ftd>\u003Ctd>Best with turbo3 settings\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>TurboQuant weights\u003C\u002Ftd>\u003Ctd>Lower-footprint deployments\u003C\u002Ftd>\u003Ctd>Low-bit weight compression\u003C\u002Ftd>\u003Ctd>Tradeoffs depend on backend\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\n\u003Ch2>1. Gemma 4 MTP speculative decoding\u003C\u002Fh2>\n\u003Cp>The strongest headline feature here is Multi-\u003Ca href=\"\u002Ftag\u002Ftoken\">Token\u003C\u002Fa> Prediction for Gemma 4. 
The fork loads the official gemma4_assistant head with \u003Ccode>--mtp-head\u003C\u002Fcode>, then overlaps draft work with target verification so the server can move faster on short prompts.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782332277361-4xh4.png\" alt=\"AtomicBot’s llama.cpp fork boosts throughput on two fronts\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\n\u003Cp>According to the repo’s matrix bench, this path can add about 30-50% throughput on Gemma 4 26B-A4B and 31B when using f16 KV. The implementation is also tuned to avoid the usual draft-model overhead: no second context, no second tokenizer, and no separate \u003Ca href=\"\u002Ftag\u002Fkv-cache\">KV cache\u003C\u002Fa>.\u003C\u002Fp>\n\u003Cul>\n  \u003Cli>Works with Gemma 4 E2B, E4B, 26B-A4B, and 31B\u003C\u002Fli>\n  \u003Cli>Recommended assistant quant: Q4_K_M\u003C\u002Fli>\n  \u003Cli>Async pipeline uses \u003Ccode>llama_decode_mtp_async\u003C\u002Fcode> and \u003Ccode>llama_decode_mtp_wait\u003C\u002Fcode>\u003C\u002Fli>\n  \u003Cli>Best when the target is bandwidth-bound rather than compute-bound\u003C\u002Fli>\n\u003C\u002Ful>\n\u003Ch2>2. Qwen 3.6 NextN speculative decoding\u003C\u002Fh2>\n\u003Cp>For Qwen users, the fork adds NextN speculative decoding through \u003Ccode>--spec-type nextn\u003C\u002Fcode> and \u003Ccode>--model-draft\u003C\u002Fcode>. The draft context reuses the target \u003Ccode>llama_model\u003C\u002Fcode>, so it avoids a second mmap and keeps the serving setup simpler than a separate draft model.\u003C\u002Fp>\n\u003Cp>The repo says this lands about 24-36% tokens-per-second improvement on Qwen 3.6 35B-A3B MoE, and about 5-7% on the 27B dense model in a MacBook Pro M4 Max single-slot test. 
That makes it a practical pick when you want more speed but do not want to rebuild your pipeline around a separate assistant model.\u003C\u002Fp>\n\u003Cul>\n  \u003Cli>Targets Qwen 3.6 27B dense and 35B-A3B MoE\u003C\u002Fli>\n  \u003Cli>Uses combined \u003Ccode>*_MTP.gguf\u003C\u002Fcode> drafts\u003C\u002Fli>\n  \u003Cli>Recommended with the AtomicChat Qwen 3.6 UDT collection\u003C\u002Fli>\n  \u003Cli>Draft tensors are pinned to Q8_0 for acceptance stability\u003C\u002Fli>\n\u003C\u002Ful>\n\u003Ch2>3. TurboQuant KV cache compression\u003C\u002Fh2>\n\u003Cp>TurboQuant is the other major speed path in this fork. It applies WHT-rotated low-bit quantization to the KV cache, with backend-native kernels for Metal TurboFlash, \u003Ca href=\"\u002Ftag\u002Fcuda\">CUDA\u003C\u002Fa>, Vulkan, and HIP. The practical result is much smaller KV memory use, which matters when context length or batch pressure starts to dominate.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782332271352-xddf.png\" alt=\"AtomicBot’s llama.cpp fork boosts throughput on two fronts\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\n\u003Cp>The project says \u003Ccode>-ctk turbo3 -ctv turbo3\u003C\u002Fcode> gives about 4.3× KV compression. That is a strong fit for models that are memory-bound, especially when you want to keep more of the working set on device instead of losing throughput to memory traffic.\u003C\u002Fp>\n\u003Ccode>-ctk turbo3 -ctv turbo3\n--draft-block-size 3\n-ngl 99 -ngld 99\u003C\u002Fcode>\n\u003Ch2>4. TurboQuant weight compression\u003C\u002Fh2>\n\u003Cp>Beyond KV cache savings, the fork also supports low-bit weight compression with formats like TQ4_1S and TQ3_1S. 
That gives you another way to reduce footprint before \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa> even starts, which can matter on laptops, smaller GPUs, and mixed CPU-GPU deployments.\u003C\u002Fp>\n\u003Cp>This is not just a storage trick. Smaller weights can reduce load time and memory pressure, and they pair well with the project’s broader goal of making llama.cpp more efficient without forcing a specialized runtime. If you are already comfortable with GGUF workflows, this option is easy to test.\u003C\u002Fp>\n\u003Cul>\n  \u003Cli>Weight formats mentioned: TQ4_1S, TQ3_1S\u003C\u002Fli>\n  \u003Cli>Useful when model size is the main bottleneck\u003C\u002Fli>\n  \u003Cli>Pairs naturally with quantized assistant heads\u003C\u002Fli>\n  \u003Cli>Fits the same llama.cpp serving flow\u003C\u002Fli>\n\u003C\u002Ful>\n\u003Ch2>5. Multimodal and cache-friendly serving extras\u003C\u002Fh2>\n\u003Cp>The fork also extends speculative decoding into multimodal serving. The README says \u003Ccode>--mmproj\u003C\u002Fcode> can be loaded alongside MTP, NextN, or Eagle3 on a single slot, with text turns benefiting from draft acceleration while image-bearing turns fall back to plain target decoding.\u003C\u002Fp>\n\u003Cp>Another practical detail is the Hugging Face cache migration for \u003Ccode>-hf\u003C\u002Fcode> downloads. 
Models now land in the standard Hugging Face cache directory, which makes them easier to share with other tools and less annoying to manage across environments.\u003C\u002Fp>\n\u003Cul>\n  \u003Cli>Single-slot multimodal support with speculative decoding\u003C\u002Fli>\n  \u003Cli>Text turns can use draft acceleration\u003C\u002Fli>\n  \u003Cli>Image turns stay on target decoding\u003C\u002Fli>\n  \u003Cli>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\">Hugging Face\u003C\u002Fa> cache layout now matches standard tooling\u003C\u002Fli>\n\u003C\u002Ful>\n\u003Ch2>How to decide\u003C\u002Fh2>\n\u003Cp>If you run Gemma 4 and your bottleneck is memory bandwidth, start with MTP plus TurboQuant KV. If you run Qwen 3.6, NextN is the more direct path, especially for the 35B-A3B MoE where the repo reports the biggest uplift. In both cases, the fork is most useful when you want speed gains without leaving llama.cpp.\u003C\u002Fp>\n\u003Cp>If you are mainly trying to shrink memory use, TurboQuant KV and weight compression are the first things to test. If your workload is mostly text and you care about short-prompt latency, MTP is the most compelling feature. 
If you serve mixed image and text traffic, the multimodal path is worth a look, but expect the image turns to behave like regular target decoding.\u003C\u002Fp>","4 ways AtomicBot’s llama.cpp fork speeds up Gemma 4 and Qwen 3.6, with matrix-bench gains up to 30-50% on the right setup.","github.com","https:\u002F\u002Fgithub.com\u002FAtomicBot-ai\u002Fatomic-llama-cpp-turboquant",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782332277361-4xh4.png","industry","en","493ea70d-fffd-4365-ba76-63069ada5744",[17,18,19,20,21,22,23,24,25],"llama.cpp","TurboQuant","Gemma 4","MTP","Qwen 3.6","NextN","speculative decoding","KV cache compression","GGUF",[27,28,29,30],"Gemma 4 MTP is the biggest throughput win when the model is bandwidth-bound.","Qwen 3.6 NextN is best for users who want speculative decoding without a separate draft model.","TurboQuant cuts KV memory use sharply and can also compress weights.","Multimodal serving works, but only text turns benefit from draft acceleration.",0,"2026-06-24T20:17:29.158539+00:00","2026-06-24T20:17:29.152+00:00","72af1271-7288-4720-8714-09bfdc439fa0",{"tags":36,"relatedLang":43,"relatedPosts":47},[37,39,41],{"name":19,"slug":38},"gemma-4",{"name":17,"slug":40},"llamacpp",{"name":18,"slug":42},"turboquant",{"id":15,"slug":44,"title":45,"language":46},"atomicbot-llama-cpp-fork-throughput-gains-zh","AtomicBot 的 llama.cpp 分支，兩條路都加速","zh",[48,54,60,66,72,78],{"id":49,"slug":50,"title":51,"cover_image":52,"image_url":52,"created_at":53,"category":13},"06fa9a5f-2245-41f7-89da-f6b91cb208d7","gemini-3-5-pro-delay-google-ai-cycle-en","Gemini 3.5 Pro 
delay exposes Google’s pacing problem","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782349366519-are3.png","2026-06-25T01:02:20.930248+00:00",{"id":55,"slug":56,"title":57,"cover_image":58,"image_url":58,"created_at":59,"category":13},"dd7c45c5-3970-4c63-8acf-e2b47a0944a8","worthing-watersports-duotone-demo-wales-en","Worthing Watersports brings Duotone demo gear to Wales","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782345768727-o94m.png","2026-06-25T00:02:22.940982+00:00",{"id":61,"slug":62,"title":63,"cover_image":64,"image_url":64,"created_at":65,"category":13},"b10159ab-4111-48da-bc2a-f64cbff423ef","chen-liwu-intel-packaging-materials-podcast-en","Lip-Bu Tan turns Intel into a materials company","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782342206068-n3vd.png","2026-06-24T23:02:57.988319+00:00",{"id":67,"slug":68,"title":69,"cover_image":70,"image_url":70,"created_at":71,"category":13},"372f6e06-007b-4110-93dc-851c736aaae9","zilliz-vector-lakebase-unified-ai-data-platform-en","Zilliz Vector Lakebase turns vector search into one platform","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782339470615-jdtd.png","2026-06-24T22:17:21.447456+00:00",{"id":73,"slug":74,"title":75,"cover_image":76,"image_url":76,"created_at":77,"category":13},"118ac217-a559-4f44-b0d2-70e1ef77e7f3","apples-gemini-siri-deal-ai-app-strategy-en","Apple’s Gemini Siri deal rewrites AI app 
strategy","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782333172815-df6f.png","2026-06-24T20:32:27.384211+00:00",{"id":79,"slug":80,"title":81,"cover_image":82,"image_url":82,"created_at":83,"category":13},"bb903842-e570-466d-8f1f-2e1c20f15fd9","nvidia-ceo-ai-lift-software-stocks-en","Nvidia CEO Says AI Can Lift Software Stocks","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782328672666-jtb5.png","2026-06-24T19:17:28.11369+00:00",[85,90,95,100,105,110,115,120,125,130],{"id":86,"slug":87,"title":88,"created_at":89},"d35a1bd9-e709-412e-a2df-392df1dc572a","ai-impact-2026-developments-market-en","AI's Impact in 2026: Key Developments and Market Shifts","2026-03-25T16:20:33.205823+00:00",{"id":91,"slug":92,"title":93,"created_at":94},"5ed27921-5fd6-492e-8c59-78393bf37710","trumps-ai-legislative-framework-en","Trump's AI Legislative Framework: What's Inside?","2026-03-25T16:22:20.005325+00:00",{"id":96,"slug":97,"title":98,"created_at":99},"e454a642-f03c-4794-b185-5f651aebbaca","nvidia-gtc-2026-key-highlights-innovations-en","NVIDIA GTC 2026: Key Highlights and Innovations","2026-03-25T16:22:47.882615+00:00",{"id":101,"slug":102,"title":103,"created_at":104},"0ebb5b16-774a-4922-945d-5f2ce1df5a6d","claude-usage-diversifies-learning-curves-en","Claude Usage Diversifies, Learning Curves Emerge","2026-03-25T16:25:50.770376+00:00",{"id":106,"slug":107,"title":108,"created_at":109},"69934e86-2fc5-4280-8223-7b917a48ace8","openclaw-ai-commoditization-concerns-en","OpenClaw's Rise Raises Concerns of AI Model Commoditization","2026-03-25T16:26:30.582047+00:00",{"id":111,"slug":112,"title":113,"created_at":114},"b4b2575b-2ac8-46b2-b90e-ab1d7c060797","google-gemini-ai-rollout-2026-en","Google's Gemini AI Rollout Extended to 
2026","2026-03-25T16:28:14.808842+00:00",{"id":116,"slug":117,"title":118,"created_at":119},"6e18bc65-42ae-4ad0-b564-67d7f66b979e","meta-llama4-fabricated-results-scandal-en","Meta's Llama 4 Scandal: Fabricated AI Test Results Unveiled","2026-03-25T16:29:15.482836+00:00",{"id":121,"slug":122,"title":123,"created_at":124},"bf888e9d-08be-4f47-996c-7b24b5ab3500","accenture-mistral-ai-deployment-en","Accenture and Mistral AI Team Up for AI Deployment","2026-03-25T16:31:01.894655+00:00",{"id":126,"slug":127,"title":128,"created_at":129},"5382b536-fad2-49c6-ac85-9eb2bae49f35","mistral-ai-high-stakes-2026-en","Mistral AI: Facing High Stakes in 2026","2026-03-25T16:31:39.941974+00:00",{"id":131,"slug":132,"title":133,"created_at":134},"9da3d2d6-b669-4971-ba1d-17fdb3548ed5","cursors-meteoric-rise-pressures-en","Cursor's Meteoric Rise Faces Industry Pressures","2026-03-25T16:32:21.899217+00:00"]