AtomicBot’s llama.cpp fork boosts throughput on two fronts
4 ways AtomicBot’s llama.cpp fork speeds up Gemma 4 and Qwen 3.6, with reported matrix-bench gains of 30-50% on the right setup.

This llama.cpp fork speeds up Gemma 4 and Qwen 3.6 with TurboQuant, MTP, and NextN.
AtomicBot-ai’s atomic-llama-cpp-turboquant fork is built around one clear promise: more tokens per second without changing your whole serving stack. The repo’s own matrix bench reports short-prompt throughput gains of about 30-50% for Gemma 4 MTP, and the TurboQuant path claims about 4.3× KV compression.
| Item | Best fit | Reported gain | Key constraint |
|---|---|---|---|
| Gemma 4 MTP | Bandwidth-bound Gemma 4 targets | ~30-50% short-prompt throughput | Uses an assistant head |
| Qwen 3.6 NextN | Qwen 3.6 dense and MoE models | ~24-36% on 35B-A3B, ~5-7% on 27B dense | Needs combined *_MTP.gguf |
| TurboQuant KV | Memory-heavy serving | ~4.3× KV compression | Best with turbo3 settings |
| TurboQuant weights | Lower-footprint deployments | Low-bit weight compression | Tradeoffs depend on backend |
1. Gemma 4 MTP speculative decoding
The strongest headline feature here is Multi-Token Prediction for Gemma 4. The fork loads the official gemma4_assistant head with --mtp-head, then overlaps draft work with target verification so the server can move faster on short prompts.

According to the repo’s matrix bench, this path can add about 30-50% throughput on Gemma 4 26B-A4B and 31B when using f16 KV. The implementation is also tuned to avoid the usual draft-model overhead: no second context, no second tokenizer, and no separate KV cache.
- Works with Gemma 4 E2B, E4B, 26B-A4B, and 31B
- Recommended assistant quant: Q4_K_M
- Async pipeline uses `llama_decode_mtp_async` and `llama_decode_mtp_wait`
- Best when the target is bandwidth-bound rather than compute-bound
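The draft-and-verify idea behind speculative decoding can be sketched in a few lines. This is a toy, single-threaded illustration with greedy acceptance, not the fork's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the target model and the assistant head, and the real pipeline is described as overlapping the two stages asynchronously rather than running them in sequence.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

def speculative_step(tokens: List[int], draft_next: NextToken,
                     target_next: NextToken, block: int = 3) -> List[int]:
    """Run one draft-and-verify round and return the newly accepted tokens."""
    # 1. Draft: the cheap head proposes `block` tokens autoregressively.
    proposed: List[int] = []
    for _ in range(block):
        proposed.append(draft_next(tokens + proposed))

    # 2. Verify: the target checks each proposal in order. A real server would
    #    do this in one batched forward pass instead of one call per position.
    accepted: List[int] = []
    for tok in proposed:
        expected = target_next(tokens + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token and stop
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target contributes one extra token.
    accepted.append(target_next(tokens + accepted))
    return accepted

# Toy models: the target counts upward; the draft disagrees after multiples of 5.
def target_next(ctx: List[int]) -> int:
    return ctx[-1] + 1

def draft_next(ctx: List[int]) -> int:
    return ctx[-1] + 1 if ctx[-1] % 5 else 0
```

In this sketch the output always matches what the target alone would have produced; the draft only changes how many tokens are confirmed per round, which is why acceptance rate drives the speedup.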
2. Qwen 3.6 NextN speculative decoding
For Qwen users, the fork adds NextN speculative decoding through --spec-type nextn and --model-draft. The draft context reuses the target llama_model, so it avoids a second mmap and keeps the serving setup simpler than a separate draft model.
The repo reports a tokens-per-second improvement of about 24-36% on Qwen 3.6 35B-A3B MoE, and about 5-7% on the 27B dense model, in a single-slot test on a MacBook Pro M4 Max. That makes it a practical pick when you want more speed but do not want to rebuild your pipeline around a separate assistant model.
- Targets Qwen 3.6 27B dense and 35B-A3B MoE
- Uses combined `*_MTP.gguf` drafts
- Recommended with the AtomicChat Qwen 3.6 UDT collection
- Draft tensors are pinned to Q8_0 for acceptance stability
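As a rough sketch, a launch command for this path could be assembled as below. Only `--spec-type nextn` and `--model-draft` come from the article; the `llama-server` binary name and the `-m` flag are taken from upstream llama.cpp and assumed to be unchanged in the fork, and the file names are hypothetical. The article does not say whether the target and draft arguments point at the same combined file, so the helper takes both.

```python
import shlex
from typing import List

def nextn_command(target_gguf: str, draft_gguf: str) -> List[str]:
    """Assemble an argv list using the NextN flags named in the article."""
    if not draft_gguf.endswith("_MTP.gguf"):
        raise ValueError("expected a combined *_MTP.gguf draft file")
    return [
        "llama-server",          # assumption: upstream binary name is unchanged
        "-m", target_gguf,       # assumption: upstream flag for the target model
        "--spec-type", "nextn",
        "--model-draft", draft_gguf,
    ]

# Hypothetical file names, for illustration only.
cmd = nextn_command("qwen3.6-35b-a3b.gguf", "qwen3.6-35b-a3b_MTP.gguf")
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps paths with spaces intact when it is handed to `subprocess.run`.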
3. TurboQuant KV cache compression
TurboQuant is the other major speed path in this fork. It applies WHT-rotated low-bit quantization to the KV cache, with backend-native kernels for Metal TurboFlash, CUDA, Vulkan, and HIP. The practical result is much smaller KV memory use, which matters when context length or batch pressure starts to dominate.
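To make "WHT-rotated low-bit quantization" concrete, here is a generic sketch of the idea: rotate a vector with a Walsh-Hadamard transform, quantize the rotated values to a few bits, then invert. It is not TurboQuant's actual scheme; the symmetric uniform quantizer, the 3-bit width, and the sample vector are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

def fwht(x: List[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def quantize(x: List[float], bits: int = 3) -> Tuple[List[int], float]:
    """Symmetric uniform quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax or 1.0  # avoid a zero scale
    return [max(-qmax, min(qmax, round(v / scale))) for v in x], scale

def dequantize(q: List[int], scale: float) -> List[float]:
    return [v * scale for v in q]

# One outlier dominates the raw vector; the rotation spreads it across all slots.
vec = [8.0, 0.1, -0.2, 0.1, 0.0, 0.2, -0.1, 0.1]
rotated = fwht(vec)
codes, scale = quantize(rotated)
restored = fwht(dequantize(codes, scale))  # the orthonormal WHT is its own inverse
```

Because the normalized transform is its own inverse, the same function handles both the rotation and the reconstruction, which keeps the decode side cheap.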

The project says -ctk turbo3 -ctv turbo3 gives about 4.3× KV compression. That is a strong fit for memory-bound workloads, especially when you want to keep more of the working set on device instead of losing performance to memory traffic.
- `-ctk turbo3 -ctv turbo3`
- `--draft-block-size 3`
- `-ngl 99 -ngld 99`
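A back-of-the-envelope estimate shows what a 4.3× ratio means in practice. The sketch below assumes a plain attention layout where every layer caches full-length K and V, and the model dimensions are hypothetical placeholders rather than the shape of any Gemma or Qwen model; only the 4.3× figure comes from the project.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float = 2.0) -> float:
    """K and V each store n_layers * n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

TURBO3_RATIO = 4.3  # compression ratio reported by the project

# Hypothetical model shape, chosen only to make the arithmetic concrete.
f16 = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
turbo3 = f16 / TURBO3_RATIO

print(f"f16 KV:    {f16 / 2**30:.2f} GiB")
print(f"turbo3 KV: {turbo3 / 2**30:.2f} GiB")
```

With these placeholder dimensions the f16 cache comes to 6.00 GiB and the compressed estimate to about 1.40 GiB, which is the kind of headroom that matters when context length grows.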
4. TurboQuant weight compression
Beyond KV cache savings, the fork also supports low-bit weight compression with formats like TQ4_1S and TQ3_1S. That gives you another way to reduce footprint before inference even starts, which can matter on laptops, smaller GPUs, and mixed CPU-GPU deployments.
This is not just a storage trick. Smaller weights can reduce load time and memory pressure, and they pair well with the project’s broader goal of making llama.cpp more efficient without forcing a specialized runtime. If you are already comfortable with GGUF workflows, this option is easy to test.
- Weight formats mentioned: TQ4_1S, TQ3_1S
- Useful when model size is the main bottleneck
- Pairs naturally with quantized assistant heads
- Fits the same llama.cpp serving flow
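The footprint effect of lower-bit weights is simple arithmetic, sketched below. The article does not give bits-per-weight figures for TQ4_1S or TQ3_1S, so the values here are placeholders for "roughly 4-bit" and "roughly 3-bit" formats, and the 31e9 parameter count is a round number standing in for a 31B model.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB, ignoring metadata and unquantized tensors."""
    return n_params * bits_per_weight / 8 / 2**30

# Placeholder bit widths for illustration; check the repo for the real figures.
for label, bpw in [("f16", 16.0), ("~4-bit", 4.5), ("~3-bit", 3.5)]:
    print(f"{label:>7}: {weight_gib(31e9, bpw):6.1f} GiB")
```

Real GGUF files usually keep some tensors at higher precision, so measured sizes will differ from this estimate.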
5. Multimodal and cache-friendly serving extras
The fork also extends speculative decoding into multimodal serving. The README says --mmproj can be loaded alongside MTP, NextN, or Eagle3 on a single slot, with text turns benefiting from draft acceleration while image-bearing turns fall back to plain target decoding.
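The per-turn fallback can be pictured as a simple routing rule, and an Amdahl-style estimate shows how much of the text-turn speedup survives in mixed traffic. Both functions below are illustrative sketches with hypothetical names; they are not taken from the fork's code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    text: str
    images: List[bytes] = field(default_factory=list)

def decode_path(turn: Turn) -> str:
    """Image-bearing turns use plain target decoding; text turns use the draft."""
    return "target-only" if turn.images else "speculative"

def blended_speedup(text_fraction: float, text_speedup: float) -> float:
    """Overall speedup when only text turns are accelerated.

    `text_fraction` is the share of baseline decode time spent on text turns.
    """
    if not 0.0 <= text_fraction <= 1.0 or text_speedup <= 0:
        raise ValueError("text_fraction must be in [0, 1] and text_speedup > 0")
    return 1.0 / ((1.0 - text_fraction) + text_fraction / text_speedup)
```

For example, if text turns account for 70% of baseline decode time and run 1.3× faster, `blended_speedup(0.7, 1.3)` gives an overall gain of about 1.19×.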
Another practical detail is the Hugging Face cache migration for -hf downloads. Models now land in the standard Hugging Face cache directory, which makes them easier to share with other tools and less annoying to manage across environments.
- Single-slot multimodal support with speculative decoding
- Text turns can use draft acceleration
- Image turns stay on target decoding
- Hugging Face cache layout now matches standard tooling
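For locating downloaded models, the sketch below resolves the standard Hugging Face hub cache path using the defaults documented for `huggingface_hub` (`HF_HUB_CACHE`, then `HF_HOME`, then `~/.cache/huggingface`). It omits `XDG_CACHE_HOME` handling for brevity, and whether the fork honors the same environment variables is an assumption; the repo id in the usage line is hypothetical.

```python
import os
from pathlib import Path

def hf_hub_cache() -> Path:
    """Resolve the hub cache directory from the documented environment defaults."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"]).expanduser()
    hf_home = os.environ.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "hub"

def repo_cache_dir(repo_id: str) -> Path:
    """Map 'org/name' to the 'models--org--name' folder used inside the hub cache."""
    return hf_hub_cache() / ("models--" + repo_id.replace("/", "--"))

# Hypothetical repo id, for illustration only.
print(repo_cache_dir("some-org/some-model-GGUF"))
```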
How to decide
If you run Gemma 4 and your bottleneck is memory bandwidth, start with MTP plus TurboQuant KV. If you run Qwen 3.6, NextN is the more direct path, especially for the 35B-A3B MoE where the repo reports the biggest uplift. In both cases, the fork is most useful when you want speed gains without leaving llama.cpp.
If you are mainly trying to shrink memory use, TurboQuant KV and weight compression are the first things to test. If your workload is mostly text and you care about short-prompt latency, MTP is the most compelling feature. If you serve mixed image and text traffic, the multimodal path is worth a look, but expect the image turns to behave like regular target decoding.
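These rules of thumb can be written down as a small helper. The function name, arguments, and labels are illustrative only, and the logic simply restates the guidance above.

```python
from typing import List

def recommend(model_family: str, memory_bound: bool = False,
              serves_images: bool = False) -> List[str]:
    """Return the features worth testing first for a given setup."""
    picks: List[str] = []
    family = model_family.lower()
    if family.startswith("gemma"):
        picks.append("Gemma 4 MTP")
    elif family.startswith("qwen"):
        picks.append("Qwen 3.6 NextN")
    if memory_bound:
        picks += ["TurboQuant KV", "TurboQuant weights"]
    if serves_images:
        picks.append("multimodal (image turns use plain target decoding)")
    return picks

print(recommend("Gemma 4", memory_bound=True))
```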