OraCore Editors

AtomicBot’s llama.cpp fork boosts throughput on two fronts

Four ways AtomicBot’s llama.cpp fork speeds up Gemma 4 and Qwen 3.6, plus serving extras, with reported matrix-bench gains of 30-50% on the right setup.

This llama.cpp fork speeds up Gemma 4 and Qwen 3.6 with TurboQuant, MTP, and NextN.

AtomicBot-ai’s atomic-llama-cpp-turboquant fork is built around one clear promise: more tokens per second without changing your whole serving stack. The repo’s own matrix bench reports short-prompt throughput gains of about 30-50% for Gemma 4 MTP, and the TurboQuant path claims about 4.3× KV cache compression.

| Item | Best fit | Reported gain | Key constraint |
|---|---|---|---|
| Gemma 4 MTP | Bandwidth-bound Gemma 4 targets | ~30-50% short-prompt throughput | Uses an assistant head |
| Qwen 3.6 NextN | Qwen 3.6 dense and MoE models | ~24-36% on 35B-A3B, ~5-7% on 27B dense | Needs combined *_MTP.gguf |
| TurboQuant KV | Memory-heavy serving | ~4.3× KV compression | Best with turbo3 settings |
| TurboQuant weights | Lower-footprint deployments | Low-bit weight compression | Tradeoffs depend on backend |

1. Gemma 4 MTP speculative decoding

The strongest headline feature here is Multi-Token Prediction for Gemma 4. The fork loads the official gemma4_assistant head with --mtp-head, then overlaps draft work with target verification so the server can move faster on short prompts.

According to the repo’s matrix bench, this path can add about 30-50% throughput on Gemma 4 26B-A4B and 31B when using f16 KV. The implementation is also tuned to avoid the usual draft-model overhead: no second context, no second tokenizer, and no separate KV cache.

  • Works with Gemma 4 E2B, E4B, 26B-A4B, and 31B
  • Recommended assistant quant: Q4_K_M
  • Async pipeline uses llama_decode_mtp_async and llama_decode_mtp_wait
  • Best when the target is bandwidth-bound rather than compute-bound
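
As a rough way to reason about why a draft head helps, the sketch below estimates how many tokens a single target verification pass can yield. It assumes each drafted token is accepted independently with the same probability, which is a simplification and not something the repo states. The names `accept_prob` and `draft_len` are hypothetical illustration parameters, not settings of the fork.

```python
def expected_tokens_per_verify(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass.

    Simplifying assumptions (not from the repo):
    - each drafted token is accepted independently with probability accept_prob
    - the target always contributes one token of its own per pass
      (the correction after a rejection, or a bonus token if all drafts pass)
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # Geometric series: accept_prob**0 + accept_prob**1 + ... + accept_prob**draft_len
    return sum(accept_prob ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.8):
        print(p, expected_tokens_per_verify(p, draft_len=3))
```

This is an upper bound on the speedup, since it ignores the cost of producing the drafts. It also hints at why the gain depends on the target being bandwidth-bound: verifying a few extra tokens is cheap only when the pass is dominated by reading weights rather than by arithmetic.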

2. Qwen 3.6 NextN speculative decoding

For Qwen users, the fork adds NextN speculative decoding through --spec-type nextn and --model-draft. The draft context reuses the target llama_model, so it avoids a second mmap and keeps the serving setup simpler than a separate draft model.

The repo reports a tokens-per-second improvement of about 24-36% on Qwen 3.6 35B-A3B MoE, and about 5-7% on the 27B dense model in a single-slot test on a MacBook Pro M4 Max. That makes it a practical pick when you want more speed but do not want to rebuild your pipeline around a separate assistant model.

  • Targets Qwen 3.6 27B dense and 35B-A3B MoE
  • Uses combined *_MTP.gguf drafts
  • Recommended with the AtomicChat Qwen 3.6 UDT collection
  • Draft tensors are pinned to Q8_0 for acceptance stability
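
For scripted launches, the two flags named above can be assembled into an argument list. This is a minimal sketch under stated assumptions: the binary name `llama-server` and the `-m` flag come from upstream llama.cpp and are assumed to be unchanged in the fork, and both file names are hypothetical placeholders. Whether the target and draft paths should point at the same combined file is not stated here, so check the repo's README before relying on it.

```python
import shlex


def nextn_server_args(target_gguf: str, draft_gguf: str,
                      binary: str = "llama-server") -> list[str]:
    """Build an argv list for a NextN speculative-decoding launch.

    --spec-type nextn and --model-draft are the flags named in the article;
    the binary name and -m are assumed to match upstream llama.cpp.
    """
    return [
        binary,
        "-m", target_gguf,            # target model (hypothetical path)
        "--spec-type", "nextn",       # select NextN speculative decoding
        "--model-draft", draft_gguf,  # combined *_MTP.gguf draft (hypothetical path)
    ]


# Hypothetical file names, for illustration only.
args = nextn_server_args("qwen3.6-35b-a3b.gguf", "qwen3.6-35b-a3b_MTP.gguf")
print(shlex.join(args))
```

Keeping the arguments as a list rather than a single string lets you pass them straight to `subprocess.run` without shell quoting issues.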

3. TurboQuant KV cache compression

TurboQuant is the other major speed path in this fork. It applies WHT-rotated low-bit quantization to the KV cache, with backend-native kernels for Metal TurboFlash, CUDA, Vulkan, and HIP. The practical result is much smaller KV memory use, which matters when context length or batch pressure starts to dominate.

The project says -ctk turbo3 -ctv turbo3 gives about 4.3× KV compression. That is a strong fit for models that are memory-bound, especially when you want to keep more of the working set on device instead of losing performance to memory traffic.

-ctk turbo3 -ctv turbo3 --draft-block-size 3 -ngl 99 -ngld 99
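
To put the 4.3× figure in context, the following back-of-the-envelope sketch estimates KV cache size at f16 and then applies the reported ratio. The model dimensions are hypothetical and do not describe any specific Gemma or Qwen checkpoint, and the formula assumes every layer keeps a full-length cache, which does not hold for models that use sliding-window attention on some layers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV cache size in bytes for one sequence.

    K and V each store n_kv_heads * head_dim values per token per layer,
    hence the leading factor of 2. bytes_per_elem=2.0 corresponds to f16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 2 ** 30
REPORTED_RATIO = 4.3  # compression ratio quoted by the project for turbo3

# Hypothetical dimensions, chosen only to make the arithmetic concrete.
f16_bytes = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
turbo_bytes = f16_bytes / REPORTED_RATIO

print(f"f16 KV cache:   {f16_bytes / GIB:.2f} GiB")
print(f"at 4.3x ratio:  {turbo_bytes / GIB:.2f} GiB")
```

With these made-up dimensions, a 6 GiB f16 cache shrinks to roughly 1.4 GiB, which is the kind of difference that decides whether a long context fits on device.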

4. TurboQuant weight compression

Beyond KV cache savings, the fork also supports low-bit weight compression with formats like TQ4_1S and TQ3_1S. That gives you another way to reduce footprint before inference even starts, which can matter on laptops, smaller GPUs, and mixed CPU-GPU deployments.

This is not just a storage trick. Smaller weights can reduce load time and memory pressure, and they pair well with the project’s broader goal of making llama.cpp more efficient without forcing a specialized runtime. If you are already comfortable with GGUF workflows, this option is easy to test.

  • Weight formats mentioned: TQ4_1S, TQ3_1S
  • Useful when model size is the main bottleneck
  • Pairs naturally with quantized assistant heads
  • Fits the same llama.cpp serving flow

5. Multimodal and cache-friendly serving extras

The fork also extends speculative decoding into multimodal serving. The README says --mmproj can be loaded alongside MTP, NextN, or Eagle3 on a single slot, with text turns benefiting from draft acceleration while image-bearing turns fall back to plain target decoding.

Another practical detail is the Hugging Face cache migration for -hf downloads. Models now land in the standard Hugging Face cache directory, which makes them easier to share with other tools and less annoying to manage across environments.

  • Single-slot multimodal support with speculative decoding
  • Text turns can use draft acceleration
  • Image turns stay on target decoding
  • Hugging Face cache layout now matches standard tooling
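
If you want to find where -hf downloads are likely to land, the sketch below follows the resolution order documented for the huggingface_hub library: `HF_HUB_CACHE` first, then `HF_HOME`, then `XDG_CACHE_HOME`, then `~/.cache/huggingface/hub`. Whether the fork honors each of these variables in exactly this order is an assumption, so treat the result as a starting point rather than a guarantee.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def hf_hub_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the standard Hugging Face hub cache directory.

    Mirrors the documented huggingface_hub defaults; it is assumed, not
    confirmed, that the fork resolves the directory the same way.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    if env.get("XDG_CACHE_HOME"):
        return Path(env["XDG_CACHE_HOME"]).expanduser() / "huggingface" / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


if __name__ == "__main__":
    print(hf_hub_cache_dir())
```

Passing the environment in as a mapping keeps the function easy to test and lets you check what a different deployment would resolve to without changing your own shell.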

How to decide

If you run Gemma 4 and your bottleneck is memory bandwidth, start with MTP plus TurboQuant KV. If you run Qwen 3.6, NextN is the more direct path, especially for the 35B-A3B MoE where the repo reports the biggest uplift. In both cases, the fork is most useful when you want speed gains without leaving llama.cpp.

If you are mainly trying to shrink memory use, TurboQuant KV and weight compression are the first things to test. If your workload is mostly text and you care about short-prompt latency, MTP is the most compelling feature. If you serve mixed image and text traffic, the multimodal path is worth a look, but expect the image turns to behave like regular target decoding.