AtomicBot’s llama.cpp fork boosts throughput on two fronts

OraCore Editors

[IND] June 25, 2026, 6 min read

5 ways AtomicBot’s llama.cpp fork speeds up Gemma 4 and Qwen 3.6, with reported matrix-bench gains of roughly 30-50% on the right setup.

Gemma 4 llama.cpp TurboQuant

This llama.cpp fork speeds up Gemma 4 and Qwen 3.6 with TurboQuant, MTP, and NextN.

AtomicBot-ai’s atomic-llama-cpp-turboquant fork is built around one clear promise: more tokens per second without changing your whole serving stack. The repo’s own matrix bench reports short-prompt throughput gains of about 30-50% for Gemma 4 MTP, and the TurboQuant path claims about 4.3× KV cache compression.

Item	Best fit	Reported gain	Key constraint
Gemma 4 MTP	Bandwidth-bound Gemma 4 targets	~30-50% short-prompt throughput	Uses an assistant head
Qwen 3.6 NextN	Qwen 3.6 dense and MoE models	~24-36% on 35B-A3B, ~5-7% on 27B dense	Needs combined *_MTP.gguf
TurboQuant KV	Memory-heavy serving	~4.3× KV compression	Best with turbo3 settings
TurboQuant weights	Lower-footprint deployments	Low-bit weight compression	Tradeoffs depend on backend

1. Gemma 4 MTP speculative decoding

The strongest headline feature here is Multi-Token Prediction for Gemma 4. The fork loads the official gemma4_assistant head with --mtp-head, then overlaps draft work with target verification so the server can move faster on short prompts.

According to the repo’s matrix bench, this path can add about 30-50% throughput on Gemma 4 26B-A4B and 31B when using f16 KV. The implementation is also tuned to avoid the usual draft-model overhead: no second context, no second tokenizer, and no separate KV cache.

Works with Gemma 4 E2B, E4B, 26B-A4B, and 31B
Recommended assistant quant: Q4_K_M
Async pipeline uses llama_decode_mtp_async and llama_decode_mtp_wait
Best when the target is bandwidth-bound rather than compute-bound
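
To see why acceptance rate and draft cost drive the gain, the standard speculative-decoding estimate can be sketched in a few lines. This is a generic textbook model, not the fork's scheduler: it assumes every drafted token is accepted independently with probability `accept_rate`, and `draft_cost` (the cost of one draft step relative to one target pass) is a hypothetical input. It does not model the async overlap described above, so treat the numbers as a rough intuition only.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass.

    Under the i.i.d. acceptance assumption this is the geometric sum
    1 + a + a^2 + ... + a^k for k drafted tokens.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Tokens per pass divided by the relative cost of one draft-plus-verify cycle."""
    cycle_cost = 1.0 + draft_len * draft_cost  # one target pass plus k draft steps
    return expected_tokens_per_pass(accept_rate, draft_len) / cycle_cost


if __name__ == "__main__":
    # Illustrative inputs only; none of these values come from the repo.
    for a in (0.5, 0.7, 0.9):
        print(a, round(estimated_speedup(a, draft_len=3, draft_cost=0.05), 2))
```

A cheap draft step keeps `cycle_cost` close to 1, which is consistent with the article's point that the approach pays off most when the target model is bandwidth-bound.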

2. Qwen 3.6 NextN speculative decoding

For Qwen users, the fork adds NextN speculative decoding through --spec-type nextn and --model-draft. The draft context reuses the target llama_model, so it avoids a second mmap and keeps the serving setup simpler than a separate draft model.

The repo reports a tokens-per-second improvement of about 24-36% on Qwen 3.6 35B-A3B MoE, and about 5-7% on the 27B dense model in a MacBook Pro M4 Max single-slot test. That makes it a practical pick when you want more speed but do not want to rebuild your pipeline around a separate assistant model.

Targets Qwen 3.6 27B dense and 35B-A3B MoE
Uses combined *_MTP.gguf drafts
Recommended with the AtomicChat Qwen 3.6 UDT collection
Draft tensors are pinned to Q8_0 for acceptance stability
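
As a rough illustration of how the flags above might fit together, here is a small Python helper that assembles a server command line. Only `--spec-type nextn` and `--model-draft` come from the article; the `llama-server` binary name and `-m` flag follow upstream llama.cpp conventions, and the file names are hypothetical. Which file belongs with which flag is an assumption here, so check the fork's README before relying on it.

```python
import shlex


def nextn_server_args(target_gguf, draft_gguf, binary="llama-server"):
    """Build an argv list for a NextN speculative-decoding run (sketch only)."""
    # The article says NextN needs a combined *_MTP.gguf draft file.
    if not draft_gguf.endswith("_MTP.gguf"):
        raise ValueError("expected a combined *_MTP.gguf draft file")
    return [
        binary,
        "-m", target_gguf,             # target model (upstream llama.cpp flag)
        "--spec-type", "nextn",        # flag named in the article
        "--model-draft", draft_gguf,   # flag named in the article
    ]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the shape of the command.
    args = nextn_server_args("qwen3.6-35b-a3b.gguf", "qwen3.6-35b-a3b_MTP.gguf")
    print(shlex.join(args))
```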

3. TurboQuant KV cache compression

TurboQuant is the other major speed path in this fork. It applies WHT-rotated low-bit quantization to the KV cache, with backend-native kernels for Metal TurboFlash, CUDA, Vulkan, and HIP. The practical result is much smaller KV memory use, which matters when context length or batch pressure starts to dominate.
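
The rotate-then-quantize idea can be sketched in plain Python. This is not the fork's kernel and says nothing about its actual bit layout or block sizes: it is a generic illustration that assumes an orthonormal Walsh-Hadamard transform (WHT) followed by symmetric uniform quantization with one scale per vector. All function names are hypothetical.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; applying it twice returns the input."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def quantize(values, bits):
    """Symmetric uniform quantization with a single scale for the whole vector."""
    if bits < 2:
        raise ValueError("bits must be at least 2")
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / levels if peak > 0 else 1.0
    codes = [max(-levels, min(levels, round(v / scale))) for v in values]
    return codes, scale


def rotate_quantize_roundtrip(values, bits=3):
    """Rotate, quantize, dequantize, and rotate back to the original basis."""
    codes, scale = quantize(fwht(values), bits)
    return fwht([c * scale for c in codes])


if __name__ == "__main__":
    spike = [1.0, 0.0, 0.0, 0.0]
    print(fwht(spike))                       # the spike is spread across all entries
    print(rotate_quantize_roundtrip(spike))  # reconstruction after 3-bit quantization
```

The rotation spreads a single large value across every coordinate, so one outlier no longer dictates the quantization scale for the rest of the vector. That is the usual motivation for rotating before low-bit quantization.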

The project says -ctk turbo3 -ctv turbo3 gives about 4.3× KV compression. That is a strong fit for memory-bound models, especially when you want to keep more of the working set on device rather than lose performance to extra memory traffic.

-ctk turbo3 -ctv turbo3
--draft-block-size 3
-ngl 99 -ngld 99
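
To put the reported ~4.3× figure in context, a back-of-the-envelope KV cache estimate helps. The sketch below uses the usual formula for a transformer KV cache and hypothetical model dimensions that are not taken from Gemma 4 or Qwen 3.6. It also ignores per-layer differences such as sliding-window attention, so real numbers will differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2.0):
    """Approximate KV cache size: K and V each hold one vector per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def compressed_bytes(f16_bytes, compression_ratio=4.3):
    """Apply a compression ratio; 4.3 is the figure the repo reports for turbo3."""
    return f16_bytes / compression_ratio


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to make the arithmetic concrete.
    f16 = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
    turbo = compressed_bytes(f16)
    gib = 1024 ** 3
    print(f"f16 KV:   {f16 / gib:.2f} GiB")
    print(f"~4.3x KV: {turbo / gib:.2f} GiB")
```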

4. TurboQuant weight compression

Beyond KV cache savings, the fork also supports low-bit weight compression with formats like TQ4_1S and TQ3_1S. That gives you another way to reduce footprint before inference even starts, which can matter on laptops, smaller GPUs, and mixed CPU-GPU deployments.

This is not just a storage trick. Smaller weights can reduce load time and memory pressure, and they pair well with the project’s broader goal of making llama.cpp more efficient without forcing a specialized runtime. If you are already comfortable with GGUF workflows, this option is easy to test.

Weight formats mentioned: TQ4_1S, TQ3_1S
Useful when model size is the main bottleneck
Pairs naturally with quantized assistant heads
Fits the same llama.cpp serving flow
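
A quick way to reason about weight footprint is parameters times bits per weight. The article does not state the effective bits per weight for TQ4_1S or TQ3_1S, so the sketch below only uses rates for upstream llama.cpp formats (F16, Q8_0, Q4_0) as reference points and leaves the TurboQuant rates for you to fill in from the repo. The parameter count is a hypothetical example.

```python
# Effective bits per weight for upstream llama.cpp formats.
# Q8_0 packs 32 weights into 34 bytes, Q4_0 packs 32 weights into 18 bytes.
REFERENCE_BPW = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def weight_bytes(n_params, bits_per_weight):
    """Approximate tensor storage, ignoring metadata and mixed-precision layers."""
    return n_params * bits_per_weight / 8


if __name__ == "__main__":
    n_params = 27e9  # hypothetical parameter count, for illustration only
    for name, bpw in REFERENCE_BPW.items():
        print(f"{name}: {weight_bytes(n_params, bpw) / 1024 ** 3:.1f} GiB")
```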

5. Multimodal and cache-friendly serving extras

The fork also extends speculative decoding into multimodal serving. The README says --mmproj can be loaded alongside MTP, NextN, or Eagle3 on a single slot, with text turns benefiting from draft acceleration while image-bearing turns fall back to plain target decoding.

Another practical detail is the Hugging Face cache migration for -hf downloads. Models now land in the standard Hugging Face cache directory, which makes them easier to share with other tools and less annoying to manage across environments.

Single-slot multimodal support with speculative decoding
Text turns can use draft acceleration
Image turns stay on target decoding
Hugging Face cache layout now matches standard tooling
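
The per-turn behavior described above amounts to a simple routing rule. The sketch below is a hypothetical model of that rule, not code from the fork: the `Turn` type and `decode_path` function are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    text: str
    image_count: int = 0  # number of images attached to this turn


def decode_path(turn, speculative_enabled=True):
    """Return which decoding path a turn would take under the described behavior."""
    if turn.image_count > 0:
        return "target"  # image-bearing turns fall back to plain target decoding
    return "speculative" if speculative_enabled else "target"


if __name__ == "__main__":
    print(decode_path(Turn("Summarize this paragraph.")))
    print(decode_path(Turn("What is in this picture?", image_count=1)))
```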

How to decide

If you run Gemma 4 and your bottleneck is memory bandwidth, start with MTP plus TurboQuant KV. If you run Qwen 3.6, NextN is the more direct path, especially for the 35B-A3B MoE where the repo reports the biggest uplift. In both cases, the fork is most useful when you want speed gains without leaving llama.cpp.

If you are mainly trying to shrink memory use, TurboQuant KV and weight compression are the first things to test. If your workload is mostly text and you care about short-prompt latency, MTP is the most compelling feature. If you serve mixed image and text traffic, the multimodal path is worth a look, but expect the image turns to behave like regular target decoding.
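
The guidance above can be condensed into a small lookup helper. This is only a restatement of the article's rules of thumb in code, with hypothetical names and input labels (`"gemma4"`, `"qwen3.6"`, `"bandwidth"`, `"memory"`, `"latency"`), not a tool shipped with the fork.

```python
def recommend(model_family, bottleneck, serves_images=False):
    """Map a model family and main bottleneck to the features worth testing first."""
    picks = []

    def add(feature):
        if feature not in picks:  # keep order, avoid duplicates
            picks.append(feature)

    family = model_family.lower()
    if family == "qwen3.6":
        add("Qwen 3.6 NextN")
    if family == "gemma4" and bottleneck in ("bandwidth", "latency"):
        add("Gemma 4 MTP")
    if family == "gemma4" and bottleneck == "bandwidth":
        add("TurboQuant KV")
    if bottleneck == "memory":
        add("TurboQuant KV")
        add("TurboQuant weights")
    if serves_images:
        add("multimodal via --mmproj")
    return picks


if __name__ == "__main__":
    print(recommend("gemma4", "bandwidth"))
    print(recommend("qwen3.6", "memory", serves_images=True))
```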
