OraCore Editors | 7 min read

V100 raw GGUF vs prepacked weight cache

This compares raw GGUF Q4_K kernels and prepacked weight caches for V100 decode inference.

This comparison is for people tuning small-M decode on a V100 and deciding whether to keep Q4_K weights in the original GGUF layout or pay the one-time cost to prepack them into a GPU-friendly cache.

At a glance

| Dimension | Raw GGUF layout | Prepacked weight cache |
| --- | --- | --- |
| Setup cost | 0 extra VRAM, no repack step | Extra offline or startup pack step, often 1-2x model load time |
| VRAM footprint | Lowest, bounded by original GGUF blocks | Higher; a practical cache can add 5-20% memory overhead |
| Kernel efficiency | Usually limited by unpack, address math, and irregular loads | Can cut integer work and improve warp-contiguous reads |
| Best batch regime | Works acceptably for M=1..4 when memory is tight | Usually better when decode is steady and weights are reused heavily |
| Volta fit | Safer if occupancy and cache pressure are already marginal | Better if you can trade memory for fewer instructions and cleaner access patterns |
| Typical outcome on V100 | Good baseline, but often leaves 10-30% on the table in Q4 unpack-heavy GEMMs | Can help more than raw layout if the kernel is instruction-bound rather than DRAM-bound |

Raw GGUF layout

Raw GGUF is the conservative choice because it preserves the original quantized blocks and avoids spending VRAM on a second copy. That matters on a V100-32GB when the safe cache budget is already fighting with KV cache, activations, and other live tensors. In your case, the fact that adding a cached down-projection displaced gate/up weights is exactly the kind of trade-off that makes raw layout attractive as a baseline.

The downside is that raw layout often forces the kernel to do the hardest work on every token: unpack nibbles, fetch scales and mins, compute addresses, and juggle irregular memory access. On Volta, that can show up as L1/TEX, LSU, and integer-pipe pressure even when DRAM throughput is modest. If Nsight says you are not DRAM-bound, raw layout is usually the first thing to question, but only after checking register count and shared memory because those can cap occupancy just as hard.
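
To make that per-token cost concrete, here is a minimal host-side Python sketch of dequantizing one 4-bit block. It assumes a simplified block of 32 weights with a single fp32 scale and a single fp32 min, which is not the real GGUF Q4_K super-block layout, and the names `make_raw_block` and `dequant_raw_block` are hypothetical. A GPU kernel would perform the same three steps (header fetch, nibble unpack, scale and shift) in registers.

```python
import struct

BLOCK = 32  # weights per simplified block (assumption, not the real Q4_K super-block)

def make_raw_block(scale, minimum, quants):
    """Pack one block as: fp32 scale, fp32 min, then 16 bytes of nibbles."""
    assert len(quants) == BLOCK and all(0 <= q < 16 for q in quants)
    # Low nibble of byte i holds element i, high nibble holds element i + 16.
    nibbles = bytes(quants[i] | (quants[i + 16] << 4) for i in range(BLOCK // 2))
    return struct.pack("<ff", scale, minimum) + nibbles

def dequant_raw_block(buf, offset=0):
    """Per-token work on the raw layout: header fetch, nibble unpack, scale and shift."""
    scale, minimum = struct.unpack_from("<ff", buf, offset)
    data = buf[offset + 8 : offset + 8 + BLOCK // 2]
    low = [b & 0x0F for b in data]   # elements 0..15
    high = [b >> 4 for b in data]    # elements 16..31
    return [scale * q - minimum for q in low + high]

quants = list(range(16)) * 2
block = make_raw_block(0.5, 1.0, quants)
weights = dequant_raw_block(block)
```

Every step in `dequant_raw_block` repeats for every block on every token, which is the cost a prepacked layout tries to move out of the inner loop.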

Prepacked weight cache

A prepacked cache is most useful when the same weights are reused token after token and the decode batch is small enough that you want each GEMM to be as simple as possible. For M=1..4, that often means reorganizing the data so a warp can read contiguous bytes, with scales and mins separated from the nibble stream or expanded into fp16/fp32 if the memory budget allows it. The goal is not to make the model smaller; it is to make the inner loop less branchy and less instruction-heavy.
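
One way to picture the repack step is the host-side Python sketch below. It uses a simplified block (32 weights, one fp32 scale, one fp32 min, interleaved with the nibbles) as a stand-in for the raw layout, which is an assumption and not the actual Q4_K format. The functions `prepack` and `dequant_packed` are hypothetical names. The one-time pass splits the interleaved blocks into a contiguous nibble stream plus separate scale and min arrays.

```python
import struct

BLOCK = 32                    # weights per simplified block (assumption)
RAW_BYTES = 8 + BLOCK // 2    # fp32 scale + fp32 min + 16 nibble bytes

def prepack(raw, n_blocks):
    """One-time pass: split interleaved blocks into separate arrays."""
    scales, mins, nibbles = [], [], bytearray()
    for b in range(n_blocks):
        off = b * RAW_BYTES
        scale, minimum = struct.unpack_from("<ff", raw, off)
        scales.append(scale)
        mins.append(minimum)
        nibbles += raw[off + 8 : off + RAW_BYTES]
    return scales, mins, bytes(nibbles)

def dequant_packed(scales, mins, nibbles, b):
    """Inner-loop view: nibble bytes are contiguous, metadata is a plain array read."""
    half = BLOCK // 2
    data = nibbles[b * half : (b + 1) * half]
    quants = [x & 0x0F for x in data] + [x >> 4 for x in data]
    return [scales[b] * q - mins[b] for q in quants]

# Two toy blocks in the interleaved "raw" form; 0x21 means low nibble 1, high nibble 2.
raw = b"".join(
    struct.pack("<ff", s, m) + bytes([0x21] * (BLOCK // 2))
    for s, m in [(0.5, 1.0), (2.0, 0.25)]
)
scales, mins, nibbles = prepack(raw, 2)
```

The packed form stores the same numbers as the raw form. Only the placement changes, so the nibble stream can be read without stepping over headers.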

On V100, the best cache layout is usually the one that matches your kernel tile shape, not the one that looks neat in storage. K-major tiling tends to help when the kernel streams through input channels in warp-sized chunks, while N-major packing can help if output columns are assigned in a way that lets threads reuse the same dequantized block. In practice, a hybrid layout that keeps quant blocks contiguous, stores scales/mins separately or in a compact side array, and pre-expands only the values reused across many MACs is often the sweet spot.
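
The difference between the two orderings can be sketched with plain index arithmetic. The definitions here are assumptions made for illustration: "K-major" means the quant blocks of one output row are stored next to each other, and "N-major" means the blocks that share a K position are stored next to each other. The function names and the toy shape are hypothetical.

```python
def k_major_index(n, kb, n_rows, k_blocks):
    """Blocks of one output row are adjacent: walking along K is stride 1."""
    return n * k_blocks + kb

def n_major_index(n, kb, n_rows, k_blocks):
    """Blocks of one K position are adjacent: walking across rows is stride 1."""
    return kb * n_rows + n

def strides(indices):
    """Distance, in blocks, between consecutive accesses."""
    return [b - a for a, b in zip(indices, indices[1:])]

N_ROWS, K_BLOCKS = 8, 4   # toy shape, hypothetical

# One output row streaming through K under each ordering.
along_k_kmajor = [k_major_index(2, kb, N_ROWS, K_BLOCKS) for kb in range(K_BLOCKS)]
along_k_nmajor = [n_major_index(2, kb, N_ROWS, K_BLOCKS) for kb in range(K_BLOCKS)]
# Neighboring output rows reading the same K position.
across_n_nmajor = [n_major_index(n, 1, N_ROWS, K_BLOCKS) for n in range(N_ROWS)]
```

Calling `strides` on each list shows which traversal is contiguous under which ordering. That is the property to match against how the kernel assigns threads to rows and K chunks.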

What matters most on V100

For a Q4-style decode kernel on Volta, the biggest wins usually come from reducing instruction count and fixing access patterns before touching cache modifiers. If your kernel is already around 48 registers per thread with about 16 KB shared memory per block, then occupancy is only part of the story; the more important question is whether the unpack and address arithmetic are inflating the critical path. In that situation, shaving a few integer ops per block can matter more than a small change in L1 policy.
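
A rough occupancy estimate for those numbers can be sketched as below. The per-SM limits are the published compute capability 7.0 figures (64K registers, 2048 threads, 32 blocks, up to 96 KB shared memory). The block size of 256 threads is an assumption, since the text does not state it, and the estimate ignores allocation granularity and the shared-memory carveout setting. For real numbers, the CUDA occupancy API (`cudaOccupancyMaxActiveBlocksPerMultiprocessor`) or Nsight Compute is the better source.

```python
# Per-SM limits for compute capability 7.0 (V100).
REGS_PER_SM = 65536
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 32
SMEM_PER_SM = 96 * 1024

def blocks_per_sm(regs_per_thread, smem_per_block, threads_per_block):
    """Resident blocks per SM under the tightest of the four limits."""
    by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)
    by_smem = SMEM_PER_SM // smem_per_block if smem_per_block else MAX_BLOCKS_PER_SM
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    return min(by_regs, by_smem, by_threads, MAX_BLOCKS_PER_SM)

def occupancy(regs_per_thread, smem_per_block, threads_per_block):
    """Fraction of the SM's thread slots that are occupied."""
    blocks = blocks_per_sm(regs_per_thread, smem_per_block, threads_per_block)
    return blocks * threads_per_block / MAX_THREADS_PER_SM

# 48 registers/thread and 16 KB shared memory/block from the text;
# 256 threads/block is an assumed launch configuration.
resident = blocks_per_sm(48, 16 * 1024, 256)
occ = occupancy(48, 16 * 1024, 256)
```

Under these assumptions registers are the binding limit (5 resident blocks, against 6 allowed by shared memory), so occupancy is capped but not collapsed. That is consistent with looking at instruction count next.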

PTX load cache operators such as .cg (cache in L2 only) and .ca (cache in L1 and L2) are worth testing, but they are rarely the first lever I would pull on V100. They can help if the same metadata is reused across neighboring warps, but they can also backfire by polluting cache or changing locality in ways that do not match your access pattern. Treat them as a microbenchmark pass after you have narrowed down whether the kernel is limited by registers, shared memory, integer unpacking, or memory layout.

LM head and sampling

For greedy decode, copying full vocab logits back to the CPU is usually not the best end-to-end choice once the model body is optimized. If the LM head is already taking roughly 8% and logits sampling another 4%, then a GPU-side argmax that copies back only the token ID is the cleaner path. That avoids a large host round-trip and keeps the decode loop on device, which is especially valuable when batch size is only 4 and latency matters more than bulk throughput.

If you want the least invasive change, keep cuBLAS for the LM head and add a separate GPU reduction kernel for argmax or top-k. If you want the best latency, fuse LM head with the reduction so you never materialize the full logits tensor in a way the CPU has to see. The right answer depends on how much engineering risk you can take, but for production decode on V100, the host copy is usually the part least worth keeping.
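
The separate reduction kernel can be pictured as a two-stage argmax. The Python sketch below emulates it on the host: each "block" reduces its slice of the logits to one (value, index) pair, and a second pass reduces the partials. The function names, the block size, and the vocabulary size of 32,000 are hypothetical. Ties resolve to the lowest index here, which is an assumption worth matching to the CPU sampler if token equality matters.

```python
def block_argmax(logits, block_size):
    """Stage 1: each 'thread block' reduces its slice to one (value, index) pair."""
    partials = []
    for start in range(0, len(logits), block_size):
        chunk = logits[start : start + block_size]
        best = max(range(len(chunk)), key=lambda i: chunk[i])  # first max wins ties
        partials.append((chunk[best], start + best))
    return partials

def final_argmax(partials):
    """Stage 2: reduce the partials; strict '>' keeps the lowest index on ties."""
    best_val, best_idx = partials[0]
    for val, idx in partials[1:]:
        if val > best_val:
            best_val, best_idx = val, idx
    return best_idx

def greedy_token(logits, block_size=1024):
    return final_argmax(block_argmax(logits, block_size))

def bytes_to_host(batch, vocab, full_logits, elem_bytes=4):
    """Device-to-host bytes per decode step for fp32 logits or 4-byte token IDs."""
    return batch * vocab * elem_bytes if full_logits else batch * elem_bytes

logits = [0.1 * ((i * 37) % 101) for i in range(5000)]
token = greedy_token(logits)
```

With a batch of 4 and the assumed 32,000-entry vocabulary, `bytes_to_host` reports 512,000 bytes per step for full fp32 logits and 16 bytes for token IDs alone.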

When to pick what

Pick raw GGUF if VRAM is tight, your cache budget is already forcing trade-offs, and you need the safest path that preserves token equality with minimal memory overhead. It is the better default when the model must coexist with a large KV cache or when you are still isolating whether the bottleneck is in unpacking, occupancy, or something else.

Pick a prepacked cache if the same weights stay hot across many decode steps and you have enough memory headroom to store a layout that matches your kernel. This is the better choice for engineers who are willing to trade some load-time complexity and VRAM for a simpler inner loop, especially when Nsight shows the kernel is instruction- and address-pressure limited rather than bandwidth-limited.
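
Whether the headroom exists can be estimated with simple arithmetic. The sketch below assumes the ggml Q4_K block as I understand it: 256 weights in 144 bytes (two fp16 values, 12 bytes of packed 6-bit scales and mins, 128 bytes of nibbles). It compares that with a hypothetical packed block that expands the 8 scales and 8 mins to fp16 or fp32. The packed format, the function names, and the 4096 x 11008 tensor shape are illustrative assumptions, not a description of any existing cache.

```python
WEIGHTS_PER_BLOCK = 256
RAW_BLOCK_BYTES = 2 + 2 + 12 + 128   # fp16 d, fp16 dmin, packed scales/mins, nibbles
SUB_BLOCKS = 8                       # sub-blocks of 32 weights each

def packed_block_bytes(meta_bytes):
    """Nibbles plus one expanded scale and one expanded min per sub-block."""
    return 128 + 2 * SUB_BLOCKS * meta_bytes

def overhead(meta_bytes):
    """Relative size increase of the packed block over the raw block."""
    return packed_block_bytes(meta_bytes) / RAW_BLOCK_BYTES - 1.0

def cache_fits(n_weights, meta_bytes, free_bytes, reserve_bytes):
    """True if the packed tensor fits while leaving reserve_bytes for KV cache etc."""
    blocks = n_weights // WEIGHTS_PER_BLOCK
    return blocks * packed_block_bytes(meta_bytes) + reserve_bytes <= free_bytes

MIB = 2 ** 20
n_weights = 4096 * 11008             # hypothetical projection shape
fp16_overhead = overhead(2)
fp32_overhead = overhead(4)
ok_with_64 = cache_fits(n_weights, 2, 64 * MIB, 32 * MIB)
ok_with_48 = cache_fits(n_weights, 2, 48 * MIB, 32 * MIB)
```

Under these assumptions fp16 metadata costs about 11% more than the raw block and fp32 metadata about 33% more, which is why expanding only the values that are reused heavily tends to be the safer choice.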

Pick GPU-side argmax and token-only return if your current decode path still copies full logits to the CPU. That change usually gives a cleaner latency win than more tinkering with the host sampler, and it fits the small-batch, production decode profile described here.

The default pick on V100 is a prepacked cache for the hottest GEMM path, but the answer flips back to raw GGUF when memory pressure is so high that the cache would evict more valuable weights or KV space.