[MODEL] 6 min read | OraCore Editors

DiffusionGemma runs fast on NVIDIA RTX and DGX

Google DeepMind’s DiffusionGemma generates text in parallel, and NVIDIA says RTX and DGX hardware can run it up to 4x faster.

Google DeepMind released DiffusionGemma on June 10, 2026, and NVIDIA says the model is already tuned for local inference on GeForce RTX GPUs, RTX PRO workstations, and DGX Spark systems. The pitch is simple: instead of generating one token after another, the model fills in blocks of text in parallel, which changes the speed profile for local AI work.

That matters because the announcement is not a new chatbot demo. It is about a different inference style that can make single-user AI feel much snappier on hardware developers can actually buy and keep on a desk.

| Claim | Number | What it means |
| --- | --- | --- |
| Tokens denoised per step | Up to 256 | DiffusionGemma fills a block of text in parallel |
| Model size | 26B | Built on Gemma 4 mixture-of-experts |
| Active parameters per step | 3.8B | Only part of the model runs each step |
| Speed on H100 | 1,000 tokens/sec | NVIDIA's reported inference rate |
| Speed on DGX Spark | 150 tokens/sec | Reported deskside performance |
| Speed on DGX Station | Up to 2,000 tokens/sec | Reported top-end local inference rate |

Parallel generation changes the latency game

Most large language models are autoregressive. They pick the next token, then the one after that, and so on until the answer is complete. That process is predictable, but it also caps how fast a model can respond when you want something interactive.

DiffusionGemma takes the diffusion route instead. It starts from noise and refines a whole block of text at once, denoising up to 256 tokens per step. In practice, that means the model is built for the kind of short-turn, high-feedback work developers do all day: drafting prompts, iterating on agent plans, and testing local assistants without waiting around for each word to appear.
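To make the difference concrete, here is a rough sketch that only counts forward passes. It is not DiffusionGemma's actual algorithm: the function names are hypothetical, and `steps_per_block` (how many denoising steps a block needs) is an assumption, since the announcement does not give that number.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive decoder runs one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(
    num_tokens: int,
    block_size: int = 256,      # up to 256 tokens per step, per the announcement
    steps_per_block: int = 1,   # ASSUMPTION: refinement steps needed per block
) -> int:
    """Forward passes when each pass refines a whole block of tokens."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    n = 512
    print("autoregressive passes:", autoregressive_passes(n))
    print("block diffusion passes:", block_diffusion_passes(n, steps_per_block=16))
```

A pass over a 256-token block costs more than a pass over a single token, so a lower pass count does not translate directly into a speedup. The sketch only shows why the sequential bottleneck shrinks.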

NVIDIA’s blog frames the hardware angle clearly. Token-by-token generation is memory-bound, while block generation pushes more of the work into compute, which is where GPUs excel. That is why the company is tying this model so tightly to its own stack.

  • Up to 256 tokens are denoised per diffusion step instead of one token at a time.
  • The model is based on Gemma 4, a 26-billion-parameter mixture-of-experts system.
  • Only 3.8 billion parameters activate on each step, which keeps the active workload smaller than the full model size.
  • NVIDIA says the model can run up to 4x faster than an equivalent autoregressive model in the same single-user setting.
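The memory-bound versus compute-bound argument can be approximated with a back-of-the-envelope calculation. The sketch below is a simplification under stated assumptions, not NVIDIA's analysis: it assumes 2 bytes per parameter, 2 FLOPs per active parameter per token, and that the active weights are read once per forward pass. It ignores attention caches and the fact that different tokens in a block may route to different experts, which would raise the bytes read.

```python
TOTAL_PARAMS = 26e9     # model size from the announcement
ACTIVE_PARAMS = 3.8e9   # active parameters per step from the announcement


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    ASSUMPTIONS: 2 FLOPs per parameter per token, and weights are read
    once per pass no matter how many tokens that pass processes.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


def active_fraction() -> float:
    """Share of the model's parameters that run on each step."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


if __name__ == "__main__":
    print(f"active fraction: {active_fraction():.1%}")
    print("FLOPs per byte, 1 token per pass:", arithmetic_intensity(1))
    print("FLOPs per byte, 256 tokens per pass:", arithmetic_intensity(256))
```

Under these assumptions, a single-token pass does about one FLOP per byte of weights moved, while a 256-token pass does about 256. That is the sense in which block generation shifts work from memory bandwidth toward compute.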

NVIDIA’s hardware pitch is about local speed

The most interesting part of this announcement is not the model itself. It is the way NVIDIA maps the model onto its hardware portfolio, from consumer GPUs to deskside systems to workstation-class machines.

On a single DGX Spark with the GB10 Grace Blackwell Superchip and 128GB of unified memory, NVIDIA says DiffusionGemma reaches 150 tokens per second. On DGX Station, the company claims up to 2,000 tokens per second and 748GB of coherent memory. On an RTX PRO 6000 workstation, the pitch is local low-latency generation for professional workflows. On GeForce RTX GPUs, support is coming through the standard software stack.

“The ultimate goal of AI is to understand and replicate intelligence.” — Jensen Huang, NVIDIA GTC 2024 keynote

That quote from Jensen Huang fits this release better than the usual marketing line. NVIDIA is betting that local AI matters when models are fast enough to keep up with a developer’s train of thought, and DiffusionGemma is meant to prove it.

  • H100 Tensor Core GPU: 1,000 tokens/sec
  • DGX Spark: 150 tokens/sec
  • DGX Station: up to 2,000 tokens/sec
  • Equivalent autoregressive model: up to 4x slower than DiffusionGemma in the same single-user regime, according to NVIDIA
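Tokens per second is easier to judge as wait time. The short sketch below converts the reported rates into seconds for one reply. The 500-token reply length is an assumption chosen for illustration, and the rates are NVIDIA's reported figures, not independent measurements.

```python
# Reported rates from the announcement, in tokens per second.
REPORTED_TOKENS_PER_SEC = {
    "DGX Spark": 150,
    "H100": 1000,
    "DGX Station": 2000,  # reported as an "up to" figure
}


def seconds_for_reply(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate a reply at a steady throughput."""
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    reply_tokens = 500  # ASSUMPTION: illustrative reply length
    for name, rate in REPORTED_TOKENS_PER_SEC.items():
        wait = seconds_for_reply(reply_tokens, rate)
        print(f"{name}: {wait:.2f} s for a {reply_tokens}-token reply")
```

At the reported rates, that reply takes roughly 3.3 seconds on DGX Spark, half a second on H100, and a quarter of a second on DGX Station.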

The software stack matters as much as the model

Model speed alone does not make local AI useful. The software path has to be straightforward too, and NVIDIA is trying to remove friction on that side with support across Hugging Face Transformers, vLLM, and Unsloth.

That matters because local AI adoption usually dies in setup, not in benchmarks. If a model needs a custom runtime, special kernels, or weeks of tweaking, most developers move on. NVIDIA is trying to make the path boring: pull the model, run it on RTX or DGX Spark, and start testing.

The Apache 2.0 license also matters. Open weights do not magically make a model easy to deploy, but they do make it easier to inspect, adapt, and ship inside products that cannot depend on a cloud API for every token.

For teams building assistants, coding tools, or agent loops, the practical question is whether the model is fast enough to feel responsive when it runs locally. NVIDIA is arguing that DiffusionGemma clears that bar on its own hardware, and the reported numbers are strong enough to make that claim worth testing.
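One way to test the claim on your own machine is a small timing harness. The sketch below is runtime-agnostic: `generate` is a hypothetical callable that you would wrap around whichever backend you use, and `fake_generate` is a stand-in so the example runs without a model.

```python
import time
from typing import Callable, Sequence


def measure_tokens_per_sec(
    generate: Callable[[str], Sequence],
    prompt: str,
    runs: int = 3,
) -> float:
    """Average generated tokens per second over several runs.

    `generate` is a HYPOTHETICAL wrapper: it takes a prompt and returns
    the sequence of generated tokens from your chosen runtime.
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_seconds if total_seconds > 0 else float("inf")


def fake_generate(prompt: str) -> list:
    """Stand-in backend: sleeps briefly, then echoes the prompt's words."""
    time.sleep(0.01)
    return prompt.split()


if __name__ == "__main__":
    rate = measure_tokens_per_sec(fake_generate, "plan the next agent step", runs=3)
    print(f"{rate:.0f} tokens/sec")
```

Measure the same prompts on an autoregressive model of similar size to see how close your setup comes to the reported single-user speedup.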

What this means for developers right now

If you are building AI tools on a workstation, this release is a reminder that inference style matters as much as parameter count. A 26B model that activates 3.8B parameters per step and fills blocks of up to 256 tokens can feel very different from a standard decoder model, especially when the loop is interactive.

There is also a broader strategic angle here. NVIDIA is not just selling faster GPUs; it is shaping the default path for local AI by making sure the model, runtime, and hardware arrive together. That is a smart move for the company, and it gives developers a cleaner way to experiment with low-latency generation without waiting for cloud capacity.

If you want to try it, NVIDIA points to Transformers for quick testing, vLLM for serving, and build.nvidia.com for hosted API access. The next thing to watch is whether diffusion-based text generation becomes a standard option for local assistants, or stays a niche technique for teams that care deeply about latency and hardware efficiency.