[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-diffusiongemma-runs-fast-on-nvidia-rtx-dgx-en":3,"article-related-diffusiongemma-runs-fast-on-nvidia-rtx-dgx-en":30,"series-model-release-8fe33efd-3a68-4fe3-935f-f0f5d3f058fc":75},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"8fe33efd-3a68-4fe3-935f-f0f5d3f058fc","diffusiongemma-runs-fast-on-nvidia-rtx-dgx-en","DiffusionGemma runs fast on NVIDIA RTX and DGX","\u003Cp data-speakable=\"summary\">\u003Ca href=\"\u002Ftag\u002Fgoogle-deepmind\">Google DeepMind\u003C\u002Fa>’s DiffusionGemma generates text in parallel and runs fastest on \u003Ca href=\"\u002Ftag\u002Fnvidia\">NVIDIA\u003C\u002Fa> RTX and DGX hardware.\u003C\u002Fp>\u003Cp>\u003Ca href=\"\u002Ftag\u002Fgoogle\">Google\u003C\u002Fa> DeepMind released \u003Ca href=\"https:\u002F\u002Fdeepmind.google\u002Ftechnologies\u002Fgemma\u002F\" target=\"_blank\" rel=\"noopener\">DiffusionGemma\u003C\u002Fa> on June 10, 2026, and \u003Ca href=\"https:\u002F\u002Fblogs.nvidia.com\" target=\"_blank\" rel=\"noopener\">NVIDIA\u003C\u002Fa> says the model is already tuned for local \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa> on \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fgeforce\u002F\" target=\"_blank\" rel=\"noopener\">GeForce RTX\u003C\u002Fa> GPUs, \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fworkstations\u002Frtx-pro\u002F\" target=\"_blank\" rel=\"noopener\">RTX PRO\u003C\u002Fa> workstations, and \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdata-center\u002Fdgx-spark\u002F\" target=\"_blank\" rel=\"noopener\">DGX Spark\u003C\u002Fa> systems. 
The pitch is simple: instead of generating one token after another, the model fills in blocks of text in parallel, which changes the speed profile for local AI work.\u003C\u002Fp>\u003Cp>That matters because this release is not about a new chatbot demo. It is about a different inference style that can make single-user AI feel much snappier on hardware developers can actually buy and keep on a desk.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Claim\u003C\u002Fth>\u003Cth>Number\u003C\u002Fth>\u003Cth>What it means\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Tokens denoised per step\u003C\u002Ftd>\u003Ctd>256\u003C\u002Ftd>\u003Ctd>DiffusionGemma fills a block of text in parallel\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Model size\u003C\u002Ftd>\u003Ctd>26B\u003C\u002Ftd>\u003Ctd>Built on Gemma 4 mixture-of-experts\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Active parameters per step\u003C\u002Ftd>\u003Ctd>3.8B\u003C\u002Ftd>\u003Ctd>Only part of the model runs each step\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Speed on H100\u003C\u002Ftd>\u003Ctd>1,000 tokens\u002Fsec\u003C\u002Ftd>\u003Ctd>NVIDIA’s reported local inference rate\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Speed on DGX Spark\u003C\u002Ftd>\u003Ctd>150 tokens\u002Fsec\u003C\u002Ftd>\u003Ctd>Reported deskside performance\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Speed on DGX Station\u003C\u002Ftd>\u003Ctd>2,000 tokens\u002Fsec\u003C\u002Ftd>\u003Ctd>Reported top-end local inference rate\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>Parallel generation changes the latency game\u003C\u002Fh2>\u003Cp>Most large language models are autoregressive. They pick the next token, then the next one, then the next one again. 
That process is predictable, but it also creates a ceiling on how fast a model can answer when you want something interactive.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782570781225-7xo9.png\" alt=\"DiffusionGemma runs fast on NVIDIA RTX and DGX\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>DiffusionGemma takes the diffusion route instead. It starts from noise and refines a whole block of text at once, denoising up to 256 tokens per step. In practice, that means the model is built for the kind of short-turn, high-feedback work developers do all day: drafting prompts, iterating on \u003Ca href=\"\u002Ftag\u002Fagent\">agent\u003C\u002Fa> plans, and testing local assistants without waiting around for each word to appear.\u003C\u002Fp>\u003Cp>NVIDIA’s blog frames the hardware angle clearly. Token-by-token generation is memory-bound, while block generation pushes more of the work into compute, which is where GPUs excel. That is why the company is tying this model so tightly to its own stack.\u003C\u002Fp>\u003Cul>\u003Cli>256 tokens are processed per diffusion step instead of one token at a time.\u003C\u002Fli>\u003Cli>The model is based on Gemma 4, a 26-billion-parameter mixture-of-experts system.\u003C\u002Fli>\u003Cli>Only 3.8 billion parameters activate on each step, which keeps the active workload smaller than the full model size.\u003C\u002Fli>\u003Cli>NVIDIA says the model can run up to 4x faster than an equivalent autoregressive model in the same single-user setting.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>NVIDIA’s hardware pitch is about local speed\u003C\u002Fh2>\u003Cp>The most interesting part of this announcement is not the model itself. 
It is the way NVIDIA maps the model onto its hardware portfolio, from consumer GPUs to deskside systems to workstation-class machines.\u003C\u002Fp>\u003Cp>On a single \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdata-center\u002Fdgx-spark\u002F\" target=\"_blank\" rel=\"noopener\">DGX Spark\u003C\u002Fa> with the GB10 Grace Blackwell Superchip and 128GB of unified memory, NVIDIA says DiffusionGemma reaches 150 tokens per second. On \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdata-center\u002Fdgx-station\u002F\" target=\"_blank\" rel=\"noopener\">DGX Station\u003C\u002Fa>, the company claims up to 2,000 tokens per second and 748GB of coherent memory. On an \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdata-center\u002Frtx-pro\u002F\" target=\"_blank\" rel=\"noopener\">RTX PRO 6000\u003C\u002Fa> workstation, the pitch is local low-latency generation for professional workflows. On GeForce RTX GPUs, support is coming through the standard software stack.\u003C\u002Fp>\u003Cblockquote>“The ultimate goal of AI is to understand and replicate intelligence.” — Jensen Huang, NVIDIA GTC 2024 keynote\u003C\u002Fblockquote>\u003Cp>That quote from \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtc24-s62798\u002F\" target=\"_blank\" rel=\"noopener\">Jensen Huang\u003C\u002Fa> fits this release better than the usual marketing line. 
NVIDIA is betting that local AI matters when models are fast enough to keep up with a developer’s train of thought, and DiffusionGemma is meant to prove it.\u003C\u002Fp>\u003Cul>\u003Cli>H100 Tensor Core GPU: 1,000 tokens\u002Fsec\u003C\u002Fli>\u003Cli>DGX Spark: 150 tokens\u002Fsec\u003C\u002Fli>\u003Cli>DGX Station: up to 2,000 tokens\u002Fsec\u003C\u002Fli>\u003Cli>Equivalent autoregressive model: about 4x slower in the same single-user regime\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>The software stack matters as much as the model\u003C\u002Fh2>\u003Cp>Model speed alone does not make local AI useful. The software path has to be straightforward too, and NVIDIA is trying to remove friction on that side with support across \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex\" target=\"_blank\" rel=\"noopener\">Hugging Face Transformers\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\" target=\"_blank\" rel=\"noopener\">vLLM\u003C\u002Fa>, and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Funslothai\u002Funsloth\" target=\"_blank\" rel=\"noopener\">Unsloth\u003C\u002Fa>.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782570777691-4b4n.png\" alt=\"DiffusionGemma runs fast on NVIDIA RTX and DGX\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That matters because local AI adoption usually dies in setup, not in benchmarks. If a model needs a custom runtime, special kernels, or weeks of tweaking, most developers move on. NVIDIA is trying to make the path boring: pull the model, run it on RTX or DGX Spark, and start testing.\u003C\u002Fp>\u003Cp>The Apache 2.0 license also matters. 
Open weights do not magically make a model easy to deploy, but they do make it easier to inspect, adapt, and ship inside products that cannot depend on a cloud API for every token.\u003C\u002Fp>\u003Cp>For teams building assistants, coding tools, or agent loops, the practical question is whether the model is fast enough to feel local. NVIDIA is arguing that DiffusionGemma clears that bar on its own hardware, and the reported numbers are strong enough to make that claim worth testing.\u003C\u002Fp>\u003Ch2>What this means for developers right now\u003C\u002Fh2>\u003Cp>If you are building AI tools on a workstation, this release is a reminder that inference style matters as much as parameter count. A 26B model that activates 3.8B parameters per step and fills 256-token blocks can feel very different from a standard decoder model, especially when the loop is interactive.\u003C\u002Fp>\u003Cp>There is also a broader strategic angle here. NVIDIA is not just selling faster GPUs; it is shaping the default path for local AI by making sure the model, runtime, and hardware arrive together. That is a smart move for the company, and it gives developers a cleaner way to experiment with low-latency generation without waiting for cloud capacity.\u003C\u002Fp>\u003Cp>If you want to try it, NVIDIA points to \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex\" target=\"_blank\" rel=\"noopener\">Transformers\u003C\u002Fa> for quick testing, \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\" target=\"_blank\" rel=\"noopener\">vLLM\u003C\u002Fa> for serving, and \u003Ca href=\"https:\u002F\u002Fbuild.nvidia.com\" target=\"_blank\" rel=\"noopener\">build.nvidia.com\u003C\u002Fa> for hosted API access. 
The next thing to watch is whether diffusion-based text generation becomes a standard option for local assistants, or stays a niche technique for teams that care deeply about latency and hardware efficiency.\u003C\u002Fp>","Google DeepMind’s DiffusionGemma generates text in parallel, and NVIDIA says RTX and DGX hardware can run it up to 4x faster.","blogs.nvidia.com","https:\u002F\u002Fblogs.nvidia.com\u002Fblog\u002Frtx-ai-garage-local-gemma-diffusion\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782570781225-7xo9.png","model-release","en","9258a3d6-b70c-493d-84b9-c791df86f495",[17,18,19,20,21],"DiffusionGemma","NVIDIA RTX","DGX Spark","Google DeepMind","local AI",[23,24,25],"DiffusionGemma generates text in parallel instead of one token at a time.","NVIDIA says the model reaches up to 2,000 tokens\u002Fsec on DGX Station and 1,000 on H100.","The model is open under Apache 2.0 and already has support in Transformers, vLLM, and Unsloth.",0,"2026-06-27T14:32:34.997765+00:00","2026-06-27T14:32:34.985+00:00","1bae1133-d241-4581-9332-fbf39690c319",{"tags":31,"relatedLang":34,"relatedPosts":38},[32],{"name":20,"slug":33},"google-deepmind",{"id":15,"slug":35,"title":36,"language":37},"diffusiongemma-runs-fast-on-nvidia-rtx-dgx-zh","DiffusionGemma 在 RTX 與 DGX 跑很快","zh",[39,45,51,57,63,69],{"id":40,"slug":41,"title":42,"cover_image":43,"image_url":43,"created_at":44,"category":13},"35368bfc-0dbe-45dc-b422-87b1bd350ac0","google-openrl-llm-fine-tuning-kubernetes-en","Google OpenRL brings RL fine-tuning to Kubernetes","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782572578249-jlty.png","2026-06-27T15:02:27.543012+00:00",{"id":46,"slug":47,"title":48,"cover_image":49,"image_url":49,"created_at":50,"category":13},"ce53e9e6-c310-4434-9971-4f4f3a274577","glm-52-beats-gpt-55-coding-benchmarks-en","GLM-5.2 beats GPT-5.5 
on coding tests","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782564469790-2zyi.png","2026-06-27T12:47:27.758841+00:00",{"id":52,"slug":53,"title":54,"cover_image":55,"image_url":55,"created_at":56,"category":13},"730a2199-d009-4a27-8f00-8e9ea6a4b02e","openai-gpt-56-rollout-us-request-en","OpenAI narrows GPT-5.6 rollout after U.S. request","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782555472898-iuil.png","2026-06-27T10:17:28.937624+00:00",{"id":58,"slug":59,"title":60,"cover_image":61,"image_url":61,"created_at":62,"category":13},"cdd8e455-ff2d-41a2-b049-61f96d568b32","ubuntu-2610-snapshot-2-gnome-50-kernel-70-en","Ubuntu 26.10 Snapshot 2 adds GNOME 50 and kernel 7.0","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782536575781-39jk.png","2026-06-27T05:02:31.246533+00:00",{"id":64,"slug":65,"title":66,"cover_image":67,"image_url":67,"created_at":68,"category":13},"9d72ed34-e7be-4628-919c-6591cad14032","claude-fable-5-mythos-5-launch-1m-context-pricing-en","Claude Fable 5 launches with 1M context, $10\u002F$50 pricing","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782518558846-6mxu.png","2026-06-27T00:02:13.485542+00:00",{"id":70,"slug":71,"title":72,"cover_image":73,"image_url":73,"created_at":74,"category":13},"15ededcb-01f7-408c-a9ad-cd71712b010b","google-gemini-35-pro-july-release-delay-en","Google Pushes Gemini 3.5 Pro to 
July","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782439377509-lxcl.png","2026-06-26T02:02:28.584771+00:00",[76,81,86,91,96,101,106,111,116,121],{"id":77,"slug":78,"title":79,"created_at":80},"d4cffde7-9b50-4cc7-bb68-8bc9e3b15477","nvidia-rubin-ai-supercomputer-en","NVIDIA Unveils Rubin: A Leap in AI Supercomputing","2026-03-25T16:24:35.155565+00:00",{"id":82,"slug":83,"title":84,"created_at":85},"eab919b9-fbac-4048-89fc-afad6749ccef","google-gemini-ai-innovations-2026-en","Google's AI Leap with Gemini Innovations in 2026","2026-03-25T16:27:18.841838+00:00",{"id":87,"slug":88,"title":89,"created_at":90},"5f5cfc67-3384-4816-a8f6-19e44d90113d","gap-google-gemini-ai-checkout-en","Gap Teams Up with Google Gemini for AI-Driven Checkout","2026-03-25T16:27:46.483272+00:00",{"id":92,"slug":93,"title":94,"created_at":95},"f6d04567-47f6-49ec-804c-52e61ab91225","ai-model-release-wave-march-2026-en","Navigating the AI Model Release Wave of March 2026","2026-03-25T16:28:45.409716+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"895c150c-569e-4fdf-939d-dade785c990e","small-language-models-transform-ai-en","Small Language Models: Llama 3.2 and Phi-3 Transform AI","2026-03-25T16:30:26.688313+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"38eb1d26-d961-4fd3-ae12-9c4089680f5f","midjourney-v8-alpha-features-pricing-en","Midjourney V8 Alpha: A Deep Dive into Its Features and Pricing","2026-03-26T01:25:36.387587+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"bf36bb9e-3444-4fb8-ab19-0df6bc9d8271","rag-2026-indispensable-ai-bridge-en","RAG in 2026: The Indispensable AI Bridge","2026-03-26T01:28:34.472046+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"60881d6d-2310-44ef-b1fb-7f98e9dd2f0e","xiaomi-mimo-trio-agents-robots-voice-en","Xiaomi’s MiMo trio targets agents, robots, and 
voice","2026-03-28T03:05:08.899895+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"f063d8d1-41d1-4de4-8ebc-6c40511b9369","xiaomi-mimo-v2-pro-1t-moe-agents-en","Xiaomi MiMo-V2-Pro: 1T MoE Model for Agents","2026-03-28T03:06:19.238032+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"a1379e9a-6785-4ff5-9b0a-8cff55f8264f","cursor-composer-2-started-from-kimi-en","Cursor’s Composer 2 started from Kimi","2026-03-28T03:11:59.132398+00:00"]