6 min read · OraCore Editors

Blackwell’s MLPerf sweep shows why training speeds up

Five Blackwell results from MLPerf Training 6.0 show faster training, larger scale, and better reliability for frontier AI teams.

Blackwell led MLPerf Training 6.0 with faster training, larger scale, and stronger reliability.

In MLPerf Training 6.0, NVIDIA Blackwell posted the fastest time to train on all seven benchmarks and scaled to 8,192 GPUs.

| Item | Scale | Reported result |
| --- | --- | --- |
| GB300 NVL72 | Rack-scale | Up to 1.6x faster than GB200 NVL72 |
| DeepSeek-V3 671B | 8,192 GPUs | Fastest time to train at the largest scale |
| Llama 3.1 405B on Azure | 8,192 GPUs | Reference quality in 7.07 minutes |
| DeepSeek-V3 671B on CoreWeave | 8,192 GPUs | Reference quality in 2.02 minutes |
| Higgsfield on Nebius | Cloud deployment | 30% shorter training time |

1. Fastest training across all seven benchmarks

The headline result is simple: NVIDIA was the only platform submitted across every benchmark in MLPerf Training 6.0, and it delivered the fastest time to train in all seven. That matters because MLPerf is a peer-reviewed benchmark suite, so the results are meant to compare real systems, not marketing claims.

For teams choosing training infrastructure, this is the clearest signal in the batch. It says Blackwell is not tuned for one model family or one lab setup. It is being pushed across dense LLMs, mixture-of-experts workloads, and fine-tuning cases with the same goal: finish training sooner.

  • Seven-for-seven fastest time to train
  • Submitted on both GB200 NVL72 and GB300 NVL72
  • Included new MoE workloads: DeepSeek-V3 671B and GPT-OSS-20B

2. GB300 NVL72’s speed jump over GB200 NVL72

Blackwell Ultra matters because it raises the ceiling inside the same rack-scale design. NVIDIA reported that GB300 NVL72 delivered up to 1.6x faster training than GB200 NVL72 at the same scale, driven by higher compute density with NVFP4, more memory, and a higher power ceiling.

That mix is useful when a model is already large enough that small gains in throughput compound into real schedule savings. If you are running long pretraining jobs or repeated fine-tunes, a 1.6x gain can change how many experiments fit into a week.

Key drivers of the GB300 NVL72 gain:

  • Higher compute density with NVFP4
  • Expanded memory capacity
  • Higher power ceiling for sustained performance
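To make the schedule arithmetic concrete, here is a minimal sketch of how a 1.6x speedup changes the number of back-to-back runs that fit into a week. The 40-hour baseline run length is a hypothetical figure chosen for illustration, not a number from the MLPerf results, and the 1.6x factor is NVIDIA's reported "up to" figure, so real gains may be smaller.

```python
import math

HOURS_PER_WEEK = 7 * 24  # 168 hours


def runs_per_week(hours_per_run: float, speedup: float = 1.0) -> int:
    """Whole back-to-back runs that fit in one week at a given speedup."""
    if hours_per_run <= 0 or speedup <= 0:
        raise ValueError("hours_per_run and speedup must be positive")
    effective_hours = hours_per_run / speedup
    return math.floor(HOURS_PER_WEEK / effective_hours)


# Hypothetical baseline: one fine-tune takes 40 hours on the older system.
baseline_hours = 40.0
before = runs_per_week(baseline_hours)              # 168 / 40 = 4.2
after = runs_per_week(baseline_hours, speedup=1.6)  # 168 / 25 = 6.72

print(f"Runs per week before: {before}, after: {after}")
```

Under these assumptions the weekly count goes from 4 to 6 whole runs. The exact change depends on where the run length falls relative to the week boundary.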

3. 8,192-GPU scale for MoE and dense models

Scale is the other half of the story. NVIDIA scaled DeepSeek-V3 671B to 8,192 GPUs on GB200 NVL72 systems, which is the largest Blackwell-based submission in MLPerf Training to date. It also submitted Llama 3.1 405B at 5,120 GPUs, showing that the platform is not only about peak single-job speed but also about how far the cluster can stretch.

The networking piece is what makes that scale practical. Within each rack, fifth-generation NVLink Switches connect all 72 GPUs into a shared pool of compute and memory. For distributed clusters, NVIDIA pairs that with Quantum InfiniBand or Spectrum-X Ethernet, depending on the data center design.

  • DeepSeek-V3 671B: 8,192 GPUs
  • Llama 3.1 405B: 5,120 GPUs
  • Rack-scale NVLink Switch fabric across 72 GPUs
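For a rough sense of what these GPU counts mean in rack terms, the sketch below computes the minimum number of NVL72 racks needed for each run. It assumes every rack contributes all 72 GPUs to the job, which is an assumption for illustration. The actual submissions may have used fewer GPUs per rack, in which case the real rack count would be higher.

```python
import math

GPUS_PER_NVL72_RACK = 72  # one NVLink domain, per the article


def min_racks(total_gpus: int, gpus_per_rack: int = GPUS_PER_NVL72_RACK) -> int:
    """Lower bound on racks, assuming each rack is fully used by the job."""
    if total_gpus <= 0 or gpus_per_rack <= 0:
        raise ValueError("GPU counts must be positive")
    return math.ceil(total_gpus / gpus_per_rack)


deepseek_racks = min_racks(8192)  # DeepSeek-V3 671B run
llama_racks = min_racks(5120)     # Llama 3.1 405B run

print(f"DeepSeek-V3 671B: at least {deepseek_racks} racks")
print(f"Llama 3.1 405B: at least {llama_racks} racks")
```

Everything beyond a single rack has to cross the InfiniBand or Ethernet fabric, which is why the inter-rack network matters at this scale.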

4. Partner results that show the platform in production

The most useful part of NVIDIA's blog post may be the partner examples, because they show Blackwell outside NVIDIA’s own test cases. Cohere reported 3x faster training on GB200 NVL72 for its North agentic AI platform. Midjourney trained v8 on a Blackwell cluster and is now scaling a large fleet of Blackwell Ultra GPUs on CoreWeave for upcoming image and video models.

There are more signs that the platform is already in production use. Microsoft Azure reached reference quality on Llama 3.1 405B in 7.07 minutes, CoreWeave hit 2.02 minutes on DeepSeek-V3 671B with GB300 NVL72, and Nebius said Higgsfield cut training time by 30% while serving 22 million users and generating over 6 million AI outputs per day.

  • Cohere: 3x faster training on GB200 NVL72
  • Midjourney: training and scaling on Blackwell Ultra GPUs
  • Thinking Machines Lab on Google Cloud: 2x faster training and serving
  • Nebius and Higgsfield: 30% shorter training time
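The partner figures mix two conventions, "Nx faster" and "N% shorter", which are not directly comparable. The sketch below converts between them. It assumes "Nx faster" means the same work finishes in 1/N of the time, which is a common reading but is not confirmed by the partners' statements.

```python
def speedup_from_time_cut(fraction_shorter: float) -> float:
    """Convert 'training time is X% shorter' into an equivalent speedup factor."""
    if not 0.0 <= fraction_shorter < 1.0:
        raise ValueError("fraction_shorter must be in [0, 1)")
    return 1.0 / (1.0 - fraction_shorter)


def time_cut_from_speedup(speedup: float) -> float:
    """Convert 'Nx faster' into the fraction of training time saved."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Higgsfield on Nebius: 30% shorter training time.
higgsfield_speedup = speedup_from_time_cut(0.30)

# Cohere on GB200 NVL72: 3x faster training (assumed to mean 1/3 of the time).
cohere_time_cut = time_cut_from_speedup(3.0)

print(f"30% shorter is about {higgsfield_speedup:.2f}x faster")
print(f"3x faster is about {cohere_time_cut:.0%} shorter")
```

On that reading, a 30% reduction corresponds to roughly a 1.43x speedup, and a 3x speedup to roughly two-thirds less training time.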

5. Reliability features for long training runs

Performance only matters if a job survives long enough to finish. NVIDIA frames Blackwell’s reliability story around fewer interruptions and faster recovery. Before a GPU reaches a data center, it goes through more than 30 manufacturing test stages. In operation, the Reliability, Availability and Serviceability (RAS) Engine watches nearly the entire chip, while self-healing logic can route around faults without stopping the workload.

At the cluster level, Spectrum-X Ethernet can reroute around failed links in milliseconds. If a fault does interrupt a job, NVIDIA Resiliency Extension, or NVRx, helps resume from a recent checkpoint instead of restarting from zero. That is especially relevant for runs that span weeks or months across hundreds of thousands of GPUs.

Reliability stack:

  • 30+ manufacturing test stages
  • RAS Engine monitoring
  • Self-healing fault routing
  • Spectrum-X link rerouting
  • NVRx checkpoint recovery
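The value of resuming from a recent checkpoint can be illustrated with simple arithmetic. The sketch below is a generic model of checkpoint recovery, not the NVRx API: the function name, the step counts, and the assumption that checkpoints land at exact multiples of a fixed interval are all hypothetical.

```python
from typing import Optional


def steps_to_redo(failed_at_step: int, checkpoint_every: Optional[int]) -> int:
    """Training steps that must be repeated after a failure.

    failed_at_step: number of steps completed when the job died.
    checkpoint_every: checkpoint interval in steps, or None for no checkpoints.
    """
    if failed_at_step < 0:
        raise ValueError("failed_at_step must be non-negative")
    if checkpoint_every is None:
        # No checkpoint to resume from: restart from step zero.
        return failed_at_step
    if checkpoint_every <= 0:
        raise ValueError("checkpoint_every must be positive")
    # Resume from the last checkpoint at or before the failure.
    return failed_at_step % checkpoint_every


# Hypothetical failure at step 10,450.
without_checkpoints = steps_to_redo(10_450, None)
with_checkpoints = steps_to_redo(10_450, 500)

print(f"Redo without checkpoints: {without_checkpoints} steps")
print(f"Redo with a checkpoint every 500 steps: {with_checkpoints} steps")
```

In this toy case, checkpointing every 500 steps cuts the repeated work from 10,450 steps to 450. Shorter intervals reduce the loss further but add checkpoint-writing overhead, which this model ignores.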

How to decide

If you want the fastest benchmark story, look at the seven-for-seven MLPerf sweep and the GB300 NVL72 result. If your priority is cluster size, the 8,192-GPU DeepSeek-V3 671B run is the clearest proof point. If you care about real-world adoption, the partner wins from Cohere, Midjourney, Azure, CoreWeave, and Nebius are the strongest signals.

For most AI teams, the practical takeaway is that Blackwell is being positioned as a full training platform, not just a fast GPU. It combines speed, scale, and recovery features in a way that fits frontier model work, where every lost hour and every failed run has a cost.