[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-cuda-toolkit-13-3-fixes-nested-divergence-bug-en":3,"article-related-cuda-toolkit-13-3-fixes-nested-divergence-bug-en":30,"series-research-cc337b93-2825-4fcc-a5af-77d41470616c":76},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"cc337b93-2825-4fcc-a5af-77d41470616c","cuda-toolkit-13-3-fixes-nested-divergence-bug-en","CUDA Toolkit 13.3 fixes a nested-divergence bug","\u003Cp data-speakable=\"summary\">\u003Ca href=\"\u002Ftag\u002Fcuda\">CUDA\u003C\u002Fa> Toolkit 13.3 fixes a compiler bug that could corrupt registers in nested divergent \u003Ca href=\"\u002Ftag\u002Fgpu\">GPU\u003C\u002Fa> kernels.\u003C\u002Fp>\u003Cp>\u003Ca href=\"\u002Ftag\u002Fnvidia\">NVIDIA\u003C\u002Fa>’s \u003Ca href=\"https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-toolkit-release-notes\u002Findex.html\" target=\"_blank\" rel=\"noopener\">CUDA Toolkit 13.3 release notes\u003C\u002Fa> call out a compiler fix that matters more than the version bump suggests. The bug has existed since CUDA 12.8, and in the right kernel shape it could leave stale or corrupted register values behind, which means wrong answers rather than a crash.\u003C\u002Fp>\u003Cp>The release also updates the toolkit component matrix, refreshes driver guidance, and adds platform features such as Event Tracing for Windows support for CUDA driver activity reporting. 
For teams shipping GPU code, the headline is simple: 13.3 is a maintenance release, but one with a correctness fix that should get attention.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Item\u003C\u002Fth>\u003Cth>Value\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Release\u003C\u002Ftd>\u003Ctd>CUDA Toolkit 13.3\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Bug introduced\u003C\u002Ftd>\u003Ctd>CUDA 12.8\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Minimum driver for CUDA 13.x\u003C\u002Ftd>\u003Ctd>580 or newer\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Windows driver bundling\u003C\u002Ftd>\u003Ctd>Removed starting with CUDA 13.1\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>New Windows diagnostics\u003C\u002Ftd>\u003Ctd>ETW support for CUDA driver activity\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>A compiler fix that matters for correctness\u003C\u002Fh2>\u003Cp>The most important item in the release notes is a fix for compiler-inserted thread reconvergence. NVIDIA says the issue could appear only in kernels with two or more nested levels of thread divergence, and only when the compiler elided convergence instructions for one or more divergence levels.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782676985164-f2uv.png\" alt=\"CUDA Toolkit 13.3 fixes a nested-divergence bug\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That sounds niche, but GPU code often lives in exactly that kind of branching logic: ray tracing, sparse compute, irregular data processing, and control-heavy kernels all create paths where warps split and later come back together. When reconvergence goes wrong, the result is not a clean failure mode. 
You can get stale register contents, corrupted values, and incorrect execution that is hard to reproduce.\u003C\u002Fp>\u003Cp>For developers, this is the kind of bug that can waste days. The kernel may pass tests on one input, fail on another, and look fine again after a tiny code change. A fix in the compiler matters because it removes a source of silent wrong results without requiring application code changes.\u003C\u002Fp>\u003Cul>\u003Cli>The issue dates back to CUDA 12.8.\u003C\u002Fli>\u003Cli>It affects kernels with nested thread divergence.\u003C\u002Fli>\u003Cli>It can produce wrong output rather than an obvious runtime error.\u003C\u002Fli>\u003Cli>The failure depends on compiler decisions about convergence instructions.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What changed in the toolkit release\u003C\u002Fh2>\u003Cp>CUDA 13.3 is also part of NVIDIA’s long-running shift toward independently versioned toolkit components. The release notes list separate versions for \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit\" target=\"_blank\" rel=\"noopener\">CUDA Toolkit\u003C\u002Fa> pieces such as \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit\" target=\"_blank\" rel=\"noopener\">NVCC\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit\" target=\"_blank\" rel=\"noopener\">NVRTC\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit\" target=\"_blank\" rel=\"noopener\">CUPTI\u003C\u002Fa>, and the CUDA runtime.\u003C\u002Fp>\u003Cp>That versioning model is practical, if a little messy to read. It tells you which parts moved together and which parts are on their own schedule. For example, the toolkit release notes show component versions like CUDA Runtime 13.3.29, NVCC 13.3.33, CUPTI 13.3.35, and CUDA Documentation 13.3.40. 
Those numbers matter when you are pinning builds or debugging a mismatch between your compiler, runtime, and profiling tools.\u003C\u002Fp>\u003Cp>The platform section also lists the minimum driver requirement for CUDA 13.x as 580 or newer. NVIDIA repeats the compatibility rule that the driver is backward compatible, so an application built against one toolkit version should continue to run on later compatible drivers.\u003C\u002Fp>\u003Cul>\u003Cli>CUDA Runtime: 13.3.29\u003C\u002Fli>\u003Cli>NVCC: 13.3.33\u003C\u002Fli>\u003Cli>CUPTI: 13.3.35\u003C\u002Fli>\u003Cli>CUDA Documentation: 13.3.40\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Driver policy keeps shifting for Windows users\u003C\u002Fh2>\u003Cp>One of the more practical changes in the CUDA 13.x era is driver packaging. NVIDIA says the toolkit previously included a bundled display driver for convenience, but that bundle was intended for development use only and was not recommended for production, especially on Tesla GPUs.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782676986146-lwe2.png\" alt=\"CUDA Toolkit 13.3 fixes a nested-divergence bug\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That policy changed further in CUDA 13.1 on Windows, where the display driver is no longer bundled with the toolkit. Windows users now need to download and install the right driver separately from \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fdrivers\" target=\"_blank\" rel=\"noopener\">NVIDIA’s driver downloads page\u003C\u002Fa>. Linux users can still skip driver installation during setup by avoiding the driver meta packages.\u003C\u002Fp>\u003Cp>This matters because installation assumptions can quietly break automation. 
If your CI images, workstation setup scripts, or lab machines still expect the toolkit installer to bring along a driver, CUDA 13.3 will not behave the way older setups did. The release notes also point users to the \u003Ca href=\"https:\u002F\u002Fdocs.nvidia.com\u002Fdeploy\u002Fcuda-compatibility\u002Findex.html\" target=\"_blank\" rel=\"noopener\">CUDA Compatibility Guide for Drivers\u003C\u002Fa> for the fine print.\u003C\u002Fp>\u003Cblockquote>\u003Cp>“CUDA is a software environment that allows developers to use the NVIDIA GPU for general purpose processing.”\u003C\u002Fp>\u003Ccite>NVIDIA, CUDA Toolkit documentation\u003C\u002Fcite>\u003C\u002Fblockquote>\u003Ch2>ETW support and why it matters for Windows profiling\u003C\u002Fh2>\u003Cp>CUDA 13.3 adds Event Tracing for Windows support for CUDA driver activity reporting. ETW is a built-in Windows logging system that has been around for years, and NVIDIA is using it here to expose driver activity with low overhead.\u003C\u002Fp>\u003Cp>That is useful for debugging and performance analysis because it gives Windows teams another way to observe what the GPU stack is doing without relying only on higher-level tools. If you work in enterprise Windows environments, this kind of telemetry often matters as much as raw kernel performance, because it helps explain stalls, launch latency, and system-level interactions.\u003C\u002Fp>\u003Cp>The release notes also mention mmap() support for DMA-BUF file descriptors, which points to continued work on interoperability and memory handling. 
Taken together, the platform updates are less flashy than a new model announcement, but they are the kind of changes that reduce friction for teams shipping real software.\u003C\u002Fp>\u003Cul>\u003Cli>ETW adds low-overhead reporting on Windows.\u003C\u002Fli>\u003Cli>DMA-BUF mmap() support improves interoperability paths.\u003C\u002Fli>\u003Cli>Driver activity becomes easier to inspect in diagnostics workflows.\u003C\u002Fli>\u003Cli>These changes target debugging and analysis, not just raw speed.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>How CUDA 13.3 compares with the 13.x line\u003C\u002Fh2>\u003Cp>Compared with the rest of the CUDA 13.x series, 13.3 looks like a release focused on cleanup and operational clarity. NVIDIA’s version table shows a broad stack of components already moving independently, from libraries like \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcublas\" target=\"_blank\" rel=\"noopener\">cuBLAS\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcufft\" target=\"_blank\" rel=\"noopener\">cuFFT\u003C\u002Fa>, and \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcusparse\" target=\"_blank\" rel=\"noopener\">cuSPARSE\u003C\u002Fa> to tools like Nsight Compute and Nsight Systems.\u003C\u002Fp>\u003Cp>That means the toolkit release is less about one giant feature and more about keeping a large stack aligned. In practice, the numbers tell the story:\u003C\u002Fp>\u003Cul>\u003Cli>CUDA 13.x requires driver 580 or newer.\u003C\u002Fli>\u003Cli>CUDA 13.1 removed the bundled Windows display driver.\u003C\u002Fli>\u003Cli>CUDA 13.3 ships with updated component versions across compiler, runtime, profiling, and docs.\u003C\u002Fli>\u003Cli>The corrected reconvergence bug dates back to CUDA 12.8, a release in the previous major version line.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>For teams maintaining production GPU code, the comparison that matters is not 13.3 versus 13.2 in marketing terms. 
It is whether your kernels contain nested divergence, whether your builds depend on the affected compiler behavior, and whether your deployment process still assumes the old driver packaging model.\u003C\u002Fp>\u003Cp>If your code base uses heavy branching inside kernels, 13.3 is worth testing sooner rather than later. The safest move is to run the same workloads under 13.3, compare outputs against known-good baselines, and watch for any code paths that depend on deep divergence. If nothing else, this release is a reminder that compiler behavior can change the correctness of GPU programs in ways that are easy to miss until production data exposes them.\u003C\u002Fp>\u003Cp>One open question is how many teams will treat this as a routine toolkit update versus a must-validate release. If your kernels are branch-heavy, the answer should be obvious: treat 13.3 like a correctness patch, not just another point release.\u003C\u002Fp>","CUDA Toolkit 13.3 fixes a compiler bug from 12.8 that could corrupt registers in deeply divergent GPU kernels.","docs.nvidia.com","https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-toolkit-release-notes\u002Findex.html",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782676985164-f2uv.png","research","en","5431a65e-76da-4a2a-96c5-73a6a7635903",[17,18,19,20,21],"CUDA Toolkit 13.3","NVIDIA","GPU compiler","thread divergence","driver compatibility",[23,24,25],"CUDA 13.3 fixes a compiler bug from CUDA 12.8 that could corrupt registers in nested divergent kernels.","NVIDIA now requires separate driver handling on Windows, and CUDA 13.x needs driver 580 or newer.","The release adds ETW support on Windows and updates many toolkit components 
independently.",0,"2026-06-28T20:02:39.771125+00:00","2026-06-28T20:02:39.762+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":35,"relatedPosts":39},[32],{"name":33,"slug":34},"Nvidia","nvidia",{"id":15,"slug":36,"title":37,"language":38},"cuda-toolkit-13-3-fixes-nested-divergence-bug-zh","CUDA 13.3 修掉巢狀分歧編譯錯誤","zh",[40,46,52,58,64,70],{"id":41,"slug":42,"title":43,"cover_image":44,"image_url":44,"created_at":45,"category":13},"6dcd4b03-8352-43b0-969a-c030e48afb3c","eagle3-real-speedup-kimi-k25-mi325x-en","EAGLE3 is the real speedup for Kimi-K2.5 on MI325X","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782640973161-00wl.png","2026-06-28T10:02:26.706213+00:00",{"id":47,"slug":48,"title":49,"cover_image":50,"image_url":50,"created_at":51,"category":13},"772c0694-0e86-465d-b676-012a2240eaf7","llm-fine-tuning-turns-generic-models-into-domain-tools-en","LLM fine-tuning turns generic models into domain tools","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782569906260-hdga.png","2026-06-27T14:17:57.190952+00:00",{"id":53,"slug":54,"title":55,"cover_image":56,"image_url":56,"created_at":57,"category":13},"25aef6a0-efaa-459c-bca4-77f0d462b792","rust-learners-need-permission-to-clone-first-en","Rust learners need permission to clone first, optimize later","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782552763890-fem3.png","2026-06-27T09:32:21.788692+00:00",{"id":59,"slug":60,"title":61,"cover_image":62,"image_url":62,"created_at":63,"category":13},"567f2a82-494e-493a-9d43-00dfbc8a7bfd","mistral-ocr-4-document-ai-structure-en","Mistral OCR 4 brings structure to document 
AI","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782468180808-ulcg.png","2026-06-26T10:02:37.910976+00:00",{"id":65,"slug":66,"title":67,"cover_image":68,"image_url":68,"created_at":69,"category":13},"de74bbd4-e3b6-407a-998b-b38c4170b586","autoregressive-boltzmann-generators-ditch-flows-en","Autoregressive Boltzmann Generators ditch flows","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782455575877-62qe.png","2026-06-26T06:32:30.585573+00:00",{"id":71,"slug":72,"title":73,"cover_image":74,"image_url":74,"created_at":75,"category":13},"c05899fc-dd62-4fad-a249-9748376c1ef2","river-llm-reinforcement-learning-without-answers-en","RiVER trains LLMs without ground-truth answers","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782454678234-6mk1.png","2026-06-26T06:17:27.491779+00:00",[77,82,87,92,97,102,107,112,117,122],{"id":78,"slug":79,"title":80,"created_at":81},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":83,"slug":84,"title":85,"created_at":86},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":88,"slug":89,"title":90,"created_at":91},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into 
Failure","2026-03-28T03:03:18.899465+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]