[TOOLS] 17 min read · OraCore Editors

cuda-oxide turns Rust into PTX kernels

I break down cuda-oxide’s Rust-to-CUDA flow and give you a copyable template for writing PTX kernels in Rust.

I've been using GPU toolchains long enough to stop trusting anything that says “just write normal code.” Usually that means a small lie and a pile of glue. You get a host API in one language, device code in another, a macro layer in the middle, and then you spend your afternoon figuring out why a type that compiled on the CPU got shredded the moment it crossed into the kernel. That’s the part that kept bothering me about CUDA work: the code was never really mine end-to-end. It was host Rust talking to C++ talking to CUDA talking to whatever build script got the least amount of attention.

So when I hit NVlabs/cuda-oxide on GitHub, I paid attention, and not because it was shiny. It was trying to remove the seam I keep tripping over. It compiles standard Rust directly to PTX, keeps host and device code in one workspace, and pushes the whole thing through a Rust-native pipeline. That immediately made me suspicious, which is usually a good sign. Suspicious tools are the ones worth reading carefully.

The repo is still clearly in alpha, and the maintainers say that plainly. Good. I’d rather have an honest experimental compiler than another “production-ready” story that falls apart the moment a warp barrier shows up. What got me was the shape of the project: it’s not trying to be a toy DSL. It’s trying to make SIMT kernels feel like regular Rust without pretending the GPU is magically the same as the CPU.

Source anchor: the project lives in the NVlabs/cuda-oxide repository, and the main reference is the in-progress cuda-oxide book. The README calls out the compiler pipeline, the async runtime, and the kernel model, but it also says this is alpha and API breakage is expected.

What I actually care about: one Rust file, two worlds

“single-source compilation -- host and device code live in the same file, built with one cargo oxide build”

What this actually means is that I’m not bouncing between a host crate, a device crate, and a wrapper crate just to keep a vector-add example alive. cuda-oxide puts host code and GPU kernels in the same source file, builds both with a single cargo oxide build, and compiles the device side to PTX through its own backend. That’s the part that matters. The project is not just “Rust syntax on top of CUDA.” It’s trying to let Rust own the whole story, from launch code to kernel body.

I’ve run into the opposite setup too many times. You start with a clean Rust app, then the moment you need a kernel you’re suddenly managing a second language, a second build path, and a second mental model. Even when it works, it feels like you’re renting the GPU from your own codebase. cuda-oxide is trying to stop that split-brain setup.

The README describes the pipeline as Rust → Rust MIR → Pliron IR → LLVM IR → PTX. That matters because it tells me this isn’t some handwritten code generator bolted onto a macro. It’s a compiler backend. The project is using LLVM on the back end, and it uses Pliron as an MLIR-like IR framework in Rust. That’s a serious stack, and also a lot of moving parts, which is exactly why I’d want a project like this to be explicit about its boundaries.

How to apply it: if you’re evaluating cuda-oxide for real work, don’t start by thinking “can it replace CUDA today?” Start by asking “can I keep my kernel logic in the same Rust module as the code that launches it?” If your team spends time translating types across language boundaries, this project is trying to delete that tax. If your team already has a clean CUDA pipeline, the value is less obvious until you hit maintenance pain.

  • Use one workspace for host and device code.
  • Treat the compiler as the product, not the macros.
  • Look for places where language boundaries are causing bugs, not just friction.

The #[kernel] story is better than another macro trick

“a rustc codegen backend that compiles #[kernel] functions to CUDA PTX”

What this actually means is that a kernel is still a Rust function, but it gets lowered by the compiler into GPU code. That sounds small until you compare it with the usual “special function signature plus special launcher plus special syntax” routine. Here, the kernel is not a foreign artifact. It’s Rust that the compiler understands as device code.

I like that the README’s quick start shows a generic function with a closure capture being passed into the kernel launch. That’s a very Rust-shaped way to think about GPU work. Instead of hand-writing a bunch of CUDA boilerplate for every parameter, you let monomorphization and the compiler do the boring part. The example uses a closure like move |x: f32| x * factor, and cuda-oxide scalarizes and passes it to the GPU automatically. That’s a good sign: it means the project is trying to preserve Rust ergonomics where it can.
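To make the closure point concrete, here is a CPU-only sketch in plain Rust. It does not use cuda-oxide at all: map_cpu is a hypothetical stand-in I made up for the generic kernel. It illustrates that move |x: f32| x * factor is just a value whose only state is the captured f32, and that the generic function is specialized for that exact closure type, which is roughly what a compiler needs in order to pass the capture across as ordinary kernel arguments.

```rust
use std::mem::size_of_val;

/// Hypothetical CPU stand-in for a generic `map` kernel (not cuda-oxide API).
/// The compiler emits one specialized copy per (T, F) pair it is called with.
fn map_cpu<T: Copy, F: Fn(T) -> T>(f: F, input: &[T], out: &mut [T]) {
    for (o, &x) in out.iter_mut().zip(input) {
        *o = f(x);
    }
}

fn main() {
    let factor = 2.5f32;
    // The closure captures `factor` by value, so its only state is one f32.
    let scale = move |x: f32| x * factor;
    println!("closure state: {} bytes", size_of_val(&scale));

    let input: Vec<f32> = (0..4).map(|i| i as f32).collect();
    let mut out = vec![0.0f32; input.len()];
    map_cpu(scale, &input, &mut out);
    println!("{:?}", out);
}
```

If a second call site passes a different closure, the compiler generates a second specialization. That is the "boring part" being automated.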

I’ve personally been burned by “nice” macro APIs that only stay nice until you need generics, captures, or a second kernel variant. Then the façade cracks and you’re back to writing launch plumbing by hand. cuda-oxide looks more honest than that because it admits the compiler has to understand the shape of your program. It’s not pretending macros alone can save you.

How to apply it: when you read the book, focus on the kernel authoring model first. Don’t start with async, and don’t start with fancy tensor ops. Start with how a Rust function becomes PTX, what rules apply to parameters, and which parts of normal Rust are preserved. If you can’t explain the kernel lowering model in one paragraph, you’re not ready to bet a codebase on it.

  • Prefer idiomatic Rust function shapes over custom device DSLs.
  • Test how generics and closures behave before adopting the tool.
  • Assume the compiler, not the macro, is where the real contract lives.

The host runtime is the part people underestimate

“a host-side runtime for memory management, pinned host transfers, and kernel launching”

What this actually means is that cuda-oxide isn’t only about compiling kernels. It’s also trying to own the annoying host-side work that makes kernels usable: allocating device memory, moving data, pinning host buffers, and launching work with typed helpers. That’s the stuff nobody puts on the demo slide, but it’s where most of the integration pain lives.

The README points at crates like cuda-core and cuda-async. That split is useful. I read it as: one crate handles the safe-ish core runtime primitives, and another handles asynchronous device operations. In practice, that’s the difference between “I can launch a kernel” and “I can build a real pipeline without turning every step into callback soup.”

I ran into this exact problem in a GPU data path where the kernel code was fine, but the host orchestration was a mess. The code was correct and still miserable because every memory transfer and stream sync had to be hand-wired. A runtime layer that speaks Rust types on the host side is not a luxury there. It’s the whole point.

How to apply it: if you’re trying cuda-oxide, inspect the host API with the same suspicion you apply to the kernel API. Can you allocate, transfer, launch, and retrieve results without dropping into raw FFI? Does the runtime force you to understand CUDA internals just to move a buffer? If yes, that’s a warning. If no, you may actually have something usable.

The README’s quick start shows a CudaContext, a DeviceBuffer, a LaunchConfig, and a typed module loader. That’s the right shape. The host side should feel like a control plane, not a second programming language.

Async GPU work is where the project gets interesting

“for composable async GPU work, stream: disappears, {kernel}_async returns a lazy DeviceOperation, and execution happens when you call .sync() or .await”

What this actually means is that cuda-oxide is not just trying to compile kernels. It’s trying to model GPU work as something you can compose in Rust instead of manually sequencing every operation with streams everywhere. That’s a much better ambition than “we compiled to PTX, good luck.”

I like this because async GPU code is where a lot of otherwise decent abstractions fall apart. You can make a synchronous demo look elegant pretty quickly. The real pain starts when you need to overlap transfers, launches, and post-processing. Then every API decision shows up in your stack trace, your profiler, and your mood.

The README says stream: disappears in the async path and {kernel}_async returns a lazy DeviceOperation. That is a strong design signal. It means the project is trying to make GPU work look like a deferred computation rather than a pile of imperative stream calls. The async layer seems to be built around cuda-async, which is exactly where I’d want that complexity to live.
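A small model helps me reason about what "lazy" should mean here. The sketch below is plain std Rust, and LazyOp is a hypothetical type of my own, not the cuda-async DeviceOperation. It only shows the contract I would expect from a deferred operation: building and chaining does no work, and everything runs at one explicit sync point.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical model of a deferred operation (not the cuda-async type).
/// It stores work as a closure and runs nothing until `sync` is called.
struct LazyOp<T> {
    run: Box<dyn FnOnce() -> T>,
}

impl<T: 'static> LazyOp<T> {
    fn new(f: impl FnOnce() -> T + 'static) -> Self {
        LazyOp { run: Box::new(f) }
    }

    /// Chain a follow-up step; still nothing executes here.
    fn then<U: 'static>(self, next: impl FnOnce(T) -> U + 'static) -> LazyOp<U> {
        LazyOp::new(move || next((self.run)()))
    }

    /// The explicit sync boundary: the whole chain executes now.
    fn sync(self) -> T {
        (self.run)()
    }
}

fn main() {
    let launches = Rc::new(Cell::new(0u32));
    let counter = Rc::clone(&launches);

    let op = LazyOp::new(move || {
        counter.set(counter.get() + 1); // stands in for a kernel launch
        vec![1.0f32, 2.0, 3.0]
    })
    .then(|v| v.iter().sum::<f32>());

    println!("launches before sync: {}", launches.get());
    let total = op.sync();
    println!("launches after sync: {}, total = {}", launches.get(), total);
}
```

The counter is the useful part. If an API claims to be lazy, a test like this should show zero side effects before the sync call.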

I’ve seen GPU codebases where stream management becomes the hidden architecture. Nobody planned for it, but suddenly every subsystem knows about streams, events, and sync points. That’s how accidental complexity spreads. A composable async layer can help, but only if it stays honest about what is actually deferred and what still has to be synchronized.

How to apply it: if you care about throughput, test async early. Don’t wait until the kernels are “done.” Build the smallest pipeline you can: upload, launch, download, validate. Then see whether the async abstraction lets you chain steps without turning the code into a state machine. If it does, that’s a real win. If it doesn’t, the abstraction is just a nicer costume on the same old stream management.

  • Check whether async operations compose cleanly across multiple kernels.
  • Look for a clear sync boundary, not hidden blocking.
  • Measure whether the async path makes your pipeline easier to reason about, not just faster.

The compiler pipeline is the real product

“Rust → Rust MIR → Pliron IR → LLVM IR → PTX”

What this actually means is that the project is not just shipping syntax sugar. It’s shipping a compiler architecture. That matters because compiler architecture is where the long-term maintainability either gets built in or gets kicked down the road until nobody wants to touch it.

The README also mentions running cargo oxide pipeline vecadd to show the full compilation path, which is exactly the kind of transparency I want from a compiler project. If you’re building on top of this, you need to know where lowering happens, where optimization happens, and where device-specific lowering starts. Otherwise every bug becomes a scavenger hunt.

I’m also paying attention to the toolchain requirements because they tell me how grounded the project is. The repo calls out pinned nightly Rust, CUDA Toolkit 12.x or newer, Clang, and LLVM 21 or newer for certain intrinsics. That’s not casual, but it is realistic. GPU compiler work tends to be environment-sensitive whether people admit it or not. I’d rather see a project tell me upfront that the llc version matters than pretend my distro defaults are fine.

How to apply it: before you adopt cuda-oxide, check your willingness to own the toolchain. If your team hates pinned nightlies, CUDA installs, and LLVM version matching, this will annoy you. If your team already runs GPU builds in containers or Nix shells, the setup may actually be cleaner than your current stack.
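Owning the toolchain gets easier once the version check is scripted. Here is a minimal preflight sketch in std Rust, assuming llc is on PATH and prints a line like "LLVM version 21.1.0" for --version. The llvm_major helper and the threshold of 21 are my own choices, taken from the README's LLVM requirement; this is not part of cuda-oxide.

```rust
/// Extract the major version from `llc --version` style output,
/// which contains a line such as "LLVM version 21.1.0".
fn llvm_major(version_output: &str) -> Option<u32> {
    let rest = version_output.split("LLVM version ").nth(1)?;
    let major: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    major.parse().ok()
}

fn main() {
    // Assumption: the `llc` found on PATH is the one the build will use.
    match std::process::Command::new("llc").arg("--version").output() {
        Ok(out) => {
            let text = String::from_utf8_lossy(&out.stdout);
            match llvm_major(&text) {
                Some(major) if major >= 21 => println!("llc {major}: ok"),
                Some(major) => println!("llc {major}: older than 21"),
                None => println!("could not parse llc version"),
            }
        }
        Err(err) => println!("llc not found: {err}"),
    }
}
```

Keeping the parser separate from the process call means the parser can be unit-tested without LLVM installed.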

The repo’s Nix setup is worth a look too: Nix plus flake-based dev shells can make a compiler project much less painful to reproduce. That is not a small detail. A GPU compiler that only works on one developer machine is not a compiler, it’s a hobby.

Safety is qualified here, and I think that’s the right call

“safe(ish), idiomatic Rust”

What this actually means is that the project is not promising magical memory safety on the GPU. It’s trying to keep as much of Rust’s discipline as possible while still dealing with device realities like shared memory, atomics, barriers, and warp ops. That “ish” is doing real work in the sentence, and I appreciate that.

The README lists device-side abstractions like type-safe indexing, shared memory, scoped atomics, barriers, TMA, and warp/cluster ops. That’s a broad surface area, and each one of those features can bite you if the abstraction gets too cute. The fact that the project is explicit about supporting operations like CUDA atomics and barriers tells me it understands the hardware side is not optional.
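The idea behind type-safe indexing is easier to see on the CPU first. The sketch below is an analogy in std Rust, not cuda-oxide code: scale_in_chunks is a hypothetical helper that gives each worker thread an exclusive, non-overlapping piece of the output. My assumption is that a device-side type like DisjointSlice aims for the same guarantee per GPU thread, but I have not verified how it is enforced.

```rust
use std::thread;

/// CPU analogy only: each worker gets an exclusive, non-overlapping
/// chunk of the output, so no two workers can write the same element.
fn scale_in_chunks(input: &[f32], out: &mut [f32], factor: f32, chunk: usize) {
    assert_eq!(input.len(), out.len());
    assert!(chunk > 0);
    thread::scope(|s| {
        for (src, dst) in input.chunks(chunk).zip(out.chunks_mut(chunk)) {
            // `dst` is a unique `&mut` borrow; the borrow checker rejects
            // any attempt to hand the same chunk to two workers.
            s.spawn(move || {
                for (o, &x) in dst.iter_mut().zip(src) {
                    *o = x * factor;
                }
            });
        }
    });
}

fn main() {
    let input: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let mut out = vec![0.0f32; input.len()];
    scale_in_chunks(&input, &mut out, 2.0, 3);
    println!("{:?}", out);
}
```

On the CPU the compiler proves disjointness from how chunks_mut splits the slice. On a GPU the index comes from the hardware thread ID, so the abstraction has to carry that proof some other way, and that is the part worth reading the book for.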

I’ve been around enough GPU abstractions to know that “safe” often means “we wrapped the sharp edge and hid the sharp edge somewhere else.” That’s not always bad. It just means the contract has to be honest. If the compiler can prevent some classes of bug while still letting you express warp-level behavior, that’s valuable. If it tries to eliminate all sharp edges, it will probably fail or get in your way.

How to apply it: treat cuda-oxide’s safety story as partial, not absolute. Read the book chapters on SIMT authoring, barriers, and atomics before you trust it with anything nontrivial. Then write tests around every kernel that relies on synchronization. Safety in GPU code is mostly about reducing the number of ways you can be wrong, not pretending wrong is impossible.
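For the "write tests around every kernel" step, I would start with a CPU reference and a tolerance check. This is a minimal sketch in std Rust; first_mismatch is a hypothetical helper, and the got vector is a placeholder for data copied back from the device.

```rust
/// Index of the first element where `got` differs from `want` by more
/// than `tol`, or None if they match. NaN always counts as a mismatch.
fn first_mismatch(got: &[f32], want: &[f32], tol: f32) -> Option<usize> {
    if got.len() != want.len() {
        // Report the first index that one side is missing.
        return Some(got.len().min(want.len()));
    }
    got.iter()
        .zip(want)
        .position(|(g, w)| !((g - w).abs() <= tol))
}

fn main() {
    let input: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    // CPU reference for a "multiply by 2.5" map kernel.
    let want: Vec<f32> = input.iter().map(|x| x * 2.5).collect();

    // Placeholder: in a real test this vector comes back from the device.
    let got = want.clone();

    match first_mismatch(&got, &want, 1e-5) {
        None => println!("kernel output matches CPU reference"),
        Some(i) => println!("first mismatch at index {i}"),
    }
}
```

An absolute tolerance is fine for a map like this. For reductions or GEMM, where summation order differs between CPU and GPU, a relative tolerance is usually the safer choice.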

That’s also why I think the project’s examples matter so much. The README lists things like vector addition, generic kernels, device-side Ord::cmp lowering, GEMM, Blackwell tensor cores, and async pipelines. Those examples are not fluff. They show where the compiler already has to survive contact with real GPU patterns.

How I’d evaluate this before putting it in a repo

If I were adopting cuda-oxide tomorrow, I’d do it in this order: first, build a tiny vector-add or map kernel. Second, confirm host/device code sharing feels normal. Third, test one generic kernel with a closure capture. Fourth, test async composition if my workload needs it. Fifth, only then look at more advanced features like tensor cores, clusters, or device FFI.

I would not start with the most exotic example in the repo. That’s how people fool themselves into thinking a compiler is ready because one shiny demo worked. Start boring. If boring works, then the fancy stuff has a chance.

I’d also keep an eye on version pinning and platform support. The README says Linux is the tested target, with Ubuntu 24.04 called out. That’s useful, but it also means I should assume macOS and Windows are not where I’d start. The project is open about that, and I respect that more than fake portability.

One more thing I’d check is how much of my code would depend on the current macro and backend shape. The repo is still in active development, so API breakage is on the table. That’s fine for experimentation, dangerous for commitment. You want to know which layer is likely to churn: macros, runtime APIs, or lowering internals.
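One way to limit exposure to that churn is a thin seam between the application and the GPU layer. The sketch below is plain Rust with hypothetical names (ScaleBackend, CpuBackend, pick_backend). It shows only the CPU side, because any GPU-backed body I wrote would be a guess at an API that may still change.

```rust
/// Hypothetical seam: the rest of the codebase depends on this trait,
/// not on any cuda-oxide type.
trait ScaleBackend {
    fn scale(&self, input: &[f32], factor: f32) -> Vec<f32>;
}

/// Plain CPU fallback, always available.
struct CpuBackend;

impl ScaleBackend for CpuBackend {
    fn scale(&self, input: &[f32], factor: f32) -> Vec<f32> {
        input.iter().map(|x| x * factor).collect()
    }
}

// A GPU-backed implementation would live in its own module, behind a
// Cargo feature, and implement the same trait. It is omitted here on
// purpose: its body depends on APIs that are still in flux.

/// Select a backend; this sketch always returns the CPU fallback.
fn pick_backend() -> Box<dyn ScaleBackend> {
    Box::new(CpuBackend)
}

fn main() {
    let backend = pick_backend();
    println!("{:?}", backend.scale(&[1.0, 2.0, 3.0], 2.0));
}
```

With this shape, an API break in the GPU layer touches one module, and the CPU path doubles as the reference implementation for tests.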

The template you can copy

# cuda-oxide adoption template

## 1) Project goal
I want to write GPU kernels in Rust and keep host + device code in one workspace.

## 2) First proof-of-life
Build one boring kernel first:
- vector add
- map over a slice
- reduction

## 3) Questions to answer before adoption
- Can I define a kernel as a normal Rust function with #[kernel]?
- Can host code launch it without raw CUDA FFI?
- Can I pass a closure or generic parameter cleanly?
- Can I move data with DeviceBuffer without custom glue?
- Does async GPU work compose without manual stream plumbing?

## 4) Minimal kernel pattern
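// Device-side and host-side crates, as named in the README quick start.
use cuda_device::{cuda_module, kernel, thread, DisjointSlice};
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};

#[cuda_module]
mod kernels {
    use super::*;

    // NOTE: the generic parameters were lost in extraction. The bounds
    // below are reconstructed from how `f`, `input`, and `out` are used;
    // check them against the current README before copying.
    #[kernel]
    pub fn map<T: Copy, F: Fn(T) -> T>(f: F, input: &[T], mut out: DisjointSlice<T>) {
        // One GPU thread handles one element.
        let idx = thread::index_1d();
        let i = idx.get();

        // `get_mut` yields None for out-of-range threads, so no manual bounds check.
        if let Some(out_elem) = out.get_mut(idx) {
            *out_elem = f(input[i]);
        }
    }
}

fn main() {
    let ctx = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();

    // Upload the input and allocate a zeroed output buffer on the device.
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let input = DeviceBuffer::from_host(&stream, &data).unwrap();
    let mut output = DeviceBuffer::zeroed(&stream, 1024).unwrap();

    let module = kernels::load(&ctx).unwrap();
    let factor = 2.5f32;

    // Turbofish reconstructed: T is f32, the closure type is inferred.
    module.map::<f32, _>(
        &stream,
        LaunchConfig::for_num_elems(1024),
        move |x: f32| x * factor,
        &input,
        &mut output,
    ).unwrap();

    // Copy back and spot-check one element: 1.0 * 2.5 == 2.5.
    let result = output.to_host_vec(&stream).unwrap();
    assert!((result[1] - 2.5).abs() < 1e-5);
}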

## 5) Adoption checklist
- [ ] Rust nightly pinned
- [ ] CUDA toolkit installed
- [ ] LLVM/llc version verified
- [ ] clang/libclang available for bindgen
- [ ] cargo oxide doctor passes
- [ ] one kernel compiles to PTX
- [ ] one host launch succeeds
- [ ] one async pipeline works

## 6) Rollout rule
If the compiler changes too fast, keep cuda-oxide behind a feature flag until the API settles.

## 7) What to document in your repo
- supported toolchain versions
- one working kernel example
- one async example
- known limitations
- fallback path to plain CUDA if needed

The main thing I’d copy from cuda-oxide is the idea that the compiler, runtime, and kernel model should all be visible to the developer. That’s the part that keeps GPU code from becoming folklore. I don’t want a black box with a nicer README. I want a toolchain I can inspect, test, and eventually trust.

Most of the value here is not “Rust on GPU” as a slogan. It’s the chance to keep one language, one workspace, and one mental model while still targeting PTX. If that’s the problem you actually have, this project is worth your time.

Source attribution: I based this breakdown on NVlabs/cuda-oxide and its README plus repository structure. The template above is my derivative summary of the project’s patterns, not copied text.