zCUDA

Write CUDA kernels in pure Zig. Or keep your C++ kernels. Ship them all to GPU.

zCUDA is a dual-ecosystem GPU programming framework for Zig:

  • 🔗 CUDA Binding Layer — Complete, type-safe Zig bindings for CUDA C++ libraries (driver API, cuBLAS, cuDNN, cuFFT, cuSOLVER, cuSPARSE, cuRAND, NVRTC, NVTX, CUPTI, cuFile, and beyond). Seamlessly call existing, battle-tested CUDA C++ kernels (.cu files) from Zig — JIT compile via NVRTC or load pre-compiled PTX. Your mature C++ kernel ecosystem stays intact.
  • ⚡ CUDA Kernel DSL — A zero-overhead device-side API (zcuda_kernel) that compiles pure Zig directly to PTX at build time. No CUDA C++. No nvcc. Just Zig, all the way down to the CUDA registers — with full compile-time type safety, auto-discovery, and type-safe bridge generation.

The best of both worlds: Reuse the industry's vast collection of mature CUDA C++ kernels via NVRTC JIT or pre-compiled PTX — zero rewrite, zero migration friction. Write new custom kernels in pure Zig with auto-discovery, type-safe bridges, and zig build integration. Both paths produce PTX, both launch through the same stream.launch() API, and both ship in your final binary. Start with C++, migrate to Zig at your own pace, or keep both forever.

Why pure Zig? Zig brings a rare combination to GPU programming: C-level performance (zero-cost abstractions, no GC, no hidden allocations, deterministic memory layout); Python-like readability (clean syntax, no header files, no forward declarations, no preprocessor macros); one unified toolchain, where zig build handles build, test, package, and deploy, with a built-in package manager (build.zig.zon, reproducible hashes, semantic versioning), a built-in unit test runner, and first-class cross-compilation to any CUDA-capable platform (x86_64, aarch64, riscv64, and beyond) from any host; compile-time type safety that catches errors at build time; and defer-based resource management.

Overview

| Metric | Value |
| --- | --- |
| Zig | 0.16.0-dev.2535+b5bd49460 |
| CUDA Toolkit | 12.8 |
| Modules | 11 (driver, nvrtc, cublas, cublaslt, curand, cudnn, cusolver, cusparse, cufft, nvtx, cupti) |
| Tests | 39 files (22 unit + 17 integration) |
| Host Examples | 58 across 10 categories (50 run-xxx targets) + 24 integration |
| Kernel Examples | 80 pure-Zig GPU kernels across 11 categories |
| GPU Validated | ✅ sm_86 (RTX 3080 Ti, CUDA 12.8) — all unit + integration tests PASS, 50/50 run-xxx examples PASS, zcuda-demo correctness checks PASSED (zig build test + zig build run, EXIT=0) |

Features

CUDA Binding Layer:

  • ✅ Type-safe — Idiomatic Zig API with compile-time type checking, Zig error unions instead of C error codes (see the sketch after this list)
  • ✅ Memory-safe — RAII-style resource management with defer, no leaked GPU memory, streams, or contexts
  • ✅ Zero-cost — Direct C API calls via @cImport, zero-overhead wrappers, no runtime reflection
  • ✅ Comprehensive — 11 CUDA library bindings (driver, nvrtc, cublas, cublaslt, curand, cudnn, cusolver, cusparse, cufft, nvtx, cupti) with full API coverage
  • ✅ Three-layer architecture — sys (raw FFI) → result (error wrapping) → safe (ergonomic user API)
  • ✅ Modular — Enable only the libraries you need via -D build flags, unused modules are zero-cost
  • ✅ C++ Compatible — Call existing CUDA C++ kernels via NVRTC JIT or pre-compiled PTX, zero migration friction
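
A hedged sketch of the first two bullets in practice (error unions and defer-based cleanup); the calls match the Quick Start below, while the printed message is illustrative:

const std = @import("std");
const cuda = @import("zcuda");

pub fn main() !void {
    // C status codes become Zig error unions: handle with catch or propagate with try.
    const ctx = cuda.driver.CudaContext.new(0) catch |err| {
        std.debug.print("CUDA init failed: {}\n", .{err});
        return err;
    };
    defer ctx.deinit(); // RAII-style: released on every exit path
}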

CUDA Kernel DSL:

  • ✅ Pure Zig Kernels — Write CUDA kernels in pure Zig with zcuda_kernel, compile to PTX at zig build time
  • ✅ Auto-discovery — Kernels detected by content (export fn), no manual registration or config files
  • ✅ Full device intrinsics — shared memory, atomics (atomicAdd, atomicCAS), warp shuffles (__shfl_sync, __shfl_xor_sync), __syncthreads, printf, thread/block/grid indexing
  • ✅ Tensor Core support — WMMA and MMA intrinsics for f16, bf16, tf32, int8, fp8 matrix operations
  • ✅ Type-safe bridge generation — Auto-generated Fn enum per kernel, function name typos caught at compile time (see the sketch after this list)
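
For the bridge bullet, a minimal sketch of the generated module from the caller's side (mirroring the host application in Step 3 below; ctx is an existing CudaContext):

const my_kernel = @import("my_kernel"); // auto-generated bridge module

const module = try my_kernel.load(ctx);
defer module.deinit();
// Function names are a generated enum, so a typo like .myAdds fails to compile:
const func = try my_kernel.getFunction(module, .myAdd);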

Ecosystem & Tooling:

  • ✅ 5 kernel loading methods — filesystem PTX, NVRTC JIT (inline), @embedFile JIT, @embedFile PTX, build.zig auto-generated bridge module
  • ✅ Hybrid Ready — Mix CUDA C++ and Zig kernels in the same project, both produce PTX, both use the same stream.launch() API
  • ✅ Cross-compilation — Target x86_64, aarch64, riscv64, and beyond from any host
  • ✅ Downstream export — Use as a Zig package via build.zig.zon, exposes pub const build_helpers for downstream kernel compilation
  • ✅ Built-in testing — comptime-verifiable unit tests, zig build test with no external test framework

Quick Start

Prerequisites

  • Zig 0.16.0-dev.2535+b5bd49460
  • CUDA Toolkit 12.x (with nvcc, libcuda, libcudart, libnvrtc)
  • cuDNN 9.x (optional, for cudnn module)
  • NVIDIA GPU with Compute Capability 7.0+ (Volta and later)

Build & Test

git clone https://github.com/coderonion/zcuda
cd zcuda

zig build                                    # Build library (driver + nvrtc + cublas + curand)
zig build test                               # Run all tests (unit + integration)
zig build test-unit                          # Unit tests only
zig build test-integration                   # Integration tests only

# Enable optional modules
zig build -Dcudnn=true -Dcusolver=true

# All modules
zig build -Dcublas=true -Dcublaslt=true -Dcurand=true -Dcudnn=true \
          -Dcusolver=true -Dcusparse=true -Dcufft=true -Dnvtx=true

Basic Usage

const std = @import("std");
const cuda = @import("zcuda");

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // Create a CUDA context on device 0
    const ctx = try cuda.driver.CudaContext.new(0);
    defer ctx.deinit();

    const stream = ctx.defaultStream();

    // Allocate and transfer data
    const host_data = [_]f32{ 1.0, 2.0, 3.0, 4.0 };
    const dev_data = try stream.cloneHtoD(f32, &host_data);
    defer dev_data.deinit();

    // Compile and launch a kernel
    const ptx = try cuda.nvrtc.compilePtx(allocator,
        \\extern "C" __global__ void add1(float *data, int n) {
        \\    int i = blockIdx.x * blockDim.x + threadIdx.x;
        \\    if (i < n) data[i] += 1.0f;
        \\}
    );
    defer allocator.free(ptx);

    const module = try ctx.loadModule(ptx);
    defer module.deinit();
    const kernel = try module.getFunction("add1");

    try stream.launch(kernel, cuda.LaunchConfig.forNumElems(4),
        .{ &dev_data, @as(i32, 4) });
    try stream.synchronize();

    // Read back results
    var result: [4]f32 = undefined;
    try stream.memcpyDtoH(f32, &result, dev_data);
    // result = { 2.0, 3.0, 4.0, 5.0 }
}

📦 Use as Zig Package

Add zCUDA as a dependency — CUDA library linking, kernel compilation, and bridge generation are all handled automatically.

Tip

zcuda-demo is a fully worked reference project that imports zcuda as a local package. It covers NVRTC JIT, pure Zig GPU kernels (bridge module), cuBLAS SGEMM, cross-validation, and performance benchmarking. Clone it alongside zcuda to see a complete, runnable example of every step below:

git clone https://github.com/coderonion/zcuda
git clone https://github.com/coderonion/zcuda-demo
cd zcuda-demo && zig build run -Dgpu-arch=sm_86
# → ✓ All correctness checks PASSED (Sections 1–4)

Step 1: Add dependency to build.zig.zon

Local path (for development):

.dependencies = .{
    .zcuda = .{
        .path = "../zcuda",
    },
},

Git URL (for release):

.dependencies = .{
    .zcuda = .{
        .url = "https://github.com/coderonion/zcuda/archive/v0.1.0.tar.gz",
        .hash = "HASH_VALUE",
    },
},

Tip

How to get the hash: Add .url without .hash, then run zig build. Zig will display the correct hash in the error output — copy it into your build.zig.zon.

Step 2: Configure build.zig

A typical build.zig has three parts: 1. import zcuda, 2. discover & compile kernels, 3. wire everything to your executable.

1. Import zcuda dependency

Option A — Simple (hardcoded flags):

const zcuda_dep = b.dependency("zcuda", .{
    .target    = target,
    .optimize  = optimize,
    .cublas    = true,   // cuBLAS     (default: true)
    .cublaslt  = true,   // cuBLAS LT  (default: true)
    .curand    = true,   // cuRAND     (default: true)
    .nvrtc     = true,   // NVRTC      (default: true)
    .cudnn     = false,  // cuDNN      (default: false)
    .cusolver  = false,  // cuSOLVER   (default: false)
    .cusparse  = false,  // cuSPARSE   (default: false)
    .cufft     = false,  // cuFFT      (default: false)
    .cupti     = false,  // CUPTI      (default: false)
    .cufile    = false,  // cuFile     (default: false)
    .nvtx      = false,  // NVTX       (default: false)
    // .@"cuda-path" = "/usr/local/cuda",  // optional: override auto-detect
});

Option B — Dynamic (forward to CLI):

Expose zcuda flags as your project's build options, so users can toggle modules at build time:

const enable_cublas   = b.option(bool, "cublas",   "Enable cuBLAS")   orelse true;
const enable_cublaslt = b.option(bool, "cublaslt", "Enable cuBLAS LT") orelse true;
const enable_curand   = b.option(bool, "curand",   "Enable cuRAND")   orelse true;
const enable_nvrtc    = b.option(bool, "nvrtc",    "Enable NVRTC")    orelse true;
const enable_cudnn    = b.option(bool, "cudnn",    "Enable cuDNN")    orelse false;
const enable_cusolver = b.option(bool, "cusolver", "Enable cuSOLVER") orelse false;
const enable_cusparse = b.option(bool, "cusparse", "Enable cuSPARSE") orelse false;
const enable_cufft    = b.option(bool, "cufft",    "Enable cuFFT")    orelse false;
const enable_nvtx     = b.option(bool, "nvtx",     "Enable NVTX")    orelse false;
const cuda_path       = b.option([]const u8, "cuda-path", "Path to CUDA installation (default: auto-detect)");

const zcuda_dep = b.dependency("zcuda", .{
    .target    = target,
    .optimize  = optimize,
    .cublas    = enable_cublas,
    .cublaslt  = enable_cublaslt,
    .curand    = enable_curand,
    .nvrtc     = enable_nvrtc,
    .cudnn     = enable_cudnn,
    .cusolver  = enable_cusolver,
    .cusparse  = enable_cusparse,
    .cufft     = enable_cufft,
    .nvtx      = enable_nvtx,
    .@"cuda-path" = cuda_path,
});

2. Set up GPU architecture target

build_helpers provides ready-made helpers so you don't need to copy GPU arch boilerplate into every downstream project:

const bridge    = @import("zcuda").build_helpers;
const gpu_arch  = b.option([]const u8, "gpu-arch", "Target GPU SM arch (default: sm_80)") orelse "sm_80";
const embed_ptx = b.option(bool, "embed-ptx", "Embed PTX in binary") orelse false;

// Resolve nvptx64 target from "sm_XX" string (one call, no boilerplate)
const nvptx_target = bridge.resolveNvptxTarget(b, gpu_arch);

// Create the GPU-side device intrinsics module (compiled for nvptx64, not the host)
// sm_version build_options and internal paths are handled automatically
const device_mod = bridge.makeDeviceModule(b, zcuda_dep, nvptx_target, gpu_arch);

Supported SM versions: sm_52, sm_60, sm_70, sm_75, sm_80, sm_86, sm_89, sm_90, sm_100.

3. Discover & compile GPU kernels

const kernel_step = b.step("compile-kernels", "Compile Zig GPU kernels to PTX");
const kernel_dir  = b.option([]const u8, "kernel-dir",
    "Root dir for kernel discovery (default: src/kernel/)") orelse "src/kernel/";

// Recursively scan kernel directory for .zig files containing `export fn`
const kernels = bridge.discoverKernels(b, kernel_dir);

// Compile Zig → PTX + generate type-safe bridge modules
const result = bridge.addBridgeModules(b, kernels, .{
    .embed_ptx        = embed_ptx,
    .zcuda_bridge_mod = zcuda_dep.module("zcuda_bridge"),
    .zcuda_mod        = zcuda_dep.module("zcuda"),
    .device_mod       = device_mod,
    .nvptx_target     = nvptx_target,
    .kernel_step      = kernel_step,
    .target           = target,
    .optimize         = optimize,
});

4. Create executable + wire imports

const exe = b.addExecutable(.{
    .name = "my_app",
    .root_module = b.createModule(.{
        .root_source_file = b.path("src/main.zig"),
        .target   = target,
        .optimize = optimize,
    }),
});

exe.root_module.addImport("zcuda", zcuda_dep.module("zcuda"));

// Link libc + CUDA libraries.
// Zig 0.16.0-dev does not propagate mod.linkSystemLibrary to downstream exe,
// so these must match the flags you passed to b.dependency("zcuda", ...).
exe.root_module.link_libc = true;
exe.root_module.linkSystemLibrary("cuda",   .{});
exe.root_module.linkSystemLibrary("cudart", .{});
// Add libraries matching your enabled dependency flags, e.g.:
// if (enable_nvrtc)    exe.root_module.linkSystemLibrary("nvrtc",    .{});
// if (enable_cublas)   exe.root_module.linkSystemLibrary("cublas",   .{});
// if (enable_cublaslt) exe.root_module.linkSystemLibrary("cublasLt", .{});
// if (enable_curand)   exe.root_module.linkSystemLibrary("curand",   .{});

// Mount kernel bridge modules (choose one):
//
// Option A — Mount all (single-exe projects):
for (result.modules) |entry| {
    exe.root_module.addImport(entry.name, entry.module);
    // For disk-mode (non-embedded) PTX, ensure PTX is installed before exe:
    if (entry.install_step) |s| b.getInstallStep().dependOn(s);
}
//
// Option B — Selective (multi-exe projects):
// if (bridge.findBridge(result.modules, "my_kernel")) |mod| {
//     exe.root_module.addImport("my_kernel", mod);
// }

b.installArtifact(exe);

const run_cmd = b.addRunArtifact(exe);
run_cmd.step.dependOn(b.getInstallStep());
b.step("run", "Build and run").dependOn(&run_cmd.step);

Note

linkSystemLibrary is not auto-propagated in Zig 0.16.0-dev. You must call linkSystemLibrary on your exe for each library you enabled in b.dependency("zcuda", ...). Using Option B (flag-forwarding) lets you keep these in sync automatically:

if (enable_cublas) exe.root_module.linkSystemLibrary("cublas", .{});

Tip

Mount all adds every discovered kernel bridge — ideal for single-app projects. Selective uses findBridge() to pick specific kernels — useful when multiple executables each need different kernels.

Step 3: Write your code

GPU kernel (zcuda_kernel)

Create .zig files anywhere under your kernel directory. Import zcuda_kernel for the full device-side API:

// kernels/my_kernel.zig
const cuda = @import("zcuda_kernel");

export fn myAdd(
    A: [*]const f32, B: [*]const f32, C: [*]f32, n: u32,
) callconv(.kernel) void {
    const i = cuda.blockIdx().x * cuda.blockDim().x + cuda.threadIdx().x;
    if (i < n) C[i] = A[i] + B[i];
}

Note

Detection is content-based: any .zig file containing export fn is auto-recognized as a kernel. No naming conventions or manual registration required.

zcuda_kernel API quick reference — naming matches CUDA C++ for seamless migration:

| Category | Zig (zcuda_kernel) | CUDA C++ Equivalent |
| --- | --- | --- |
| Thread Indexing | cuda.threadIdx(), cuda.blockIdx(), cuda.blockDim(), cuda.gridDim() | threadIdx.x, blockIdx.x, blockDim.x, gridDim.x |
| Synchronization | cuda.__syncthreads(), cuda.__threadfence(), cuda.__syncwarp(mask) | __syncthreads(), __threadfence(), __syncwarp() |
| Atomics | cuda.atomicAdd(ptr, val), atomicCAS, atomicExch, atomicMin/Max, atomicAnd/Or/Xor, atomicInc/Dec | atomicAdd(), atomicCAS(), etc. |
| Warp Shuffle | cuda.__shfl_sync(mask, val, src, w), __shfl_down_sync, __shfl_up_sync, __shfl_xor_sync | __shfl_sync(), __shfl_down_sync(), etc. |
| Warp Vote | cuda.__ballot_sync(mask, pred), __all_sync, __any_sync, __activemask() | __ballot_sync(), __all_sync(), etc. |
| Warp Reduce (sm_80+) | cuda.__reduce_add_sync(mask, val), __reduce_min/max/and/or/xor_sync | __reduce_add_sync(), etc. |
| Fast Math | cuda.__sinf(x), __cosf, __expf, __logf, __log2f, rsqrtf, sqrtf, __fmaf_rn, __powf | __sinf(), __cosf(), etc. |
| Integer | cuda.__clz(x), __popc, __brev, __ffs, __byte_perm, __dp4a | __clz(), __popc(), etc. |
| Cache Hints | cuda.__ldg(ptr), __ldca, __ldcs, __ldcg, __stcg, __stcs, __stwb | __ldg(), etc. |
| Shared Memory | cuda.shared_mem.SharedArray(f32, 256), .dynamicShared(f32), .reduceSum(...) | __shared__ float tile[256], extern __shared__ |
| Tensor Cores (sm_70+) | cuda.tensor_core.wmma_mma_f16_f32(a, b, c), mma_f16_f32, mma_bf16_f32, mma_tf32_f32 | wmma::mma_sync(), mma PTX |
| Debug | cuda.debug.assertf(cond), .assertInBounds(i, n), .CycleTimer, ErrorFlag | assert(), __trap() |
| Shared Types | cuda.shared.Vec2/Vec3/Vec4, .Matrix3x3, .Matrix4x4 | float2/float3/float4 |
| Clock | cuda.clock(), cuda.clock64(), cuda.globaltimer() | clock(), clock64() |
| Misc | cuda.warpSize (32), cuda.FULL_MASK, cuda.SM, cuda.__nanosleep(ns) | warpSize, __nanosleep() |

Kernel examples:

const cuda = @import("zcuda_kernel");

// ── Vector addition ──
export fn vectorAdd(A: [*]const f32, B: [*]const f32, C: [*]f32, n: u32) callconv(.kernel) void {
    const i = cuda.blockIdx().x * cuda.blockDim().x + cuda.threadIdx().x;
    if (i < n) C[i] = A[i] + B[i];
}
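
The quick-reference table maps directly onto device code. As an illustration only (a sketch assembled from the intrinsics listed above, not a kernel from the repository; exact intrinsic signatures may differ in detail), a warp-level sum reduction:

const cuda = @import("zcuda_kernel");

// ── Warp-level sum reduction (sketch) ──
export fn warpReduceSum(data: [*]const f32, out: *f32, n: u32) callconv(.kernel) void {
    const i = cuda.blockIdx().x * cuda.blockDim().x + cuda.threadIdx().x;
    var v: f32 = if (i < n) data[i] else 0.0;

    // Butterfly reduction across the 32 lanes of a warp.
    var offset: u32 = cuda.warpSize / 2;
    while (offset > 0) : (offset /= 2) {
        v += cuda.__shfl_down_sync(cuda.FULL_MASK, v, offset, cuda.warpSize);
    }

    // Lane 0 of each warp folds its partial sum into the global result.
    if (cuda.threadIdx().x % cuda.warpSize == 0) {
        _ = cuda.atomicAdd(out, v);
    }
}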

Host application

Use zcuda bindings + kernel bridge together:

// src/main.zig
const std = @import("std");
const cuda = @import("zcuda");          // zcuda binding API
const my_kernel = @import("my_kernel"); // type-safe kernel bridge

pub fn main() !void {
    // ── Driver API (from zcuda bindings) ──
    var ctx = try cuda.driver.CudaContext.new(0);
    defer ctx.deinit();
    var stream = try ctx.newStream();
    defer stream.deinit();

    // ── Allocate GPU memory ──
    const n: u32 = 1024;
    var d_a = try stream.alloc(f32, n);
    defer d_a.deinit();
    var d_b = try stream.alloc(f32, n);
    defer d_b.deinit();
    var d_c = try stream.alloc(f32, n);
    defer d_c.deinit();

    // ── Load kernel (auto-detects embedded PTX vs disk file) ──
    const module = try my_kernel.load(ctx);
    defer module.deinit();

    // Function names are compile-time enums — typos cause build errors!
    const func = try my_kernel.getFunction(module, .myAdd);

    // ── Launch kernel ──
    try stream.launch(func,
        cuda.LaunchConfig.forNumElems(n),
        .{ d_a.ptr, d_b.ptr, d_c.ptr, n },
    );
    try stream.synchronize();

    // ── Read back results ──
    var result: [1024]f32 = undefined;
    try stream.memcpyDtoH(f32, &result, d_c);
}

Using CUDA C++ kernels (.cu files)

zCUDA fully supports existing CUDA C++ kernels — you don't need to rewrite anything in Zig. There are three ways to integrate C++ kernels:

Way 1 — NVRTC JIT compilation (inline C++ source):

Embed CUDA C++ source as a string in Zig and compile at runtime via NVRTC. Best for small kernels or rapid prototyping:

const std = @import("std");
const cuda = @import("zcuda");

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    var ctx = try cuda.driver.CudaContext.new(0);
    defer ctx.deinit();
    const stream = ctx.defaultStream();

    // ── Existing CUDA C++ kernel, used as-is ──
    const cuda_cpp_source =
        \\extern "C" __global__ void saxpy(float a, float *x, float *y, int n) {
        \\    int i = blockIdx.x * blockDim.x + threadIdx.x;
        \\    if (i < n) y[i] = a * x[i] + y[i];
        \\}
    ;

    // JIT compile C++ β†’ PTX at runtime
    const ptx = try cuda.nvrtc.compilePtx(allocator, cuda_cpp_source);
    defer allocator.free(ptx);

    const module = try ctx.loadModule(ptx);
    defer module.deinit();
    const kernel = try module.getFunction("saxpy");

    // Allocate + launch, same API as Zig kernels
    const n: u32 = 1024;
    var d_x = try stream.alloc(f32, n);
    defer d_x.deinit();
    var d_y = try stream.alloc(f32, n);
    defer d_y.deinit();

    try stream.launch(kernel, cuda.LaunchConfig.forNumElems(n),
        .{ @as(f32, 2.0), d_x.ptr, d_y.ptr, @as(i32, @intCast(n)) });
    try stream.synchronize();
}

Way 2 — NVRTC JIT compilation (read .cu file from disk):

Load an existing .cu file and JIT compile it. Perfect for reusing large, mature C++ kernel libraries:

// Read existing .cu file — no modification needed
const cu_source = @embedFile("kernels/matmul_optimized.cu");

// Or load at runtime from any path:
// const cu_source = try std.fs.cwd().readFileAlloc(allocator, "vendor/kernels/matmul.cu", 1024 * 1024);
// defer allocator.free(cu_source);

const ptx = try cuda.nvrtc.compilePtx(allocator, cu_source);
defer allocator.free(ptx);

const module = try ctx.loadModule(ptx);
defer module.deinit();
const kernel = try module.getFunction("matmul_optimized");

// Launch with the same zcuda API
try stream.launch(kernel, .{ .grid = .{ .x = grid_x, .y = grid_y }, .block = .{ .x = 16, .y = 16 } },
    .{ d_A.ptr, d_B.ptr, d_C.ptr, @as(i32, @intCast(N)) });

Way 3 — Pre-compiled PTX (offline nvcc compilation):

Use nvcc to compile .cu β†’ .ptx offline, then load the PTX in Zig. Best for production or when you need nvcc-specific flags:

# Compile with nvcc (your existing build pipeline)
nvcc -ptx -arch=sm_80 -o matmul.ptx matmul.cu

// Load pre-compiled PTX at build time (embed in binary)
const ptx = @embedFile("matmul.ptx");
const module = try ctx.loadModule(ptx);
defer module.deinit();
const kernel = try module.getFunction("matmul");

// Same launch API — it's all PTX under the hood
try stream.launch(kernel, config, .{ d_A.ptr, d_B.ptr, d_C.ptr, @as(i32, @intCast(N)) });

Tip

Migration strategy: Start by wrapping your existing C++ kernels with Way 1 or 2 — zero rewrite needed. Then gradually port performance-critical kernels to pure Zig (Step 3 above) and enjoy compile-time type safety and zig build integration. Both C++ and Zig kernels produce PTX, both use the same stream.launch() API, and both can coexist in the same project.

Step 4: Build & Run

Common commands

zig build                    # Build with defaults (driver + nvrtc + cublas + curand)
zig build run                # Build & run your application
zig build test               # Run all tests

CUDA module flags

Enable/disable CUDA library bindings at build time. These flags control which libraries get linked:

| Flag | Default | Module | Description |
| --- | --- | --- | --- |
| -Dcublas=BOOL | true | cuBLAS | BLAS Level 1/2/3 (SAXPY, SGEMM, DGEMM, etc.) |
| -Dcublaslt=BOOL | true | cuBLAS LT | Lightweight GEMM with algorithm heuristics |
| -Dcurand=BOOL | true | cuRAND | GPU random number generation |
| -Dnvrtc=BOOL | true | NVRTC | Runtime kernel compilation |
| -Dcudnn=BOOL | false | cuDNN | Convolution, activation, pooling, softmax, batch norm |
| -Dcusolver=BOOL | false | cuSOLVER | LU, QR, SVD, Cholesky, eigenvalue decomposition |
| -Dcusparse=BOOL | false | cuSPARSE | SpMV, SpMM, SpGEMM with CSR/COO formats |
| -Dcufft=BOOL | false | cuFFT | 1D/2D/3D Fast Fourier Transform |
| -Dcupti=BOOL | false | CUPTI | Profiling and tracing via CUDA Profiling Tools Interface |
| -Dcufile=BOOL | false | cuFile | GPUDirect Storage for direct GPU↔storage I/O |
| -Dnvtx=BOOL | false | NVTX | Profiling annotations for Nsight |

Driver API is always enabled (no flag needed).

zig build                                  # defaults (cublas + cublaslt + curand + nvrtc)
zig build -Dcudnn=true -Dcusolver=true     # add cuDNN + cuSOLVER
zig build -Dcublas=false                   # disable cuBLAS
zig build -Dcublas=true -Dcudnn=true -Dcufft=true  # multi-module combo

GPU kernel flags

Control kernel compilation and PTX handling:

| Flag | Default | Description |
| --- | --- | --- |
| -Dgpu-arch=ARCH | sm_80 | Target GPU architecture (e.g. sm_75, sm_89, sm_90) |
| -Dembed-ptx=BOOL | false | Embed PTX in the binary, so no .ptx files are needed at runtime |
| -Dkernel-dir=PATH | src/kernel/ | Root directory for kernel auto-discovery |

zig build compile-kernels                     # Compile all Zig kernels to PTX
zig build kernel-my_kernel                    # Compile single kernel only
zig build compile-kernels -Dgpu-arch=sm_80    # Target Ampere GPUs
zig build -Dembed-ptx=true                    # Production: PTX baked into binary
zig build compile-kernels -Dkernel-dir=my/kernels/  # Custom Zig kernels location

PTX output: zig-out/bin/kernel/*.ptx
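
In disk mode the generated bridge's load() locates these files automatically. If you need to load one by hand, the same pattern as the .cu examples applies (a sketch; the path is illustrative):

// Load a compile-kernels artifact manually (path illustrative)
const ptx = try std.fs.cwd().readFileAlloc(allocator, "zig-out/bin/kernel/my_kernel.ptx", 1024 * 1024);
defer allocator.free(ptx);

const module = try ctx.loadModule(ptx);
defer module.deinit();
const kernel = try module.getFunction("myAdd");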

Path overrides

| Flag | Default | Description |
| --- | --- | --- |
| -Dcuda-path=PATH | auto-detect | Path to CUDA toolkit installation |
| -Dkernel-dir=PATH | src/kernel/ | Root directory for kernel auto-discovery |

💡 Combining flags: Module flags and kernel flags are orthogonal — use them together freely:

zig build compile-kernels -Dgpu-arch=sm_80 -Dcublas=true -Dcufft=true -Dembed-ptx=true

Examples

Host Examples (58 files, 74 build targets)

Working host-side examples in the examples/ directory. See examples/README.md for the full categorized index.

# Build and run
zig build run-basics-vector_add
zig build run-cublas-gemm
zig build run-cusolver-gesvd -Dcusolver=true
zig build run-cudnn-conv2d -Dcudnn=true
zig build run-cufft-fft_2d -Dcufft=true

| Category | Count | Examples | What You'll Learn |
| --- | --- | --- | --- |
| Basics | 16 | vector_add, streams, device_info, alloc_patterns, async_memcpy, pinned_memory, unified_memory, … | Contexts, streams, events, kernels, multi-GPU, memory patterns |
| cuBLAS | 19 | gemm, axpy, trsm, cosine_similarity, gemm_batched, gemm_ex, dgmm, … | L1/L2/L3 BLAS, batched GEMM, mixed-precision |
| cuDNN | 3 | conv2d, activation, pooling_softmax | Neural network primitives |
| cuFFT | 4 | fft_1d_c2c, fft_2d, fft_3d, fft_1d_r2c | 1D/2D/3D FFT, filtering |
| cuRAND | 3 | distributions, generators, monte_carlo_pi | RNG types, Monte Carlo |
| cuSOLVER | 5 | getrf, gesvd, potrf, syevd, geqrf | LU, SVD, Cholesky, QR, eigensolve |
| cuSPARSE | 4 | spmv_csr, spmm_csr, spmv_coo, spgemm | CSR/COO SpMV, SpMM, SpGEMM |
| cuBLAS LT | 1 | lt_sgemm | GEMM with algorithm heuristics |
| NVRTC | 2 | jit_compile, template_kernel | Runtime compilation |
| NVTX | 1 | profiling | Nsight annotations |

Kernel Examples (80 files, 11 categories)

GPU kernel source files in examples/kernel/, organized by difficulty and feature:

| Category | Count | Kernels | Features Demonstrated |
| --- | --- | --- | --- |
| 0_Basic | 8 | vector_add, saxpy, grid_stride, dot_product, relu, scale_bias, residual_norm, vec3_normalize | Thread indexing, grid-stride loops, elementwise ops |
| 1_Reduction | 5 | reduce_sum, reduce_warp, reduce_multiblock, prefix_sum, scalar_product | Parallel reduction patterns |
| 2_Matrix | 6 | matmul_naive, matmul_tiled, matvec, transpose, extract_diag, pad_2d | Matrix operations, tiling, 2D indexing |
| 3_Atomics | 5 | histogram, histogram_256bin, atomic_ops, system_atomics, warp_aggregated_atomics | Atomic operations, histogramming |
| 4_SharedMemory | 3 | shared_mem_demo, dynamic_smem, stencil_1d | Static/dynamic shared memory, stencils |
| 5_Warp | 5 | warp shuffle, vote, reduce, ballot, cooperative | Warp-level primitives |
| 6_MathAndTypes | 8 | fast_math, half_precision, integer_intrinsics, type_conversion, sigmoid, signal_gen, complex_mul, freq_filter | Math intrinsics, type conversion, signal processing |
| 7_Debug | 2 | debug kernels | Debug assertions, error flags |
| 8_TensorCore | 6 | WMMA GEMM, fragment ops | Tensor core operations (sm_70+) |
| 9_Advanced | 8 | softmax, async_copy, cooperative_groups, thread_fence, gbm_paths, particle_init/step, intrinsics_coverage | Advanced patterns, simulation |
| 10_Integration | 24 | Driver lifecycle, streams, CUDA graphs, cuBLAS pipelines, cuFFT pipelines, cuRAND apps, end-to-end, tensor core pipelines, perf benchmark (Zig vs cuBLAS) | Multi-library integration, end-to-end workflows, performance comparison |

Testing

zig build test                               # All tests (unit + integration)
zig build test-unit                          # Unit tests only
zig build test-integration                   # Integration tests only

# Enable all optional module tests
zig build test -Dcudnn=true -Dcusolver=true -Dcusparse=true -Dcufft=true -Dnvtx=true

Test Summary

| Category | Files | Tests | Requires GPU |
| --- | --- | --- | --- |
| Unit — kernel types (pure Zig) | 10 | 222 | ❌ |
| Unit — core (driver, runtime, nvrtc) | 4 | — | ✅ |
| Unit — conditional (cublas, cudnn, etc.) | 8 | — | ✅ |
| Integration — core | 10 | — | ✅ |
| Integration — kernel GPU | 7 | — | ✅ |
| Total | 39 | 222+ | |

Note

222 pure-Zig tests verify kernel DSL types, compile-time logic, and memory layouts — they run on any machine without CUDA. GPU-dependent tests require CUDA libraries and will fail to link on macOS or systems without CUDA installed.

Test Architecture

The 10 kernel type test files (test/unit/kernel/) test pure Zig data structures that don't call any CUDA APIs:

  • kernel_shared_types_test — Vec2/3/4, Matrix, LaunchConfig struct layout
  • kernel_types_test — SharedMemory type tag validation
  • kernel_arch_test — SM architecture enum & feature tables
  • kernel_device_types_test — DevicePtr, GridStrideIterator struct verification
  • kernel_debug_test — Printf format string compile-time logic
  • kernel_shared_mem_test — SharedMemory compile-time meta-info
  • kernel_intrinsics_host_test — Intrinsic function signatures & type inference
  • kernel_tensor_core_host_test — WMMA fragment type definitions
  • kernel_grid_stride_test — GridStrideIterator field validation
  • kernel_device_test — Device pointer load/store patterns
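
A hedged sketch of the style of these tests (not copied from the repository); it assumes LaunchConfig exposes grid and block fields, matching the launch-config literal used in the C++ kernel section above:

const std = @import("std");
const cuda = @import("zcuda");

test "forNumElems launches enough threads" {
    const cfg = cuda.LaunchConfig.forNumElems(1000);
    // Assumed fields: cfg.grid.x and cfg.block.x, per the .{ .grid = ..., .block = ... } literal.
    try std.testing.expect(cfg.grid.x * cfg.block.x >= 1000);
}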

Architecture

Each module follows a consistent three-layer design:

┌──────────────────────────────────────────────┐
│  Safe Layer (safe.zig)                       │  ← Recommended API
│  Type-safe abstractions, RAII, Zig idioms    │
├──────────────────────────────────────────────┤
│  Result Layer (result.zig)                   │  ← Error wrapping
│  C error codes → Zig error unions            │
├──────────────────────────────────────────────┤
│  Sys Layer (sys.zig)                         │  ← Raw FFI
│  Direct @cImport of C headers                │
└──────────────────────────────────────────────┘
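
A hedged sketch of how a single driver call flows through the three layers; the declarations below are illustrative, not the repository's actual code:

// sys.zig — raw FFI, exactly as the C header declares it.
pub extern fn cuMemAlloc_v2(dptr: *usize, bytesize: usize) c_int;

// result.zig — translate the C status code into a Zig error union.
pub fn memAlloc(bytesize: usize) !usize {
    var dptr: usize = 0;
    const status = cuMemAlloc_v2(&dptr, bytesize);
    if (status != 0) return error.CudaError; // illustrative error name
    return dptr;
}

// safe.zig — ergonomic, typed API built on the result layer, e.g. the
// stream.alloc(f32, n) used throughout this README, whose returned
// handle frees itself via deinit().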

Project Structure

zcuda/
├── src/                           # Zig API layer (11 modules)
│   ├── cuda.zig                   # Root module — re-exports all modules
│   ├── types.zig                  # Shared types (Dim3, LaunchConfig, DevicePtr)
│   ├── driver/                    # CUDA Driver API (sys, result, safe)
│   ├── nvrtc/                     # NVRTC (runtime compilation)
│   ├── kernel/                    # GPU Kernel DSL (pure Zig → PTX, no CUDA C++ needed)
│   │   ├── device.zig             #   Module root (re-exports all sub-modules)
│   │   ├── types.zig              #   DeviceSlice(T), DevicePtr(T), GridStrideIterator
│   │   ├── shared_types.zig       #   Vec2/3/4, Int2/3, Matrix3x3/4x4, LaunchConfig
│   │   ├── arch.zig               #   SmVersion enum, requireSM comptime guard
│   │   ├── intrinsics.zig         #   98 inline fns: threadIdx, atomics, warp, math, cache
│   │   ├── shared_mem.zig         #   SharedArray(T,N), dynamicShared, cooperative utils
│   │   ├── tensor_core.zig        #   56 inline fns: WMMA/MMA/wgmma/TMA/cluster/tcgen05
│   │   ├── bridge_gen.zig         #   Type-safe kernel bridge (Fn enum, load, getFunction)
│   │   └── debug.zig              #   assertf, ErrorFlag, printf, CycleTimer, __trap
│   └── ...                        # 8 more binding modules (cublas, cudnn, ...)
├── examples/                      # 58 host-side examples
│   ├── basics/                    #   16 fundamentals (vector_add, streams, ...)
│   ├── cublas/                    #   19 BLAS examples
│   ├── cudnn/                     #   3 neural network examples
│   ├── cufft/                     #   4 FFT examples
│   ├── curand/                    #   3 RNG examples
│   ├── cusolver/                  #   5 linear algebra examples
│   ├── cusparse/                  #   4 sparse matrix examples
│   ├── cublaslt/                  #   1 LT GEMM example
│   ├── nvrtc/                     #   2 JIT compilation examples
│   ├── nvtx/                      #   1 profiling example
│   └── kernel/                    #   80 GPU kernel source files
│       ├── 0_Basic/               #     8 basic kernels
│       ├── 1_Reduction/           #     5 reduction kernels
│       ├── 2_Matrix/              #     6 matrix kernels
│       ├── 3_Atomics/             #     5 atomic kernels
│       ├── 4_SharedMemory/        #     3 shared memory kernels
│       ├── 5_Warp/                #     5 warp-level kernels
│       ├── 6_MathAndTypes/        #     8 math/type kernels
│       ├── 7_Debug/               #     2 debug kernels
│       ├── 8_TensorCore/          #     6 tensor core kernels
│       ├── 9_Advanced/            #     8 advanced kernels
│       └── 10_Integration/        #     24 integration host examples
├── test/                          # 39 test files
│   ├── unit/                      #   12 core unit tests
│   │   └── kernel/                #   10 kernel type tests (pure Zig, no GPU)
│   └── integration/               #   10 core integration tests
│       └── kernel/                #   7 kernel GPU integration tests
├── docs/                          # Comprehensive API documentation
├── build.zig                      # Build configuration + pub const build_helpers
├── build_helpers.zig              # Kernel discovery, PTX compilation, bridge generation
├── build.zig.zon                  # Package manifest
└── BUG_TRACKER.md                 # Known issues and fixes

Users should only use the Safe Layer. The result and sys layers are implementation details — all public types and functions are re-exported from each module's top-level file.

Build System

The build system (build.zig + build_helpers.zig) provides:

  • Auto-detection: CUDA installation path discovery across common locations
  • Modular linking: Only link libraries you enable via -D flags
  • Kernel pipeline: discoverKernels() → content-based scan → addBridgeModules() → Zig→PTX compilation + type-safe bridge generation
  • Downstream export: pub const build_helpers = @import("build_helpers.zig") — downstream packages access via @import("zcuda").build_helpers
  • Configurable paths: -Dkernel-dir, -Dcuda-path, -Dgpu-arch, -Dembed-ptx

Documentation

Comprehensive documentation is available in the docs/ directory: each module has its own detailed README in docs/<module>/README.md.

Contributing

  1. ⭐ Star and Fork this repository
  2. Create a feature branch (git checkout -b feature/new-module)
  3. Implement sys/result/safe layers in src/<module>/
  4. Add unit tests in test/unit/ and integration tests in test/integration/
  5. Create a host example in examples/<module>/
  6. Add kernel examples in examples/kernel/<category>/
  7. Update documentation in docs/<module>/
  8. Submit a Pull Request

License

MIT License

Acknowledgments

Built with gratitude on the shoulders of giants:

  • Zig — A modern systems programming language focused on safety, performance, and simplicity, created by Andrew Kelley and the Zig Software Foundation.

  • CUDA — NVIDIA's parallel computing platform and API, providing the underlying runtime, compiler, and libraries.

  • cudarc — A safe Rust wrapper for CUDA whose three-layer architecture (sys → result → safe) served as the foundational reference for this project.
