A comprehensive collection of 162 examples demonstrating every module and capability of the zCUDA library. All host examples use the safe API layer exclusively.
| Category | Examples | Notes |
|---|---|---|
| Basics (Driver API) | 16 | Contexts, streams, memory, events |
| cuBLAS | 19 | BLAS L1/L2/L3 + mixed-precision |
| cuBLAS LT | 1 | Lightweight BLAS with heuristics |
| cuDNN | 3 | Conv, activation, pooling |
| cuFFT | 4 | 1D/2D/3D complex and real FFTs |
| cuRAND | 3 | GPU random number generation |
| cuSOLVER | 5 | LU, QR, Cholesky, SVD, eigenvalue |
| cuSPARSE | 4 | SpMV (CSR/COO), SpMM, SpGEMM |
| NVRTC | 2 | JIT kernel compilation |
| NVTX | 1 | Profiling annotations |
| Kernel DSL — all categories | 80 | Pure-Zig GPU kernels (11 categories) |
| Integration | 24 | End-to-end pipelines and benchmarks |
| Total | 162 | — |
```sh
# Run a host example (run-<category>-<name>)
zig build run-basics-vector_add
zig build run-cublas-gemm -Dcublas=true
zig build run-cudnn-conv2d -Dcudnn=true

# Build a specific kernel example (example-kernel-<cat>-<name>)
zig build example-kernel-0-basic-kernel_vector_add -Dgpu-arch=sm_86

# Build all integration examples at once
zig build example-integration -Dcublas=true -Dcufft=true ...

# Run all integration binaries
zig-out/bin/integration-<name>
```

Library flags: `-Dcublas=true`, `-Dcublaslt=true`, `-Dcudnn=true`, `-Dcufft=true`, `-Dcurand=true`, `-Dcusolver=true`, `-Dcusparse=true`, `-Dnvtx=true`.
## Basics — CUDA Driver API (16 examples)

Core GPU programming: contexts, streams, memory management, events, kernels.
| Example | Description | Run Command |
|---|---|---|
| vector_add | Vector addition via JIT kernel | run-basics-vector_add |
| device_info | GPU specs: memory, compute, features | run-basics-device_info |
| event_timing | Event-based timing & bandwidth measurement | run-basics-event_timing |
| streams | Multi-stream concurrent execution | run-basics-streams |
| peer_to_peer | Multi-GPU peer access and cross-device copy | run-basics-peer_to_peer |
| constant_memory | GPU constant memory for polynomial eval | run-basics-constant_memory |
| struct_kernel | Pass Zig extern struct to GPU kernel | run-basics-struct_kernel |
| kernel_attributes | Query kernel registers, shared mem, occupancy | run-basics-kernel_attributes |
| alloc_patterns | Device, host, pinned, and unified allocation | run-basics-alloc_patterns |
| async_memcpy | Async H2D/D2H transfers with streams | run-basics-async_memcpy |
| pinned_memory | Pinned (page-locked) memory for faster transfers | run-basics-pinned_memory |
| unified_memory | Unified memory (UM) migration and access | run-basics-unified_memory |
| context_lifecycle | Context creation, binding, and destruction | run-basics-context_lifecycle |
| dtod_copy_chain | Device-to-device chained copy pipeline | run-basics-dtod_copy_chain |
| memset_patterns | Device memset patterns and initialization | run-basics-memset_patterns |
| multi_device_query | Enumerate and query all CUDA devices | run-basics-multi_device_query |
## cuBLAS — Dense Linear Algebra (19 examples)

BLAS Level 1, 2, and 3 operations. Enable with `-Dcublas=true`.
**Level 1 — vector ops**

| Example | Description | Run Command |
|---|---|---|
| axpy | SAXPY: y = α·x + y | run-cublas-axpy -Dcublas=true |
| dot | Dot product | run-cublas-dot -Dcublas=true |
| nrm2_asum | L1 and L2 vector norms | run-cublas-nrm2_asum -Dcublas=true |
| scal | Vector scaling: x = α·x | run-cublas-scal -Dcublas=true |
| amax_amin | Index of max/min absolute value | run-cublas-amax_amin -Dcublas=true |
| swap_copy | Vector swap and copy | run-cublas-swap_copy -Dcublas=true |
| rot | Givens rotation | run-cublas-rot -Dcublas=true |
| cosine_similarity | Cosine similarity via L1 ops | run-cublas-cosine_similarity -Dcublas=true |
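The cosine_similarity example chains exactly these Level-1 primitives (dot and nrm2). As an illustrative CPU reference (plain Python, not part of this repo), the composition looks like:

```python
import math

def dot(x, y):
    # BLAS sdot: sum of elementwise products
    return sum(a * b for a, b in zip(x, y))

def nrm2(x):
    # BLAS snrm2: Euclidean (L2) norm
    return math.sqrt(sum(a * a for a in x))

def cosine_similarity(x, y):
    # Composed from the same Level-1 calls the GPU example chains
    return dot(x, y) / (nrm2(x) * nrm2(y))

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors -> 1.0
```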
**Level 2 — matrix-vector ops**

| Example | Description | Run Command |
|---|---|---|
| gemv | Matrix-vector multiply (SGEMV) | run-cublas-gemv -Dcublas=true |
| symv_syr | Symmetric matrix-vector ops | run-cublas-symv_syr -Dcublas=true |
| trmv_trsv | Triangular multiply and solve | run-cublas-trmv_trsv -Dcublas=true |
**Level 3 — matrix-matrix ops**

| Example | Description | Run Command |
|---|---|---|
| gemm | Matrix-matrix multiply (SGEMM) | run-cublas-gemm -Dcublas=true |
| gemm_batched | Strided batched GEMM | run-cublas-gemm_batched -Dcublas=true |
| gemm_ex | Mixed-precision GemmEx | run-cublas-gemm_ex -Dcublas=true |
| symm | Symmetric matrix multiply | run-cublas-symm -Dcublas=true |
| trsm | Triangular solve (STRSM) | run-cublas-trsm -Dcublas=true |
| syrk | Symmetric rank-k update | run-cublas-syrk -Dcublas=true |
| geam | Matrix add / transpose | run-cublas-geam -Dcublas=true |
| dgmm | Diagonal matrix multiply | run-cublas-dgmm -Dcublas=true |
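For reference, the contract that gemm (and its batched and mixed-precision variants) implements is C ← α·A·B + β·C. A plain-Python sketch of that math (illustrative only; cuBLAS itself operates on column-major, GPU-resident buffers):

```python
def gemm(alpha, A, B, beta, C):
    """C <- alpha*A@B + beta*C for row-major lists of lists
    (A is m x k, B is k x n, C is m x n). The math is identical
    to cuBLAS SGEMM; only the memory layout differs."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

C = [[0.0, 0.0], [0.0, 0.0]]
gemm(1.0, [[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]], 0.0, C)
print(C)  # [[19.0, 22.0], [43.0, 50.0]]
```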
## cuBLAS LT — Lightweight BLAS (1 example)

Advanced GEMM with algorithm heuristics and mixed-precision support. Enable with `-Dcublaslt=true`.
| Example | Description | Run Command |
|---|---|---|
| lt_sgemm | SGEMM with heuristic algorithm selection | run-cublaslt-lt_sgemm -Dcublaslt=true |
## cuDNN — Deep Neural Networks (3 examples)

Neural network primitives: convolution, activation, pooling, softmax. Enable with `-Dcudnn=true`.
| Example | Description | Run Command |
|---|---|---|
| activation | ReLU, sigmoid, tanh activation functions | run-cudnn-activation -Dcudnn=true |
| pooling_softmax | Max pooling + softmax pipeline | run-cudnn-pooling_softmax -Dcudnn=true |
| conv2d | 2D convolution forward pass | run-cudnn-conv2d -Dcudnn=true |
## cuFFT — Fast Fourier Transform (4 examples)

1D, 2D, and 3D FFTs with complex and real data. Enable with `-Dcufft=true`.
| Example | Description | Run Command |
|---|---|---|
| fft_1d_c2c | 1D complex-to-complex FFT | run-cufft-fft_1d_c2c -Dcufft=true |
| fft_1d_r2c | 1D real-to-complex with frequency filtering | run-cufft-fft_1d_r2c -Dcufft=true |
| fft_2d | 2D complex FFT | run-cufft-fft_2d -Dcufft=true |
| fft_3d | 3D complex FFT | run-cufft-fft_3d -Dcufft=true |
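For intuition, the transform cuFFT computes is the discrete Fourier transform X[k] = Σₙ x[n]·e^(−2πikn/N). A naive O(N²) plain-Python version (illustrative only; cuFFT uses fast O(N log N) algorithms on the GPU) makes the definition concrete:

```python
import cmath

def dft(x):
    # Naive DFT by definition: X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N)
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

print(dft([1.0, 0.0, 0.0, 0.0]))  # unit impulse -> flat spectrum, every bin ~ (1+0j)
```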
## cuRAND — Random Number Generation (3 examples)

GPU-accelerated random number generation. Enable with `-Dcurand=true`.
| Example | Description | Run Command |
|---|---|---|
| distributions | Uniform, normal, Poisson distributions | run-curand-distributions -Dcurand=true |
| generators | Generator comparison (XORWOW, MRG32k3a, …) | run-curand-generators -Dcurand=true |
| monte_carlo_pi | Monte Carlo π estimation | run-curand-monte_carlo_pi -Dcurand=true |
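The monte_carlo_pi example relies on a simple geometric fact: uniform points in the unit square land inside the quarter circle x² + y² ≤ 1 with probability π/4. A plain-Python CPU sketch of the same estimator (illustrative only; the GPU version draws its samples with cuRAND):

```python
import random

def estimate_pi(n, seed=0):
    # Count hits inside the quarter circle; the hit fraction approaches pi/4.
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n

print(estimate_pi(100_000))  # ~ 3.14
```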
## cuSOLVER — Dense Solvers (5 examples)

LU, QR, Cholesky, SVD, and eigenvalue decomposition. Enable with `-Dcusolver=true`.
> **Note:** `devInfo` is a GPU-side pointer (`CudaSlice(i32)`) per the cuSOLVER API contract. Use `stream.memcpyDtoH` after `ctx.synchronize()` to read it on the host.
| Example | Description | Run Command |
|---|---|---|
| getrf | LU factorization (PA = LU) + linear solve | run-cusolver-getrf -Dcusolver=true |
| gesvd | Singular value decomposition (A = UΣVᵀ) | run-cusolver-gesvd -Dcusolver=true |
| potrf | Cholesky factorization (A = LLᵀ) + solve | run-cusolver-potrf -Dcusolver=true |
| syevd | Symmetric eigenvalue decomposition | run-cusolver-syevd -Dcusolver=true |
| geqrf | QR factorization + Q extraction | run-cusolver-geqrf -Dcusolver=true |
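The potrf example factors a symmetric positive-definite matrix as A = L·Lᵀ. An illustrative plain-Python version of the same factorization (no GPU, no cuSOLVER; just the textbook algorithm):

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L @ L^T, for a symmetric
    positive-definite matrix given as row-major lists of lists."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)   # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]  # below-diagonal entry
    return L

print(cholesky([[4.0, 2.0], [2.0, 3.0]]))  # L = [[2, 0], [1, sqrt(2)]]
```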
## cuSPARSE — Sparse Linear Algebra (4 examples)

Sparse matrix operations over CSR and COO storage formats, including SpMV, SpMM, and SpGEMM. Enable with `-Dcusparse=true`.
| Example | Description | Run Command |
|---|---|---|
| spmv_csr | Sparse matrix-vector multiply (CSR) | run-cusparse-spmv_csr -Dcusparse=true |
| spmv_coo | Sparse matrix-vector multiply (COO) | run-cusparse-spmv_coo -Dcusparse=true |
| spmm_csr | Sparse × dense matrix multiply | run-cusparse-spmm_csr -Dcusparse=true |
| spgemm | Sparse × sparse matrix multiply | run-cusparse-spgemm -Dcusparse=true |
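For intuition about the CSR layout used by spmv_csr: row i's nonzeros live in `vals[row_ptr[i]:row_ptr[i+1]]`, with their column indices in `col_idx` at the same positions. A plain-Python reference (illustrative only, not part of this repo):

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        # Gather row i's nonzeros and multiply by the matching x entries
        y.append(sum(vals[j] * x[col_idx[j]]
                     for j in range(row_ptr[i], row_ptr[i + 1])))
    return y

# 3x3 matrix [[4,0,1],[0,2,0],[3,0,5]] in CSR form:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [4.0, 1.0, 2.0, 3.0, 5.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```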
## NVRTC — Runtime Compilation (2 examples)

Just-in-time CUDA kernel compilation.
| Example | Description | Run Command |
|---|---|---|
| jit_compile | Runtime CUDA C++ → PTX compilation | run-nvrtc-jit_compile |
| template_kernel | Multi-kernel pipeline with templated types | run-nvrtc-template_kernel |
## NVTX — Profiling Annotations (1 example)

Nsight-compatible range markers. Enable with `-Dnvtx=true`.
| Example | Description | Run Command |
|---|---|---|
| profiling | Range push/pop and point mark annotations | run-nvtx-profiling -Dnvtx=true |
## Kernel DSL — Pure Zig GPU Kernels (80 examples)

All kernels are written in pure Zig and compiled to PTX via Zig's built-in LLVM NVPTX backend. See kernel/README.md for the full index.
| Category | Examples | Topics |
|---|---|---|
| 0_Basic | 8 | SAXPY, ReLU, dot, grid-stride, normalize |
| 1_Reduction | 5 | Warp reduce, multi-block, prefix sum, scalar product |
| 2_Matrix | 6 | Naive & tiled matmul, matvec, transpose, pad, diag |
| 3_Atomics | 5 | Atomic ops, histograms, warp-aggregated atomics |
| 4_SharedMemory | 3 | Dynamic SMEM, 1D stencil, shared mem demo |
| 5_Warp | 5 | Ballot, broadcast, match, reduce, scan |
| 6_MathAndTypes | 9 | FP16, complex, FFT filter, fast math, type conversion |
| 7_Debug | 2 | Error checking, printf debug from GPU |
| 8_TensorCore | 11 | WMMA (f16/bf16/int8/tf32), MMA (f16/fp8) |
| 9_Advanced | 8 | Async copy pipeline, cooperative groups, softmax |
| 10_Integration | 24 | End-to-end pipelines and benchmarks |
## Integration — End-to-End Pipelines (24 examples)

End-to-end integration examples using Zig kernels with CUDA libraries.
```sh
# Build all integration examples
zig build example-integration -Dgpu-arch=sm_86 -Dcublas=true -Dcufft=true ...

# Run a specific binary
./zig-out/bin/integration-<name>
```

| Binary | Description |
|---|---|
| integration-module-load-launch | Driver lifecycle: PTX load + kernel launch |
| integration-ptx-compile-execute | NVRTC compile + execute pipeline |
| integration-stream-callback | Stream callback pattern (event-driven) |
| integration-stream-concurrency | Multi-stream concurrent execution |
| integration-basic-graph | CUDA Graph basics: capture and replay |
| integration-graph-replay-update | Graph replay with node update |
| integration-graph-with-deps | Graph with explicit dependencies |
| integration-scale-bias-gemm | cuBLAS Scale+Bias→GEMM→ReLU pipeline |
| integration-residual-gemm | Residual connection with GEMM |
| integration-error-recovery | CUDA error recovery patterns |
| integration-oob-launch | Out-of-bounds launch detection |
| integration-fft-filter | FFT-based filter pipeline |
| integration-conv2d-fft | 2D convolution via FFT |
| integration-occupancy-calc | Occupancy calculator utilities |
| integration-monte-carlo-option | Monte Carlo option pricing (GPU) |
| integration-particle-system | Particle system simulation |
| integration-matmul-e2e | Matrix multiply end-to-end |
| integration-reduction-e2e | Reduction end-to-end |
| integration-saxpy-e2e | SAXPY end-to-end |
| integration-multi-library | Multi-library pipeline (cuBLAS + cuDNN + cuFFT) |
| integration-wmma-gemm-verify | WMMA GEMM correctness verification |
| integration-attention-pipeline | Attention pipeline (QK^T, softmax, V) |
| integration-mixed-precision-train | Mixed-precision training pipeline (FP16+TF32) |
| integration-perf-benchmark | Zig kernel vs cuBLAS (event-timed benchmark) |