# zCUDA Examples

A comprehensive collection of 162 examples demonstrating every module and capability of the zCUDA library. All host examples use the safe API layer exclusively.

## Overview

| Category | Examples | Notes |
|---|---:|---|
| Basics (Driver API) | 16 | Contexts, streams, memory, events |
| cuBLAS | 19 | BLAS L1/L2/L3 + mixed-precision |
| cuBLAS LT | 1 | Lightweight BLAS with heuristics |
| cuDNN | 3 | Conv, activation, pooling |
| cuFFT | 4 | 1D/2D/3D complex and real FFTs |
| cuRAND | 3 | GPU random number generation |
| cuSOLVER | 5 | LU, QR, Cholesky, SVD, eigenvalue |
| cuSPARSE | 4 | SpMV (CSR/COO), SpMM, SpGEMM |
| NVRTC | 2 | JIT kernel compilation |
| NVTX | 1 | Profiling annotations |
| Kernel DSL — all categories | 80 | Pure-Zig GPU kernels (11 categories) |
| Integration | 24 | End-to-end pipelines and benchmarks |
| **Total** | **162** | |

## Building & Running

```sh
# Run a host example (run-<category>-<name>)
zig build run-basics-vector_add
zig build run-cublas-gemm -Dcublas=true
zig build run-cudnn-conv2d -Dcudnn=true

# Build a specific kernel example (example-kernel-<cat>-<name>)
zig build example-kernel-0-basic-kernel_vector_add -Dgpu-arch=sm_86

# Build all integration examples at once
zig build example-integration -Dcublas=true -Dcufft=true ...

# Run an integration binary
zig-out/bin/integration-<name>
```

Library flags: `-Dcublas=true`, `-Dcublaslt=true`, `-Dcudnn=true`, `-Dcufft=true`, `-Dcurand=true`, `-Dcusolver=true`, `-Dcusparse=true`, `-Dnvtx=true`.
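For orientation, the host side of most examples follows the same open → allocate → copy → launch → copy-back shape. The sketch below is illustrative only: names such as `Context.init`, `createStream`, and `alloc` are stand-ins for whatever the safe API layer actually exposes (only `memcpyDtoH` and `synchronize` are mentioned elsewhere in this README):

```zig
const zcuda = @import("zcuda");

pub fn main() !void {
    // Bind device 0 and create a stream (hypothetical safe-API names).
    var ctx = try zcuda.Context.init(0);
    defer ctx.deinit();
    var stream = try ctx.createStream();
    defer stream.deinit();

    // Allocate device memory, copy input up, launch, copy result back.
    const host_in = [_]f32{ 1, 2, 3, 4 };
    var dev = try ctx.alloc(f32, host_in.len);
    defer dev.deinit();
    try stream.memcpyHtoD(f32, dev, &host_in);
    // ... launch a kernel on `stream` ...
    var host_out: [4]f32 = undefined;
    try stream.memcpyDtoH(f32, &host_out, dev);
    try ctx.synchronize();
}
```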


## Basics (Driver API)

Core GPU programming: contexts, streams, memory management, events, kernels.

| Example | Description | Run Command |
|---|---|---|
| `vector_add` | Vector addition via JIT kernel | `run-basics-vector_add` |
| `device_info` | GPU specs: memory, compute, features | `run-basics-device_info` |
| `event_timing` | Event-based timing & bandwidth measurement | `run-basics-event_timing` |
| `streams` | Multi-stream concurrent execution | `run-basics-streams` |
| `peer_to_peer` | Multi-GPU peer access and cross-device copy | `run-basics-peer_to_peer` |
| `constant_memory` | GPU constant memory for polynomial eval | `run-basics-constant_memory` |
| `struct_kernel` | Pass Zig extern struct to GPU kernel | `run-basics-struct_kernel` |
| `kernel_attributes` | Query kernel registers, shared mem, occupancy | `run-basics-kernel_attributes` |
| `alloc_patterns` | Device, host, pinned, and unified allocation | `run-basics-alloc_patterns` |
| `async_memcpy` | Async H2D/D2H transfers with streams | `run-basics-async_memcpy` |
| `pinned_memory` | Pinned (page-locked) memory for faster transfers | `run-basics-pinned_memory` |
| `unified_memory` | Unified memory (UM) migration and access | `run-basics-unified_memory` |
| `context_lifecycle` | Context creation, binding, and destruction | `run-basics-context_lifecycle` |
| `dtod_copy_chain` | Device-to-device chained copy pipeline | `run-basics-dtod_copy_chain` |
| `memset_patterns` | Device memset patterns and initialization | `run-basics-memset_patterns` |
| `multi_device_query` | Enumerate and query all CUDA devices | `run-basics-multi_device_query` |

## cuBLAS

BLAS Level 1, 2, and 3 operations. Enable with `-Dcublas=true`.

### Level 1 — Vector–Vector Operations

| Example | Description | Run Command |
|---|---|---|
| `axpy` | SAXPY: y = α·x + y | `run-cublas-axpy -Dcublas=true` |
| `dot` | Dot product | `run-cublas-dot -Dcublas=true` |
| `nrm2_asum` | L1 and L2 vector norms | `run-cublas-nrm2_asum -Dcublas=true` |
| `scal` | Vector scaling: x = α·x | `run-cublas-scal -Dcublas=true` |
| `amax_amin` | Index of max/min absolute value | `run-cublas-amax_amin -Dcublas=true` |
| `swap_copy` | Vector swap and copy | `run-cublas-swap_copy -Dcublas=true` |
| `rot` | Givens rotation | `run-cublas-rot -Dcublas=true` |
| `cosine_similarity` | Cosine similarity via L1 ops | `run-cublas-cosine_similarity -Dcublas=true` |

### Level 2 — Matrix–Vector Operations

| Example | Description | Run Command |
|---|---|---|
| `gemv` | Matrix-vector multiply (SGEMV) | `run-cublas-gemv -Dcublas=true` |
| `symv_syr` | Symmetric matrix-vector ops | `run-cublas-symv_syr -Dcublas=true` |
| `trmv_trsv` | Triangular multiply and solve | `run-cublas-trmv_trsv -Dcublas=true` |

### Level 3 — Matrix–Matrix Operations

| Example | Description | Run Command |
|---|---|---|
| `gemm` | Matrix-matrix multiply (SGEMM) | `run-cublas-gemm -Dcublas=true` |
| `gemm_batched` | Strided batched GEMM | `run-cublas-gemm_batched -Dcublas=true` |
| `gemm_ex` | Mixed-precision GemmEx | `run-cublas-gemm_ex -Dcublas=true` |
| `symm` | Symmetric matrix multiply | `run-cublas-symm -Dcublas=true` |
| `trsm` | Triangular solve (STRSM) | `run-cublas-trsm -Dcublas=true` |
| `syrk` | Symmetric rank-k update | `run-cublas-syrk -Dcublas=true` |
| `geam` | Matrix add / transpose | `run-cublas-geam -Dcublas=true` |
| `dgmm` | Diagonal matrix multiply | `run-cublas-dgmm -Dcublas=true` |

## cuBLAS LT

Advanced GEMM with algorithm heuristics and mixed-precision support. Enable with `-Dcublaslt=true`.

| Example | Description | Run Command |
|---|---|---|
| `lt_sgemm` | SGEMM with heuristic algorithm selection | `run-cublaslt-lt_sgemm -Dcublaslt=true` |

## cuDNN

Neural network primitives: convolution, activation, pooling, softmax. Enable with `-Dcudnn=true`.

| Example | Description | Run Command |
|---|---|---|
| `activation` | ReLU, sigmoid, tanh activation functions | `run-cudnn-activation -Dcudnn=true` |
| `pooling_softmax` | Max pooling + softmax pipeline | `run-cudnn-pooling_softmax -Dcudnn=true` |
| `conv2d` | 2D convolution forward pass | `run-cudnn-conv2d -Dcudnn=true` |

## cuFFT

1D, 2D, and 3D FFTs with complex and real data. Enable with `-Dcufft=true`.

| Example | Description | Run Command |
|---|---|---|
| `fft_1d_c2c` | 1D complex-to-complex FFT | `run-cufft-fft_1d_c2c -Dcufft=true` |
| `fft_1d_r2c` | 1D real-to-complex with frequency filtering | `run-cufft-fft_1d_r2c -Dcufft=true` |
| `fft_2d` | 2D complex FFT | `run-cufft-fft_2d -Dcufft=true` |
| `fft_3d` | 3D complex FFT | `run-cufft-fft_3d -Dcufft=true` |

## cuRAND

GPU-accelerated random number generation. Enable with `-Dcurand=true`.

| Example | Description | Run Command |
|---|---|---|
| `distributions` | Uniform, normal, Poisson distributions | `run-curand-distributions -Dcurand=true` |
| `generators` | Generator comparison (XORWOW, MRG32k3a, …) | `run-curand-generators -Dcurand=true` |
| `monte_carlo_pi` | Monte Carlo π estimation | `run-curand-monte_carlo_pi -Dcurand=true` |

## cuSOLVER

LU, QR, Cholesky, SVD, and eigenvalue decomposition. Enable with `-Dcusolver=true`.

> **Note:** `devInfo` is a GPU-side pointer (`CudaSlice(i32)`) per the cuSOLVER API contract. Use `stream.memcpyDtoH` after `ctx.synchronize()` to read it on the host.
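A minimal sketch of that readback pattern (handle setup elided; `ctx.alloc` and the `solver.getrf` signature are illustrative names, not the actual zCUDA API):

```zig
// devInfo lives on the GPU, per the cuSOLVER contract.
var dev_info = try ctx.alloc(i32, 1); // CudaSlice(i32)
defer dev_info.deinit();

// Factorization writes its status into dev_info on the device.
try solver.getrf(m, n, a_dev, lda, workspace, ipiv, dev_info); // illustrative signature
try ctx.synchronize();

// Only after synchronizing is it safe to read the status on the host.
var info: [1]i32 = undefined;
try stream.memcpyDtoH(i32, &info, dev_info);
if (info[0] != 0) return error.FactorizationFailed; // 0 = success; >0 = singular pivot
```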

| Example | Description | Run Command |
|---|---|---|
| `getrf` | LU factorization (PA = LU) + linear solve | `run-cusolver-getrf -Dcusolver=true` |
| `gesvd` | Singular value decomposition (A = UΣVᵀ) | `run-cusolver-gesvd -Dcusolver=true` |
| `potrf` | Cholesky factorization (A = LLᵀ) + solve | `run-cusolver-potrf -Dcusolver=true` |
| `syevd` | Symmetric eigenvalue decomposition | `run-cusolver-syevd -Dcusolver=true` |
| `geqrf` | QR factorization + Q extraction | `run-cusolver-geqrf -Dcusolver=true` |

## cuSPARSE

Sparse matrix operations with CSR and COO formats, including SpMV, SpMM, and SpGEMM. Enable with `-Dcusparse=true`.

| Example | Description | Run Command |
|---|---|---|
| `spmv_csr` | Sparse matrix-vector multiply (CSR) | `run-cusparse-spmv_csr -Dcusparse=true` |
| `spmv_coo` | Sparse matrix-vector multiply (COO) | `run-cusparse-spmv_coo -Dcusparse=true` |
| `spmm_csr` | Sparse × dense matrix multiply | `run-cusparse-spmm_csr -Dcusparse=true` |
| `spgemm` | Sparse × sparse matrix multiply | `run-cusparse-spgemm -Dcusparse=true` |

## NVRTC

Just-in-time CUDA kernel compilation.

| Example | Description | Run Command |
|---|---|---|
| `jit_compile` | Runtime CUDA C++ → PTX compilation | `run-nvrtc-jit_compile` |
| `template_kernel` | Multi-kernel pipeline with templated types | `run-nvrtc-template_kernel` |
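Under the hood these examples rely on the NVRTC C API: create a program from a CUDA C++ source string, compile it, and fetch the resulting PTX. The sketch below calls that C API directly from Zig via `@cImport`; zCUDA's own wrapper names will differ, and the helper `compileToPtx` is hypothetical:

```zig
const std = @import("std");
const c = @cImport(@cInclude("nvrtc.h"));

fn compileToPtx(allocator: std.mem.Allocator, src: [*:0]const u8) ![]u8 {
    // Create the program from a CUDA C++ source string.
    var prog: c.nvrtcProgram = undefined;
    if (c.nvrtcCreateProgram(&prog, src, "kernel.cu", 0, null, null) != c.NVRTC_SUCCESS)
        return error.NvrtcCreateFailed;
    defer _ = c.nvrtcDestroyProgram(&prog);

    // Compile for a specific virtual architecture.
    var opts = [_][*c]const u8{"--gpu-architecture=compute_86"};
    if (c.nvrtcCompileProgram(prog, opts.len, &opts) != c.NVRTC_SUCCESS)
        return error.NvrtcCompileFailed;

    // Fetch the generated PTX (ready for module load + kernel launch).
    var size: usize = 0;
    _ = c.nvrtcGetPTXSize(prog, &size);
    const ptx = try allocator.alloc(u8, size);
    _ = c.nvrtcGetPTX(prog, ptx.ptr);
    return ptx;
}
```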

## NVTX

Nsight-compatible range markers.

| Example | Description | Run Command |
|---|---|---|
| `profiling` | Range push/pop and point mark annotations | `run-nvtx-profiling -Dnvtx=true` |

## Kernel DSL — Pure-Zig GPU Kernels (80 examples)

All kernels are written in pure Zig and compiled to PTX via Zig's built-in LLVM NVPTX backend. See `kernel/README.md` for the full index.
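For a flavor of the style (this is not taken from the examples), a bounds-checked SAXPY kernel in pure Zig might look roughly like the following. The GPU index builtins and the kernel calling convention shown here vary between Zig releases, so treat this strictly as a sketch:

```zig
// Hypothetical pure-Zig kernel for an nvptx64-cuda target.
// @workGroupId / @workGroupSize / @workItemId are Zig's GPU index builtins
// (the CUDA equivalents of blockIdx, blockDim, and threadIdx);
// exact names and signatures differ across Zig versions.
export fn saxpy(n: u32, a: f32, x: [*]const f32, y: [*]f32) callconv(.Kernel) void {
    const i = @workGroupId(0) * @workGroupSize(0) + @workItemId(0);
    if (i < n) y[i] = a * x[i] + y[i]; // one element per thread
}
```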

| Category | Examples | Topics |
|---|---:|---|
| 0_Basic | 8 | SAXPY, RELU, dot, grid-stride, normalize |
| 1_Reduction | 5 | Warp reduce, multi-block, prefix sum, scalar product |
| 2_Matrix | 6 | Naive & tiled matmul, matvec, transpose, pad, diag |
| 3_Atomics | 5 | Atomic ops, histograms, warp-aggregated atomics |
| 4_SharedMemory | 3 | Dynamic SMEM, 1D stencil, shared mem demo |
| 5_Warp | 5 | Ballot, broadcast, match, reduce, scan |
| 6_MathAndTypes | 9 | FP16, complex, FFT filter, fast math, type conversion |
| 7_Debug | 2 | Error checking, printf debug from GPU |
| 8_TensorCore | 11 | WMMA (f16/bf16/int8/tf32), MMA (f16/fp8) |
| 9_Advanced | 8 | Async copy pipeline, cooperative groups, softmax |
| 10_Integration | 24 | End-to-end pipelines and benchmarks |

## Integration Examples (24 examples)

End-to-end integration examples using Zig kernels with CUDA libraries.

```sh
# Build all integration examples
zig build example-integration -Dgpu-arch=sm_86 -Dcublas=true -Dcufft=true ...

# Run a specific binary
./zig-out/bin/integration-<name>
```
| Binary | Description |
|---|---|
| `integration-module-load-launch` | Driver lifecycle: PTX load + kernel launch |
| `integration-ptx-compile-execute` | NVRTC compile + execute pipeline |
| `integration-stream-callback` | Stream callback pattern (event-driven) |
| `integration-stream-concurrency` | Multi-stream concurrent execution |
| `integration-basic-graph` | CUDA Graph basics: capture and replay |
| `integration-graph-replay-update` | Graph replay with node update |
| `integration-graph-with-deps` | Graph with explicit dependencies |
| `integration-scale-bias-gemm` | cuBLAS Scale+Bias→GEMM→ReLU pipeline |
| `integration-residual-gemm` | Residual connection with GEMM |
| `integration-error-recovery` | CUDA error recovery patterns |
| `integration-oob-launch` | Out-of-bounds launch detection |
| `integration-fft-filter` | FFT-based filter pipeline |
| `integration-conv2d-fft` | 2D convolution via FFT |
| `integration-occupancy-calc` | Occupancy calculator utilities |
| `integration-monte-carlo-option` | Monte Carlo option pricing (GPU) |
| `integration-particle-system` | Particle system simulation |
| `integration-matmul-e2e` | Matrix multiply end-to-end |
| `integration-reduction-e2e` | Reduction end-to-end |
| `integration-saxpy-e2e` | SAXPY end-to-end |
| `integration-multi-library` | Multi-library pipeline (cuBLAS + cuDNN + cuFFT) |
| `integration-wmma-gemm-verify` | WMMA GEMM correctness verification |
| `integration-attention-pipeline` | Attention pipeline (QKᵀ, softmax, V) |
| `integration-mixed-precision-train` | Mixed-precision training pipeline (FP16+TF32) |
| `integration-perf-benchmark` | Zig kernel vs cuBLAS (event-timed benchmark) |