A comprehensive collection of 162 examples demonstrating every module and capability of the zCUDA library. All host examples use the safe API layer exclusively.
| Category | Examples | Notes |
|---|---|---|
| Basics (Driver API) | 16 | Contexts, streams, memory, events |
| cuBLAS | 19 | BLAS L1/L2/L3 + mixed-precision |
| cuBLAS LT | 1 | Lightweight BLAS with heuristics |
| cuDNN | 3 | Conv, activation, pooling |
| cuFFT | 4 | 1D/2D/3D complex and real FFTs |
| cuRAND | 3 | GPU random number generation |
| cuSOLVER | 5 | LU, QR, Cholesky, SVD, eigenvalue |
| cuSPARSE | 4 | SpMV (CSR/COO), SpMM, SpGEMM |
| NVRTC | 2 | JIT kernel compilation |
| NVTX | 1 | Profiling annotations |
| Kernel DSL — all categories | 80 | Pure-Zig GPU kernels (11 categories) |
| Integration | 24 | End-to-end pipelines and benchmarks |
| Total | 162 | — |
```sh
# Run a host example (run-<category>-<name>)
zig build run-basics-vector_add
zig build run-cublas-gemm -Dcublas=true
zig build run-cudnn-conv2d -Dcudnn=true

# Build a specific kernel example (example-kernel-<cat>-<name>)
zig build example-kernel-0-basic-kernel_vector_add -Dgpu-arch=sm_86

# Build all integration examples at once
zig build example-integration -Dcublas=true -Dcufft=true ...

# Run all integration binaries
zig-out/bin/integration-<name>
```

Library flags: `-Dcublas=true`, `-Dcublaslt=true`, `-Dcudnn=true`, `-Dcufft=true`, `-Dcurand=true`, `-Dcusolver=true`, `-Dcusparse=true`, `-Dnvtx=true`.
## Basics — CUDA Driver API (16 examples)

Core GPU programming: contexts, streams, memory management, events, kernels.
| Example | Description | Run Command |
|---|---|---|
| vector_add | Vector addition via JIT kernel | run-basics-vector_add |
| device_info | GPU specs: memory, compute, features | run-basics-device_info |
| event_timing | Event-based timing & bandwidth measurement | run-basics-event_timing |
| streams | Multi-stream concurrent execution | run-basics-streams |
| peer_to_peer | Multi-GPU peer access and cross-device copy | run-basics-peer_to_peer |
| constant_memory | GPU constant memory for polynomial eval | run-basics-constant_memory |
| struct_kernel | Pass Zig extern struct to GPU kernel | run-basics-struct_kernel |
| kernel_attributes | Query kernel registers, shared mem, occupancy | run-basics-kernel_attributes |
| alloc_patterns | Device, host, pinned, and unified allocation | run-basics-alloc_patterns |
| async_memcpy | Async H2D/D2H transfers with streams | run-basics-async_memcpy |
| pinned_memory | Pinned (page-locked) memory for faster transfers | run-basics-pinned_memory |
| unified_memory | Unified memory (UM) migration and access | run-basics-unified_memory |
| context_lifecycle | Context creation, binding, and destruction | run-basics-context_lifecycle |
| dtod_copy_chain | Device-to-device chained copy pipeline | run-basics-dtod_copy_chain |
| memset_patterns | Device memset patterns and initialization | run-basics-memset_patterns |
| multi_device_query | Enumerate and query all CUDA devices | run-basics-multi_device_query |
## cuBLAS — Dense Linear Algebra (19 examples)

BLAS Level 1, 2, and 3 operations. Enable with `-Dcublas=true`.
**Level 1 — vector ops**

| Example | Description | Run Command |
|---|---|---|
| axpy | SAXPY: y = α·x + y | run-cublas-axpy -Dcublas=true |
| dot | Dot product | run-cublas-dot -Dcublas=true |
| nrm2_asum | L1 and L2 vector norms | run-cublas-nrm2_asum -Dcublas=true |
| scal | Vector scaling: x = α·x | run-cublas-scal -Dcublas=true |
| amax_amin | Index of max/min absolute value | run-cublas-amax_amin -Dcublas=true |
| swap_copy | Vector swap and copy | run-cublas-swap_copy -Dcublas=true |
| rot | Givens rotation | run-cublas-rot -Dcublas=true |
| cosine_similarity | Cosine similarity via L1 ops | run-cublas-cosine_similarity -Dcublas=true |
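The cosine_similarity example chains exactly these Level-1 primitives (dot and nrm2). As an illustrative CPU reference (plain Python, not part of this repo), the composition looks like:

```python
import math

def dot(x, y):
    # BLAS sdot: sum of elementwise products
    return sum(a * b for a, b in zip(x, y))

def nrm2(x):
    # BLAS snrm2: Euclidean (L2) norm
    return math.sqrt(sum(a * a for a in x))

def cosine_similarity(x, y):
    # Composed from the same Level-1 calls the GPU example chains
    return dot(x, y) / (nrm2(x) * nrm2(y))

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors -> 1.0
```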
**Level 2 — matrix-vector ops**

| Example | Description | Run Command |
|---|---|---|
| gemv | Matrix-vector multiply (SGEMV) | run-cublas-gemv -Dcublas=true |
| symv_syr | Symmetric matrix-vector ops | run-cublas-symv_syr -Dcublas=true |
| trmv_trsv | Triangular multiply and solve | run-cublas-trmv_trsv -Dcublas=true |
**Level 3 — matrix-matrix ops**

| Example | Description | Run Command |
|---|---|---|
| gemm | Matrix-matrix multiply (SGEMM) | run-cublas-gemm -Dcublas=true |
| gemm_batched | Strided batched GEMM | run-cublas-gemm_batched -Dcublas=true |
| gemm_ex | Mixed-precision GemmEx | run-cublas-gemm_ex -Dcublas=true |
| symm | Symmetric matrix multiply | run-cublas-symm -Dcublas=true |
| trsm | Triangular solve (STRSM) | run-cublas-trsm -Dcublas=true |
| syrk | Symmetric rank-k update | run-cublas-syrk -Dcublas=true |
| geam | Matrix add / transpose | run-cublas-geam -Dcublas=true |
| dgmm | Diagonal matrix multiply | run-cublas-dgmm -Dcublas=true |
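For reference, the contract that gemm (and its batched and mixed-precision variants) implements is C ← α·A·B + β·C. A plain-Python sketch of that math (illustrative only; cuBLAS itself operates on column-major, GPU-resident buffers):

```python
def gemm(alpha, A, B, beta, C):
    """C <- alpha*A@B + beta*C for row-major lists of lists
    (A is m x k, B is k x n, C is m x n). The math is identical
    to cuBLAS SGEMM; only the memory layout differs."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

C = [[0.0, 0.0], [0.0, 0.0]]
gemm(1.0, [[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]], 0.0, C)
print(C)  # [[19.0, 22.0], [43.0, 50.0]]
```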
## cuBLAS LT — Lightweight BLAS (1 example)

Advanced GEMM with algorithm heuristics and mixed-precision support. Enable with `-Dcublaslt=true`.
| Example | Description | Run Command |
|---|---|---|
| lt_sgemm | SGEMM with heuristic algorithm selection | run-cublaslt-lt_sgemm -Dcublaslt=true |
## cuDNN — Deep Neural Networks (3 examples)

Neural network primitives: convolution, activation, pooling, softmax. Enable with `-Dcudnn=true`.
| Example | Description | Run Command |
|---|---|---|
| activation | ReLU, sigmoid, tanh activation functions | run-cudnn-activation -Dcudnn=true |
| pooling_softmax | Max pooling + softmax pipeline | run-cudnn-pooling_softmax -Dcudnn=true |
| conv2d | 2D convolution forward pass | run-cudnn-conv2d -Dcudnn=true |
## cuFFT — Fast Fourier Transform (4 examples)

1D, 2D, and 3D FFTs with complex and real data. Enable with `-Dcufft=true`.
| Example | Description | Run Command |
|---|---|---|
| fft_1d_c2c | 1D complex-to-complex FFT | run-cufft-fft_1d_c2c -Dcufft=true |
| fft_1d_r2c | 1D real-to-complex with frequency filtering | run-cufft-fft_1d_r2c -Dcufft=true |
| fft_2d | 2D complex FFT | run-cufft-fft_2d -Dcufft=true |
| fft_3d | 3D complex FFT | run-cufft-fft_3d -Dcufft=true |
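For intuition, the transform cuFFT computes is the discrete Fourier transform X[k] = Σₙ x[n]·e^(−2πikn/N). A naive O(N²) plain-Python version (illustrative only; cuFFT uses fast O(N log N) algorithms on the GPU) makes the definition concrete:

```python
import cmath

def dft(x):
    # Naive DFT by definition: X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N)
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

print(dft([1.0, 0.0, 0.0, 0.0]))  # unit impulse -> flat spectrum, every bin ~ (1+0j)
```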
## cuRAND — Random Number Generation (3 examples)

GPU-accelerated random number generation. Enable with `-Dcurand=true`.
| Example | Description | Run Command |
|---|---|---|
| distributions | Uniform, normal, Poisson distributions | run-curand-distributions -Dcurand=true |
| generators | Generator comparison (XORWOW, MRG32k3a, …) | run-curand-generators -Dcurand=true |
| monte_carlo_pi | Monte Carlo π estimation | run-curand-monte_carlo_pi -Dcurand=true |
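The monte_carlo_pi example relies on a simple geometric fact: uniform points in the unit square land inside the quarter circle x² + y² ≤ 1 with probability π/4. A plain-Python CPU sketch of the same estimator (illustrative only; the GPU version draws its samples with cuRAND):

```python
import random

def estimate_pi(n, seed=0):
    # Count hits inside the quarter circle; the hit fraction approaches pi/4.
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n

print(estimate_pi(100_000))  # ~ 3.14
```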
## cuSOLVER — Dense Solvers (5 examples)

LU, QR, Cholesky, SVD, and eigenvalue decomposition. Enable with `-Dcusolver=true`.
> **Note:** `devInfo` is a GPU-side pointer (`CudaSlice(i32)`) per the cuSOLVER API contract. Use `stream.memcpyDtoH` after `ctx.synchronize()` to read it on the host.
| Example | Description | Run Command |
|---|---|---|
| getrf | LU factorization (PA = LU) + linear solve | run-cusolver-getrf -Dcusolver=true |
| gesvd | Singular value decomposition (A = UΣVᵀ) | run-cusolver-gesvd -Dcusolver=true |
| potrf | Cholesky factorization (A = LLᵀ) + solve | run-cusolver-potrf -Dcusolver=true |
| syevd | Symmetric eigenvalue decomposition | run-cusolver-syevd -Dcusolver=true |
| geqrf | QR factorization + Q extraction | run-cusolver-geqrf -Dcusolver=true |
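The potrf example factors a symmetric positive-definite matrix as A = L·Lᵀ. An illustrative plain-Python version of the same factorization (no GPU, no cuSOLVER; just the textbook algorithm):

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L @ L^T, for a symmetric
    positive-definite matrix given as row-major lists of lists."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)   # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]  # below-diagonal entry
    return L

print(cholesky([[4.0, 2.0], [2.0, 3.0]]))  # L = [[2, 0], [1, sqrt(2)]]
```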
## cuSPARSE — Sparse Linear Algebra (4 examples)

Sparse matrix operations over CSR and COO storage formats, including SpMV, SpMM, and SpGEMM. Enable with `-Dcusparse=true`.
| Example | Description | Run Command |
|---|---|---|
| spmv_csr | Sparse matrix-vector multiply (CSR) | run-cusparse-spmv_csr -Dcusparse=true |
| spmv_coo | Sparse matrix-vector multiply (COO) | run-cusparse-spmv_coo -Dcusparse=true |
| spmm_csr | Sparse × dense matrix multiply | run-cusparse-spmm_csr -Dcusparse=true |
| spgemm | Sparse × sparse matrix multiply | run-cusparse-spgemm -Dcusparse=true |
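For intuition about the CSR layout used by spmv_csr: row i's nonzeros live in `vals[row_ptr[i]:row_ptr[i+1]]`, with their column indices in `col_idx` at the same positions. A plain-Python reference (illustrative only, not part of this repo):

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        # Gather row i's nonzeros and multiply by the matching x entries
        y.append(sum(vals[j] * x[col_idx[j]]
                     for j in range(row_ptr[i], row_ptr[i + 1])))
    return y

# 3x3 matrix [[4,0,1],[0,2,0],[3,0,5]] in CSR form:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [4.0, 1.0, 2.0, 3.0, 5.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```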
## NVRTC — Runtime Compilation (2 examples)

Just-in-time CUDA kernel compilation.
| Example | Description | Run Command |
|---|---|---|
| jit_compile | Runtime CUDA C++ → PTX compilation | run-nvrtc-jit_compile |
| template_kernel | Multi-kernel pipeline with templated types | run-nvrtc-template_kernel |
## NVTX — Profiling Annotations (1 example)

Nsight-compatible range markers. Enable with `-Dnvtx=true`.
| Example | Description | Run Command |
|---|---|---|
| profiling | Range push/pop and point mark annotations | run-nvtx-profiling -Dnvtx=true |
## Kernel DSL — Pure Zig GPU Kernels (80 examples)

All kernels are written in pure Zig and compiled to PTX via Zig's built-in LLVM NVPTX backend. See kernel/README.md for the full index.
| Category | Examples | Topics |
|---|---|---|
| 0_Basic | 8 | SAXPY, ReLU, dot, grid-stride, normalize |
| 1_Reduction | 5 | Warp reduce, multi-block, prefix sum, scalar product |
| 2_Matrix | 6 | Naive & tiled matmul, matvec, transpose, pad, diag |
| 3_Atomics | 5 | Atomic ops, histograms, warp-aggregated atomics |
| 4_SharedMemory | 3 | Dynamic SMEM, 1D stencil, shared mem demo |
| 5_Warp | 5 | Ballot, broadcast, match, reduce, scan |
| 6_MathAndTypes | 9 | FP16, complex, FFT filter, fast math, type conversion |
| 7_Debug | 2 | Error checking, printf debug from GPU |
| 8_TensorCore | 11 | WMMA (f16/bf16/int8/tf32), MMA (f16/fp8) |
| 9_Advanced | 8 | Async copy pipeline, cooperative groups, softmax |
| 10_Integration | 24 | End-to-end pipelines and benchmarks |
## Integration — End-to-End Pipelines (24 examples)

End-to-end integration examples using Zig kernels with CUDA libraries.
```sh
# Build all integration examples
zig build example-integration -Dgpu-arch=sm_86 -Dcublas=true -Dcufft=true ...

# Run a specific binary
./zig-out/bin/integration-<name>
```

| Binary | Description |
|---|---|
| integration-module-load-launch | Driver lifecycle: PTX load + kernel launch |
| integration-ptx-compile-execute | NVRTC compile + execute pipeline |
| integration-stream-callback | Stream callback pattern (event-driven) |
| integration-stream-concurrency | Multi-stream concurrent execution |
| integration-basic-graph | CUDA Graph basics: capture and replay |
| integration-graph-replay-update | Graph replay with node update |
| integration-graph-with-deps | Graph with explicit dependencies |
| integration-scale-bias-gemm | cuBLAS Scale+Bias→GEMM→ReLU pipeline |
| integration-residual-gemm | Residual connection with GEMM |
| integration-error-recovery | CUDA error recovery patterns |
| integration-oob-launch | Out-of-bounds launch detection |
| integration-fft-filter | FFT-based filter pipeline |
| integration-conv2d-fft | 2D convolution via FFT |
| integration-occupancy-calc | Occupancy calculator utilities |
| integration-monte-carlo-option | Monte Carlo option pricing (GPU) |
| integration-particle-system | Particle system simulation |
| integration-matmul-e2e | Matrix multiply end-to-end |
| integration-reduction-e2e | Reduction end-to-end |
| integration-saxpy-e2e | SAXPY end-to-end |
| integration-multi-library | Multi-library pipeline (cuBLAS + cuDNN + cuFFT) |
| integration-wmma-gemm-verify | WMMA GEMM correctness verification |
| integration-attention-pipeline | Attention pipeline (QK^T, softmax, V) |
| integration-mixed-precision-train | Mixed-precision training pipeline (FP16+TF32) |
| integration-perf-benchmark | Zig kernel vs cuBLAS (event-timed benchmark) |