Pure-Zig GPU kernels compiled to PTX via Zig's built-in LLVM NVPTX backend. No CUDA C++ required; all kernels are written in Zig.
```sh
# Build a single kernel example
zig build example-kernel-0-basic-kernel_vector_add -Dgpu-arch=sm_86

# Build an entire category
zig build example-kernel-0-basic -Dgpu-arch=sm_86

# Run (after build, the binary is in zig-out/bin/)
./zig-out/bin/0-basic-kernel_vector_add
```

Required flag: `-Dgpu-arch=<arch>` (e.g., `sm_80`, `sm_86`, `sm_89`, `sm_90`).
Essential kernel primitives: element-wise ops, grid-stride loops, vector ops.
| Example | Description |
|---|---|
| kernel_vector_add | Element-wise vector addition |
| kernel_saxpy | SAXPY: y = α·x + y |
| kernel_relu | ReLU activation in-place |
| kernel_scale_bias | Scale + bias: y = α·x + β |
| kernel_grid_stride | Grid-stride loop pattern |
| kernel_dot_product | Two-phase parallel dot product |
| kernel_residual_norm | Residual norm computation |
| kernel_vec3_normalize | Batch 3D vector normalization |
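The grid-stride loop lets a fixed-size grid process arrays of any length: each thread starts at its global index and steps by the total thread count. A CPU sketch in Python of that indexing (grid/block sizes are illustrative, not taken from the repo's Zig code):

```python
def saxpy_grid_stride(alpha, x, y, grid_dim, block_dim):
    """CPU simulation of a grid-stride SAXPY kernel: y = alpha*x + y."""
    total_threads = grid_dim * block_dim
    for block in range(grid_dim):
        for thread in range(block_dim):
            tid = block * block_dim + thread  # global thread index
            # each thread covers elements tid, tid+stride, tid+2*stride, ...
            for i in range(tid, len(x), total_threads):
                y[i] = alpha * x[i] + y[i]
    return y

x = [1.0] * 10
y = [2.0] * 10
print(saxpy_grid_stride(3.0, x, y, grid_dim=2, block_dim=2))  # every element -> 5.0
```

Because the stride equals the grid size, the loop covers all `len(x)` elements exactly once even when the array is larger than the grid.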
Warp-level and block-level parallel reduction patterns.
| Example | Description |
|---|---|
| kernel_reduce_warp | Warp shuffle reduction |
| kernel_reduce_sum | Block-level sum reduction |
| kernel_reduce_multiblock | Multi-block two-phase reduction |
| kernel_prefix_sum | Exclusive prefix scan (Blelloch) |
| kernel_scalar_product | Scalar product via dual reduction |
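The Blelloch scan used by `kernel_prefix_sum` does a tree-shaped up-sweep (partial sums) followed by a down-sweep (distributing sums back). A sequential Python reference of that algorithm, not the shared-memory Zig kernel itself:

```python
def blelloch_exclusive_scan(a):
    """Work-efficient exclusive prefix scan (Blelloch): up-sweep then down-sweep.
    Length must be a power of two, matching the kernel's block-sized tiles."""
    n = len(a)
    a = list(a)
    # up-sweep (reduce): build partial sums in a binary tree
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # down-sweep: clear the root, then swap-and-add back down the tree
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
        d //= 2
    return a

print(blelloch_exclusive_scan([1, 2, 3, 4]))  # [0, 1, 3, 6]
```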
Matrix multiplication, transpose, extraction, and padding on GPU.
| Example | Description |
|---|---|
| kernel_matmul_naive | Naive O(NΒ³) matrix multiply |
| kernel_matmul_tiled | Tiled (shared-memory) matrix multiply |
| kernel_matvec | Matrix-vector product |
| kernel_transpose | Coalesced matrix transpose |
| kernel_extract_diag | Extract diagonal elements |
| kernel_pad_2d | 2D zero-padding |
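Tiling is the key idea behind `kernel_matmul_tiled`: each block computes one output tile from pairs of input tiles staged in shared memory. A CPU analogue in Python shows the blocked loop structure (tile size here is arbitrary):

```python
def matmul_tiled(A, B, n, tile=2):
    """Blocked n x n matmul: the CPU analogue of staging tiles in shared memory.
    A, B, C are row-major flat lists; tile must divide n."""
    C = [0.0] * (n * n)
    for bi in range(0, n, tile):          # tile row of C
        for bj in range(0, n, tile):      # tile column of C
            for bk in range(0, n, tile):  # walk tiles along the k dimension
                # on the GPU, this inner region is one shared-memory tile pair
                for i in range(bi, bi + tile):
                    for j in range(bj, bj + tile):
                        s = 0.0
                        for k in range(bk, bk + tile):
                            s += A[i * n + k] * B[k * n + j]
                        C[i * n + j] += s
    return C

A = [1.0, 0.0, 0.0, 1.0]  # 2x2 identity
B = [1.0, 2.0, 3.0, 4.0]
print(matmul_tiled(A, B, 2))  # [1.0, 2.0, 3.0, 4.0]
```

Each input tile is reused `tile` times once loaded, which is exactly the bandwidth saving the shared-memory version exploits.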
Atomic arithmetic, histograms, and warp-aggregated patterns.
| Example | Description |
|---|---|
| kernel_atomic_ops | atomicAdd, atomicMin, atomicCAS |
| kernel_histogram | Basic histogram with atomics |
| kernel_histogram_256bin | 256-bin histogram (shared mem opt) |
| kernel_warp_aggregated_atomics | Warp-aggregated atomics (1 CAS per warp) |
| kernel_system_atomics | System-scope (cross-device) atomics |
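Warp-aggregated atomics reduce contention by electing one lane per warp to issue a single atomic on behalf of all lanes targeting the same bin. A Python sketch of the aggregation logic only (the real kernel does the grouping with `__match_any_sync` and the update with `atomicAdd`):

```python
from collections import Counter

def warp_aggregated_histogram(values, num_bins, warp_size=32):
    """One simulated atomic per (warp, bin) group instead of one per thread."""
    hist = [0] * num_bins
    atomics_issued = 0
    for w in range(0, len(values), warp_size):
        warp = values[w:w + warp_size]
        # lanes with the same bin form a group; the leader adds the group count
        for bin_id, count in Counter(warp).items():
            hist[bin_id] += count  # single "atomic" for the whole group
            atomics_issued += 1
    return hist, atomics_issued

vals = [0] * 32 + [1] * 32  # two warps, each hitting a single bin
hist, n_atomics = warp_aggregated_histogram(vals, num_bins=2)
print(hist, n_atomics)  # [32, 32] 2 -- 2 atomics instead of 64
```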
Static and dynamic shared memory usage patterns.
| Example | Description |
|---|---|
| kernel_shared_mem_demo | Static shared memory bank access patterns |
| kernel_stencil_1d | 1D stencil with shared memory caching |
| kernel_dynamic_smem | Dynamic shared memory allocation |
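A 1D stencil reads each input element up to `2*radius + 1` times, which is why the kernel stages each block's tile plus its halo cells in shared memory. A plain Python reference of the computation itself (edge handling here clamps to the array bounds, which may differ from the Zig kernel's boundary policy):

```python
def stencil_1d(data, radius):
    """1D stencil: each output is the sum of a (2*radius+1)-wide window."""
    n = len(data)
    out = [0] * n
    for i in range(n):
        # window clamped at the array edges
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out[i] = sum(data[lo:hi])
    return out

print(stencil_1d([1, 1, 1, 1, 1], radius=1))  # [2, 3, 3, 3, 2]
```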
Ballot, broadcast, match, and scan using warp shuffle instructions.
| Example | Description |
|---|---|
| kernel_warp_reduce | Warp shuffle reduction (__shfl_down_sync) |
| kernel_warp_broadcast | Warp broadcast (__shfl_sync) |
| kernel_warp_scan | Warp-level inclusive prefix scan |
| kernel_ballot_vote | Ballot vote: __ballot_sync, __all_sync |
| kernel_warp_match | Match lanes with equal values (__match_any_sync) |
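A warp-level inclusive scan is built from `__shfl_up_sync`: at each doubling offset, every lane adds the value held by the lane that many positions below it. A Python simulation of the lane exchange (warp width is a parameter here; it is 32 on real hardware):

```python
def warp_inclusive_scan(lane_values):
    """Simulate an inclusive scan built from __shfl_up_sync steps."""
    vals = list(lane_values)
    width = len(vals)  # 32 on real hardware
    offset = 1
    while offset < width:
        # shfl_up: lane i reads lane i - offset (low lanes keep their value)
        shifted = [vals[i - offset] if i >= offset else 0 for i in range(width)]
        vals = [v + s for v, s in zip(vals, shifted)]
        offset *= 2
    return vals

print(warp_inclusive_scan([1, 1, 1, 1]))  # [1, 2, 3, 4]
```

After log2(width) steps every lane holds the sum of all lanes up to and including itself, with no shared memory touched.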
FP16, complex numbers, type conversion, and math intrinsics.
| Example | Description |
|---|---|
| kernel_half_precision | FP16 arithmetic and conversion |
| kernel_complex_mul | Complex multiply on GPU |
| kernel_fast_math | Fast math approximations (__fmaf_rn, rsqrtf) |
| kernel_integer_intrinsics | __popc, __clz, __brev, __ffs |
| kernel_type_conversion | i8/u8/f16/bf16/f32/f64 conversions |
| kernel_math_test | Full math function coverage |
| kernel_sigmoid | Sigmoid activation |
| kernel_freq_filter | Frequency-domain filter (complex multiply) |
| kernel_signal_gen | Waveform signal generation (sin, cos) |
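FP16 conversion rounds an f32 value to the nearest IEEE binary16 number, losing precision for most inputs. Python's `struct` module supports the binary16 format (`'e'`), so the rounding behavior the conversion kernel exercises can be checked on the CPU:

```python
import struct

def round_to_f16(x):
    """Round a float through IEEE binary16 and back to double."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(round_to_f16(1.0))        # 1.0 is exactly representable
print(round_to_f16(1 / 3))      # 1/3 is not: a nearby half value comes back
```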
Error checking and GPU-side printf debugging.
| Example | Description |
|---|---|
| kernel_error_check | CUDA error detection and reporting |
| kernel_printf_debug | GPU-side printf for thread-level debugging |
WMMA and MMA matrix fragments: f16, bf16, int8, tf32, fp8.
| Example | Architecture | Description |
|---|---|---|
| kernel_wmma_gemm_f16 | sm_70+ | WMMA FP16 GEMM |
| kernel_wmma_gemm_bf16 | sm_80+ | WMMA BF16 GEMM |
| kernel_wmma_gemm_int8 | sm_72+ | WMMA INT8 GEMM |
| kernel_wmma_gemm_tf32 | sm_80+ | WMMA TF32 GEMM |
| kernel_mma_gemm_f16 | sm_70+ | MMA m16n8k16 FP16 inline PTX |
| kernel_mma_gemm_fp8 | sm_89+ | MMA FP8 (Ada/Hopper) |
| + 5 PTX variants | — | Pre-compiled PTX references (.ptx files) |
WMMA = Warp Matrix Multiply-Accumulate (the WMMA API). MMA = inline PTX `mma.sync`, i.e., the exact hardware instructions.
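The numeric contract shared by the f16 tensor-core paths is mixed precision: operands are rounded to binary16, but products are accumulated in f32. A scalar Python reference of that contract (tile shapes like m16n8k16 are elided; this is not the repo's Zig code):

```python
import struct

def to_f16(x):
    """Round to IEEE binary16, the operand precision of the f16 fragments."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def gemm_f16_in_f32_acc(A, B, m, n, k):
    """Mixed-precision GEMM reference: f16 operands, full-precision accumulate.
    A is m x k and B is k x n, both row-major flat lists."""
    C = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0  # accumulator stays in full precision
            for p in range(k):
                acc += to_f16(A[i * k + p]) * to_f16(B[p * n + j])
            C[i * n + j] = acc
    return C

print(gemm_f16_in_f32_acc([1.0, 2.0], [3.0, 4.0], 1, 1, 2))  # [11.0]
```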
Async copy pipelines, cooperative groups, and complex algorithms.
| Example | Description |
|---|---|
| kernel_async_copy_pipeline | cp.async pipelining (sm_80+ cp.async) |
| kernel_cooperative_groups | Cooperative groups: grid sync |
| kernel_softmax | Online softmax (numerically stable) |
| kernel_thread_fence | __threadfence memory ordering |
| kernel_particle_init | Particle system initialization |
| kernel_particle_step | Particle physics step integration |
| kernel_gbm_paths | Geometric Brownian Motion paths |
| kernel_intrinsics_coverage | Comprehensive intrinsics coverage test |
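Online softmax avoids overflow in a single pass by tracking a running maximum and rescaling the running sum whenever a larger element appears. A Python reference of that trick (the GPU kernel additionally parallelizes the pass across threads):

```python
import math

def online_softmax(xs):
    """One-pass numerically stable softmax."""
    m = float('-inf')  # running maximum
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        new_m = max(m, x)
        s = s * math.exp(m - new_m) + math.exp(x - new_m)
        m = new_m
    return [math.exp(x - m) / s for x in xs]

probs = online_softmax([1000.0, 1000.0])  # naive exp(1000.0) would overflow
print(probs)  # [0.5, 0.5]
```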
Multi-library pipelines and benchmarks, built with `zig build example-integration`.
See ../README.md#integration-examples for the full list.