
# Kernel DSL Examples

Pure-Zig GPU kernels compiled to PTX via Zig's built-in LLVM NVPTX backend. No CUDA C++ required; all kernels are written in Zig.

## Building

```sh
# Build a single kernel example
zig build example-kernel-0-basic-kernel_vector_add -Dgpu-arch=sm_86

# Build an entire category
zig build example-kernel-0-basic -Dgpu-arch=sm_86

# Run (after build, binary is in zig-out/bin/)
./zig-out/bin/0-basic-kernel_vector_add
```

**Required flag:** `-Dgpu-arch=<arch>` (e.g., `sm_80`, `sm_86`, `sm_89`, `sm_90`).


## 0_Basic – Core Kernel Patterns (8 examples)

Essential kernel primitives: element-wise ops, grid-stride loops, vector ops.

| Example | Description |
| --- | --- |
| `kernel_vector_add` | Element-wise vector addition |
| `kernel_saxpy` | SAXPY: y = α·x + y |
| `kernel_relu` | ReLU activation in-place |
| `kernel_scale_bias` | Scale + bias: y = α·x + β |
| `kernel_grid_stride` | Grid-stride loop pattern |
| `kernel_dot_product` | Two-phase parallel dot product |
| `kernel_residual_norm` | Residual norm computation |
| `kernel_vec3_normalize` | Batch 3D vector normalization |
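The grid-stride pattern behind `kernel_grid_stride` decouples grid size from problem size: each thread starts at its global index and advances by the total thread count until it runs off the end of the array. A host-side Python model of that indexing — not the Zig kernel itself, and `grid_stride_add` is a name made up for illustration:

```python
# Host-side Python model of grid-stride indexing.
# Each simulated thread starts at its global index and strides by the
# total number of threads, so a fixed-size grid covers any array length.
def grid_stride_add(a, b, grid_dim, block_dim):
    n = len(a)
    out = [0] * n
    total_threads = grid_dim * block_dim
    for block in range(grid_dim):
        for thread in range(block_dim):
            start = block * block_dim + thread        # global thread index
            for i in range(start, n, total_threads):  # grid-stride loop
                out[i] = a[i] + b[i]
    return out
```

With 2 blocks of 2 threads, 4 simulated threads still cover all 5 elements because each loops with stride 4.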

## 1_Reduction – Parallel Reductions (5 examples)

Warp-level and block-level parallel reduction patterns.

| Example | Description |
| --- | --- |
| `kernel_reduce_warp` | Warp shuffle reduction |
| `kernel_reduce_sum` | Block-level sum reduction |
| `kernel_reduce_multiblock` | Multi-block two-phase reduction |
| `kernel_prefix_sum` | Exclusive prefix scan (Blelloch) |
| `kernel_scalar_product` | Scalar product via dual reduction |
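The Blelloch scan named for `kernel_prefix_sum` is a work-efficient exclusive scan with two phases: an up-sweep that builds partial sums in a balanced tree, and a down-sweep that pushes prefixes back down. A serial Python model of the two phases, assuming a power-of-two input length (`blelloch_exclusive_scan` is an illustrative name, not from the repo):

```python
def blelloch_exclusive_scan(data):
    """Serial model of the work-efficient Blelloch exclusive scan.
    The GPU version runs each tree level in parallel across threads;
    len(data) is assumed to be a power of two."""
    n = len(data)
    a = list(data)
    # Up-sweep (reduce): build partial sums in a balanced binary tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] += a[i + d - 1]
        d *= 2
    # Down-sweep: clear the root, then swap-and-accumulate downward.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] += t
        d //= 2
    return a
```

Being exclusive, element i of the result is the sum of all inputs before i, with 0 first.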

## 2_Matrix – Matrix Operations (6 examples)

Matrix multiplication, transpose, extraction, and padding on GPU.

| Example | Description |
| --- | --- |
| `kernel_matmul_naive` | Naive O(N³) matrix multiply |
| `kernel_matmul_tiled` | Tiled (shared-memory) matrix multiply |
| `kernel_matvec` | Matrix-vector product |
| `kernel_transpose` | Coalesced matrix transpose |
| `kernel_extract_diag` | Extract diagonal elements |
| `kernel_pad_2d` | 2D zero-padding |
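A tiled multiply like `kernel_matmul_tiled` cuts the computation into tiles so a thread block can stage sub-matrices of A and B in shared memory and reuse them across the inner product. A serial Python sketch of the same loop structure — the staging step is only indicated by a comment, and `matmul_tiled` is a hypothetical name:

```python
# Serial model of the loop structure in a tiled GEMM.
# On the GPU, the k0 loop is where a thread block copies the current
# A and B tiles into shared memory; here tiling only reorders iteration.
def matmul_tiled(A, B, n, tile):
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # shared-memory staging of the (i0,k0) and (k0,j0) tiles goes here
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Any tile size gives the same result as the naive O(N³) multiply; only memory reuse changes.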

## 3_Atomics – Atomic Operations (5 examples)

Atomic arithmetic, histograms, and warp-aggregated patterns.

| Example | Description |
| --- | --- |
| `kernel_atomic_ops` | `atomicAdd`, `atomicMin`, `atomicCAS` |
| `kernel_histogram` | Basic histogram with atomics |
| `kernel_histogram_256bin` | 256-bin histogram (shared-memory optimized) |
| `kernel_warp_aggregated_atomics` | Warp-aggregated atomics (1 CAS per warp) |
| `kernel_system_atomics` | System-scope (cross-device) atomics |
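Warp aggregation cuts atomic contention by electing one leader per warp, which issues a single atomic add of the warp's active-lane count instead of up to 32 separate atomics. A Python model of the bookkeeping, assuming 32-lane warps (`warp_aggregated_count` is an illustrative name, not the kernel's API):

```python
# Model of warp-aggregated atomics: active threads in each 32-lane warp
# elect a leader, and only the leader issues one atomic add of the
# warp's ballot popcount. Returns (counter value, atomics issued).
def warp_aggregated_count(flags, warp_size=32):
    counter = 0
    atomics_issued = 0
    for w in range(0, len(flags), warp_size):
        active = sum(1 for f in flags[w:w + warp_size] if f)  # ballot popcount
        if active:                 # leader lane acts for the whole warp
            counter += active
            atomics_issued += 1    # one atomic instead of `active` atomics
    return counter, atomics_issued
```

The final counter is identical to per-thread atomics; only the number of contended operations drops.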

## 4_SharedMemory – Shared Memory (3 examples)

Static and dynamic shared memory usage patterns.

| Example | Description |
| --- | --- |
| `kernel_shared_mem_demo` | Static shared-memory bank access patterns |
| `kernel_stencil_1d` | 1D stencil with shared-memory caching |
| `kernel_dynamic_smem` | Dynamic shared-memory allocation |
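A 1D stencil kernel typically has each block load its tile plus `radius` halo cells into shared memory before any thread computes its window. A serial Python model of the arithmetic only — the halo staging is omitted, and that the stencil is a plain window sum is an assumption for illustration:

```python
# Serial model of a radius-R window-sum stencil. The GPU version first
# caches each block's tile plus R halo cells in shared memory so each
# input element is read from global memory only once per block.
def stencil_1d(data, radius):
    n = len(data)
    out = [0] * n
    for i in range(n):
        for off in range(-radius, radius + 1):
            j = i + off
            if 0 <= j < n:          # out-of-range neighbors contribute zero
                out[i] += data[j]
    return out
```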

## 5_Warp – Warp Intrinsics (5 examples)

Ballot, broadcast, match, and scan using warp shuffle instructions.

| Example | Description |
| --- | --- |
| `kernel_warp_reduce` | Warp shuffle reduction (`__shfl_down_sync`) |
| `kernel_warp_broadcast` | Warp broadcast (`__shfl_sync`) |
| `kernel_warp_scan` | Warp-level inclusive prefix scan |
| `kernel_ballot_vote` | Ballot vote: `__ballot_sync`, `__all_sync` |
| `kernel_warp_match` | Match lanes with equal values (`__match_any_sync`) |
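A shuffle-down reduction halves the active lane range each step: every lane adds the value held `offset` lanes above it, so after log₂(32) steps lane 0 holds the warp sum. A Python model of that dataflow for a power-of-two lane count (`warp_shuffle_reduce` is illustrative, not the kernel's entry point):

```python
# Dataflow model of a shuffle-down warp reduction: at each step every
# lane adds the value `offset` lanes above it, halving the active range.
# On the GPU all lanes do this in lockstep via __shfl_down_sync.
def warp_shuffle_reduce(lane_vals):
    vals = list(lane_vals)          # one value per lane; width is a power of two
    offset = len(vals) // 2
    while offset > 0:
        for lane in range(offset):  # source lanes are untouched this step
            vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]                  # lane 0 holds the warp sum
```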

## 6_MathAndTypes – Math & Type Operations (9 examples)

FP16, complex numbers, type conversion, and math intrinsics.

| Example | Description |
| --- | --- |
| `kernel_half_precision` | FP16 arithmetic and conversion |
| `kernel_complex_mul` | Complex multiply on GPU |
| `kernel_fast_math` | Fast-math approximations (`__fmaf_rn`, `rsqrtf`) |
| `kernel_integer_intrinsics` | `__popc`, `__clz`, `__brev`, `__ffs` |
| `kernel_type_conversion` | i8/u8/f16/bf16/f32/f64 conversions |
| `kernel_math_test` | Full math function coverage |
| `kernel_sigmoid` | Sigmoid activation |
| `kernel_freq_filter` | Frequency-domain filter (complex multiply) |
| `kernel_signal_gen` | Waveform signal generation (sin, cos) |
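Naive `1/(1+exp(-x))` overflows for large negative x, so sigmoid kernels usually branch on the sign so that `exp` only ever sees a non-positive argument. A Python sketch of that stable formulation — whether `kernel_sigmoid` uses exactly this form is an assumption, not taken from the source:

```python
import math

# Sign-branched sigmoid: both branches call exp() only with a
# non-positive argument, so nothing overflows for large |x|.
def sigmoid(x):
    if x >= 0:
        z = math.exp(-x)
        return 1.0 / (1.0 + z)
    z = math.exp(x)          # x < 0, so exp cannot overflow
    return z / (1.0 + z)
```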

## 7_Debug – Debug Utilities (2 examples)

Error checking and GPU-side printf debugging.

| Example | Description |
| --- | --- |
| `kernel_error_check` | CUDA error detection and reporting |
| `kernel_printf_debug` | GPU-side printf for thread-level debugging |

## 8_TensorCore – Tensor Core Operations (11 examples)

WMMA and MMA matrix fragments: f16, bf16, int8, tf32, fp8.

| Example | Architecture | Description |
| --- | --- | --- |
| `kernel_wmma_gemm_f16` | sm_70+ | WMMA FP16 GEMM |
| `kernel_wmma_gemm_bf16` | sm_80+ | WMMA BF16 GEMM |
| `kernel_wmma_gemm_int8` | sm_72+ | WMMA INT8 GEMM |
| `kernel_wmma_gemm_tf32` | sm_80+ | WMMA TF32 GEMM |
| `kernel_mma_gemm_f16` | sm_70+ | MMA m16n8k16 FP16 via inline PTX |
| `kernel_mma_gemm_fp8` | sm_89+ | MMA FP8 (Ada/Hopper) |
| + 5 PTX variants | – | Pre-compiled PTX references (`.ptx` files) |

WMMA uses the high-level Warp Matrix Multiply-Accumulate API; MMA issues the exact hardware instructions via inline PTX `mma.sync`.


## 9_Advanced – Advanced Patterns (8 examples)

Async copy pipelines, cooperative groups, and complex algorithms.

| Example | Description |
| --- | --- |
| `kernel_async_copy_pipeline` | `cp.async` copy pipelining (sm_80+) |
| `kernel_cooperative_groups` | Cooperative groups: grid-wide sync |
| `kernel_softmax` | Online softmax (numerically stable) |
| `kernel_thread_fence` | `__threadfence` memory ordering |
| `kernel_particle_init` | Particle system initialization |
| `kernel_particle_step` | Particle physics step integration |
| `kernel_gbm_paths` | Geometric Brownian Motion paths |
| `kernel_intrinsics_coverage` | Comprehensive intrinsics coverage test |
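Online softmax fuses the separate max and sum passes: it keeps a running maximum and rescales the running sum of exponentials whenever the maximum changes, so every `exp` argument stays non-positive. A Python model of the scheme `kernel_softmax` is described as using (`online_softmax` is an illustrative name):

```python
import math

# One-pass online softmax: maintain a running max m and a running sum s
# of exp(x - m); when a new max arrives, rescale s by exp(old_m - new_m).
def online_softmax(xs):
    m = float("-inf")
    s = 0.0
    for x in xs:
        if x > m:
            s = s * math.exp(m - x) + 1.0  # rescale old sum to the new max
            m = x
        else:
            s += math.exp(x - m)
    return [math.exp(x - m) / s for x in xs]
```

The single-pass form is what lets a kernel compute softmax over streamed tiles without re-reading the input.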

## 10_Integration – End-to-End Pipelines (24 examples)

Multi-library pipelines and benchmarks, built with `zig build example-integration`.

See [../README.md#integration-examples](../README.md#integration-examples) for the full list.