Pure-Zig GPU kernels compiled to PTX via Zig's built-in LLVM NVPTX backend. No CUDA C++ required; all kernels are written in Zig.
```sh
# Build a single kernel example
zig build example-kernel-0-basic-kernel_vector_add -Dgpu-arch=sm_86

# Build an entire category
zig build example-kernel-0-basic -Dgpu-arch=sm_86

# Run (after build, the binary is in zig-out/bin/)
./zig-out/bin/0-basic-kernel_vector_add
```

Required flag: `-Dgpu-arch=<arch>` (e.g., `sm_80`, `sm_86`, `sm_89`, `sm_90`).
Essential kernel primitives: element-wise ops, grid-stride loops, vector ops.
| Example | Description |
|---|---|
| kernel_vector_add | Element-wise vector addition |
| kernel_saxpy | SAXPY: y = α·x + y |
| kernel_relu | ReLU activation in-place |
| kernel_scale_bias | Scale + bias: y = α·x + β |
| kernel_grid_stride | Grid-stride loop pattern |
| kernel_dot_product | Two-phase parallel dot product |
| kernel_residual_norm | Residual norm computation |
| kernel_vec3_normalize | Batch 3D vector normalization |
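The grid-stride loop lets a fixed-size grid process arrays of any length: each thread starts at its global index and steps by the total thread count. A CPU sketch in Python of that indexing (grid/block sizes are illustrative, not taken from the repo's Zig code):

```python
def saxpy_grid_stride(alpha, x, y, grid_dim, block_dim):
    """CPU simulation of a grid-stride SAXPY kernel: y = alpha*x + y."""
    total_threads = grid_dim * block_dim
    for block in range(grid_dim):
        for thread in range(block_dim):
            tid = block * block_dim + thread  # global thread index
            # each thread covers elements tid, tid+stride, tid+2*stride, ...
            for i in range(tid, len(x), total_threads):
                y[i] = alpha * x[i] + y[i]
    return y

x = [1.0] * 10
y = [2.0] * 10
print(saxpy_grid_stride(3.0, x, y, grid_dim=2, block_dim=2))  # every element -> 5.0
```

Because the stride equals the grid size, the loop covers all `len(x)` elements exactly once even when the array is larger than the grid.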
Warp-level and block-level parallel reduction patterns.
| Example | Description |
|---|---|
| kernel_reduce_warp | Warp shuffle reduction |
| kernel_reduce_sum | Block-level sum reduction |
| kernel_reduce_multiblock | Multi-block two-phase reduction |
| kernel_prefix_sum | Exclusive prefix scan (Blelloch) |
| kernel_scalar_product | Scalar product via dual reduction |
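The Blelloch scan used by `kernel_prefix_sum` does a tree-shaped up-sweep (partial sums) followed by a down-sweep (distributing sums back). A sequential Python reference of that algorithm, not the shared-memory Zig kernel itself:

```python
def blelloch_exclusive_scan(a):
    """Work-efficient exclusive prefix scan (Blelloch): up-sweep then down-sweep.
    Length must be a power of two, matching the kernel's block-sized tiles."""
    n = len(a)
    a = list(a)
    # up-sweep (reduce): build partial sums in a binary tree
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # down-sweep: clear the root, then swap-and-add back down the tree
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
        d //= 2
    return a

print(blelloch_exclusive_scan([1, 2, 3, 4]))  # [0, 1, 3, 6]
```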
Matrix multiplication, transpose, extraction, and padding on GPU.
| Example | Description |
|---|---|
| kernel_matmul_naive | Naive O(NΒ³) matrix multiply |
| kernel_matmul_tiled | Tiled (shared-memory) matrix multiply |
| kernel_matvec | Matrix-vector product |
| kernel_transpose | Coalesced matrix transpose |
| kernel_extract_diag | Extract diagonal elements |
| kernel_pad_2d | 2D zero-padding |
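Tiling is the key idea behind `kernel_matmul_tiled`: each block computes one output tile from pairs of input tiles staged in shared memory. A CPU analogue in Python shows the blocked loop structure (tile size here is arbitrary):

```python
def matmul_tiled(A, B, n, tile=2):
    """Blocked n x n matmul: the CPU analogue of staging tiles in shared memory.
    A, B, C are row-major flat lists; tile must divide n."""
    C = [0.0] * (n * n)
    for bi in range(0, n, tile):          # tile row of C
        for bj in range(0, n, tile):      # tile column of C
            for bk in range(0, n, tile):  # walk tiles along the k dimension
                # on the GPU, this inner region is one shared-memory tile pair
                for i in range(bi, bi + tile):
                    for j in range(bj, bj + tile):
                        s = 0.0
                        for k in range(bk, bk + tile):
                            s += A[i * n + k] * B[k * n + j]
                        C[i * n + j] += s
    return C

A = [1.0, 0.0, 0.0, 1.0]  # 2x2 identity
B = [1.0, 2.0, 3.0, 4.0]
print(matmul_tiled(A, B, 2))  # [1.0, 2.0, 3.0, 4.0]
```

Each input tile is reused `tile` times once loaded, which is exactly the bandwidth saving the shared-memory version exploits.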
Atomic arithmetic, histograms, and warp-aggregated patterns.
| Example | Description |
|---|---|
| kernel_atomic_ops | atomicAdd, atomicMin, atomicCAS |
| kernel_histogram | Basic histogram with atomics |
| kernel_histogram_256bin | 256-bin histogram (shared mem opt) |
| kernel_warp_aggregated_atomics | Warp-aggregated atomics (1 CAS per warp) |
| kernel_system_atomics | System-scope (cross-device) atomics |
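Warp-aggregated atomics reduce contention by electing one lane per warp to issue a single atomic on behalf of all lanes targeting the same bin. A Python sketch of the aggregation logic only (the real kernel does the grouping with `__match_any_sync` and the update with `atomicAdd`):

```python
from collections import Counter

def warp_aggregated_histogram(values, num_bins, warp_size=32):
    """One simulated atomic per (warp, bin) group instead of one per thread."""
    hist = [0] * num_bins
    atomics_issued = 0
    for w in range(0, len(values), warp_size):
        warp = values[w:w + warp_size]
        # lanes with the same bin form a group; the leader adds the group count
        for bin_id, count in Counter(warp).items():
            hist[bin_id] += count  # single "atomic" for the whole group
            atomics_issued += 1
    return hist, atomics_issued

vals = [0] * 32 + [1] * 32  # two warps, each hitting a single bin
hist, n_atomics = warp_aggregated_histogram(vals, num_bins=2)
print(hist, n_atomics)  # [32, 32] 2 -- 2 atomics instead of 64
```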
Static and dynamic shared memory usage patterns.
| Example | Description |
|---|---|
| kernel_shared_mem_demo | Static shared memory bank access patterns |
| kernel_stencil_1d | 1D stencil with shared memory caching |
| kernel_dynamic_smem | Dynamic shared memory allocation |
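A 1D stencil reads each input element up to `2*radius + 1` times, which is why the kernel stages each block's tile plus its halo cells in shared memory. A plain Python reference of the computation itself (edge handling here clamps to the array bounds, which may differ from the Zig kernel's boundary policy):

```python
def stencil_1d(data, radius):
    """1D stencil: each output is the sum of a (2*radius+1)-wide window."""
    n = len(data)
    out = [0] * n
    for i in range(n):
        # window clamped at the array edges
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out[i] = sum(data[lo:hi])
    return out

print(stencil_1d([1, 1, 1, 1, 1], radius=1))  # [2, 3, 3, 3, 2]
```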
Ballot, broadcast, match, and scan using warp shuffle instructions.
| Example | Description |
|---|---|
| kernel_warp_reduce | Warp shuffle reduction (__shfl_down_sync) |
| kernel_warp_broadcast | Warp broadcast (__shfl_sync) |
| kernel_warp_scan | Warp-level inclusive prefix scan |
| kernel_ballot_vote | Ballot vote: __ballot_sync, __all_sync |
| kernel_warp_match | Match lanes with equal values (__match_any_sync) |
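A warp-level inclusive scan is built from `__shfl_up_sync`: at each doubling offset, every lane adds the value held by the lane that many positions below it. A Python simulation of the lane exchange (warp width is a parameter here; it is 32 on real hardware):

```python
def warp_inclusive_scan(lane_values):
    """Simulate an inclusive scan built from __shfl_up_sync steps."""
    vals = list(lane_values)
    width = len(vals)  # 32 on real hardware
    offset = 1
    while offset < width:
        # shfl_up: lane i reads lane i - offset (low lanes keep their value)
        shifted = [vals[i - offset] if i >= offset else 0 for i in range(width)]
        vals = [v + s for v, s in zip(vals, shifted)]
        offset *= 2
    return vals

print(warp_inclusive_scan([1, 1, 1, 1]))  # [1, 2, 3, 4]
```

After log2(width) steps every lane holds the sum of all lanes up to and including itself, with no shared memory touched.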
FP16, complex numbers, type conversion, and math intrinsics.
| Example | Description |
|---|---|
| kernel_half_precision | FP16 arithmetic and conversion |
| kernel_complex_mul | Complex multiply on GPU |
| kernel_fast_math | Fast math approximations (__fmaf_rn, rsqrtf) |
| kernel_integer_intrinsics | __popc, __clz, __brev, __ffs |
| kernel_type_conversion | i8/u8/f16/bf16/f32/f64 conversions |
| kernel_math_test | Full math function coverage |
| kernel_sigmoid | Sigmoid activation |
| kernel_freq_filter | Frequency-domain filter (complex multiply) |
| kernel_signal_gen | Waveform signal generation (sin, cos) |
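FP16 conversion rounds an f32 value to the nearest IEEE binary16 number, losing precision for most inputs. Python's `struct` module supports the binary16 format (`'e'`), so the rounding behavior the conversion kernel exercises can be checked on the CPU:

```python
import struct

def round_to_f16(x):
    """Round a float through IEEE binary16 and back to double."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(round_to_f16(1.0))        # 1.0 is exactly representable
print(round_to_f16(1 / 3))      # 1/3 is not: a nearby half value comes back
```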
Error checking and GPU-side printf debugging.
| Example | Description |
|---|---|
| kernel_error_check | CUDA error detection and reporting |
| kernel_printf_debug | GPU-side printf for thread-level debugging |
WMMA and MMA matrix fragments: f16, bf16, int8, tf32, fp8.
| Example | Architecture | Description |
|---|---|---|
| kernel_wmma_gemm_f16 | sm_70+ | WMMA FP16 GEMM |
| kernel_wmma_gemm_bf16 | sm_80+ | WMMA BF16 GEMM |
| kernel_wmma_gemm_int8 | sm_72+ | WMMA INT8 GEMM |
| kernel_wmma_gemm_tf32 | sm_80+ | WMMA TF32 GEMM |
| kernel_mma_gemm_f16 | sm_70+ | MMA m16n8k16 FP16 inline PTX |
| kernel_mma_gemm_fp8 | sm_89+ | MMA FP8 (Ada/Hopper) |
| + 5 PTX variants | — | Pre-compiled PTX references (.ptx files) |
WMMA = Warp Matrix Multiply-Accumulate (the WMMA API). MMA = inline PTX `mma.sync`, i.e., the exact hardware instructions.
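The numeric contract shared by the f16 tensor-core paths is mixed precision: operands are rounded to binary16, but products are accumulated in f32. A scalar Python reference of that contract (tile shapes like m16n8k16 are elided; this is not the repo's Zig code):

```python
import struct

def to_f16(x):
    """Round to IEEE binary16, the operand precision of the f16 fragments."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def gemm_f16_in_f32_acc(A, B, m, n, k):
    """Mixed-precision GEMM reference: f16 operands, full-precision accumulate.
    A is m x k and B is k x n, both row-major flat lists."""
    C = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0  # accumulator stays in full precision
            for p in range(k):
                acc += to_f16(A[i * k + p]) * to_f16(B[p * n + j])
            C[i * n + j] = acc
    return C

print(gemm_f16_in_f32_acc([1.0, 2.0], [3.0, 4.0], 1, 1, 2))  # [11.0]
```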
Async copy pipelines, cooperative groups, and complex algorithms.
| Example | Description |
|---|---|
| kernel_async_copy_pipeline | cp.async pipelining (sm_80+ cp.async) |
| kernel_cooperative_groups | Cooperative groups: grid sync |
| kernel_softmax | Online softmax (numerically stable) |
| kernel_thread_fence | __threadfence memory ordering |
| kernel_particle_init | Particle system initialization |
| kernel_particle_step | Particle physics step integration |
| kernel_gbm_paths | Geometric Brownian Motion paths |
| kernel_intrinsics_coverage | Comprehensive intrinsics coverage test |
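Online softmax avoids overflow in a single pass by tracking a running maximum and rescaling the running sum whenever a larger element appears. A Python reference of that trick (the GPU kernel additionally parallelizes the pass across threads):

```python
import math

def online_softmax(xs):
    """One-pass numerically stable softmax."""
    m = float('-inf')  # running maximum
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        new_m = max(m, x)
        s = s * math.exp(m - new_m) + math.exp(x - new_m)
        m = new_m
    return [math.exp(x - m) / s for x in xs]

probs = online_softmax([1000.0, 1000.0])  # naive exp(1000.0) would overflow
print(probs)  # [0.5, 0.5]
```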
Multi-library pipelines and benchmarks, built with `zig build example-integration`.
See ../README.md#integration-examples for the full list.