One Kernel a Day Keeps High Latency Away.
Welcome to your daily dose of CUDA programming! Vitamin-CUDA is a curated collection of hands-on CUDA practices, designed to take you from Hello World to High Performance. Whether you are a beginner looking to understand the grid-stride loop or an enthusiast diving into warp-level primitives, there's a kernel here for you.
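If the grid-stride loop is new to you, here is a minimal sketch of the pattern (the kernel name `scale_kernel` and the launch configuration are illustrative only, not kernels from this repo):

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
// so a single fixed-size launch covers any problem size n.
__global__ void scale_kernel(const float* x, float* y, float alpha, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        y[i] = alpha * x[i];
    }
}

// Example launch: the grid size is capped, and the loop (not the grid) covers n.
// scale_kernel<<<256, 256>>>(d_x, d_y, 2.0f, n);
```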
Let's get started and happy coding!
- NVIDIA GPU (Compute Capability 6.0+)
- CUDA Toolkit 11.0+
- C++ Compiler (GCC/Clang/MSVC)
- CMake 3.18+ (Optional, but recommended)
- PyTorch (for the extension examples, Python bindings, and performance comparisons)
I recommend using the NVIDIA PyTorch NGC Docker images for a quick start! Refer to https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
All kernels were tested on an RTX 5060 GPU (unless otherwise specified) and benchmarked against PyTorch 2.9.
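Each kernel group below ends with "pytorch op bindings && diff check". As a rough illustration of what such a binding can look like (the function names and signatures here are illustrative, not the repo's actual API), a kernel is wrapped in a launcher that takes torch::Tensor arguments and exposed through pybind11:

```cuda
// add_binding.cu -- illustrative sketch of binding a CUDA kernel as a PyTorch op.
#include <torch/extension.h>

__global__ void add_f32_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Launcher: validates inputs, allocates the output, and launches the kernel.
torch::Tensor add_f32(torch::Tensor a, torch::Tensor b) {
    TORCH_CHECK(a.is_cuda() && b.is_cuda(), "inputs must be CUDA tensors");
    TORCH_CHECK(a.numel() == b.numel(), "size mismatch");
    auto c = torch::empty_like(a);
    int n = static_cast<int>(a.numel());
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_f32_kernel<<<blocks, threads>>>(
        a.data_ptr<float>(), b.data_ptr<float>(), c.data_ptr<float>(), n);
    return c;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("add_f32", &add_f32, "elementwise add (fp32, CUDA)");
}
```

On the Python side such a module can be built and loaded with torch.utils.cpp_extension.load, and the diff check is typically a torch.testing.assert_close against the corresponding built-in PyTorch op.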
- elementwise: elementwise add
- elementwise_add fp32/fp16 versions
- elementwise_add_fp16x2 (fp16 vectorized; see the half2 sketch below)
- elementwise_add_fp16x8 (fp16 vectorized)
- elementwise_add_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
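The "fp16 vectorized" and "packed r/w" variants in this group, and in the activation groups that follow (sigmoid, swish, relu, relu6, elu, gelu, hardswish), all rely on the same idea: pack several fp16 values into one wide load/store and use the paired half2 intrinsics. A minimal sketch of the fp16x2 case, assuming n is even and the pointers are half2-aligned (names are illustrative):

```cuda
#include <cuda_fp16.h>

// Each thread handles two fp16 elements: one half2 load per operand,
// one __hadd2 (two fp16 adds in a single instruction), one half2 store.
__global__ void elementwise_add_f16x2_kernel(const half* a, const half* b,
                                             half* c, int n) {
    int idx = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (idx + 1 < n) {
        half2 ra = *reinterpret_cast<const half2*>(a + idx);
        half2 rb = *reinterpret_cast<const half2*>(b + idx);
        *reinterpret_cast<half2*>(c + idx) = __hadd2(ra, rb);
    }
}
```

The fp16x8 variants extend this to eight elements per thread, typically through 128-bit loads/stores so that global-memory reads and writes stay fully packed.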
- sigmoid
- sigmoid fp32/fp16 versions
- sigmoid_fp16x2 (fp16 vectorized)
- sigmoid_fp16x8 (fp16 vectorized)
- sigmoid_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- swish
- swish fp32/fp16 versions
- swish_fp16x2 (fp16 vectorized)
- swish_fp16x8 (fp16 vectorized)
- swish_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- relu
- relu fp32/fp16 versions
- relu_fp16x2 (fp16 vectorized)
- relu_fp16x8 (fp16 vectorized)
- relu_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- relu6
- relu6 fp32/fp16 versions
- relu6_fp16x2 (fp16 vectorized)
- relu6_fp16x8 (fp16 vectorized)
- relu6_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- elu
- elu fp32/fp16 versions
- elu_fp16x2 (fp16 vectorized)
- elu_fp16x8 (fp16 vectorized)
- elu_fp16x8 (fp16 vectorized, packed r/w; roughly a 2x gain from half2)
- pytorch op bindings && diff check
- gelu
- gelu fp32/fp16 versions
- gelu_fp16x2 (fp16 vectorized)
- gelu_fp16x8 (fp16 vectorized)
- gelu_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- hardswish
- hardswish fp32/fp16 versions
- hardswish_fp16x2 (fp16 vectorized)
- hardswish_fp16x8 (fp16 vectorized)
- hardswish_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- embedding
- embedding fp32/fp16 versions
- embedding_fp32x4 (fp32 vectorized)
- embedding_fp32x4 (fp32 vectorized, packed r/w; see the gather sketch below)
- embedding_fp16x2 (fp16 vectorized)
- embedding_fp16x8 (fp16 vectorized)
- embedding_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
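The embedding kernels are row gathers: each output row copies the weight row selected by a token index. A hedged fp32x4 sketch, assuming emb_dim is a multiple of 4 and the pointers are 16-byte aligned (argument names are illustrative):

```cuda
// One block per token: the block's threads copy the selected embedding row
// with float4 (128-bit) loads/stores so global-memory traffic stays packed.
__global__ void embedding_f32x4_kernel(const int* indices, const float* weight,
                                       float* out, int emb_dim) {
    int token = blockIdx.x;           // which output row we are filling
    int row = indices[token];         // which weight row to gather
    const float4* src = reinterpret_cast<const float4*>(weight + (size_t)row * emb_dim);
    float4* dst = reinterpret_cast<float4*>(out + (size_t)token * emb_dim);
    for (int i = threadIdx.x; i < emb_dim / 4; i += blockDim.x) {
        dst[i] = src[i];
    }
}
```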
- rope
- pytorch naive rope
- pytorch rope with cos/sin table
- rope fp32 version (an order of magnitude faster than the naive pytorch implementation)
- rope fp32x4 version (fp32 vectorized; a further small gain at larger sizes; see the sketch below)
- pytorch op bindings && diff check
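RoPE rotates each consecutive feature pair by a position-dependent angle. A hedged fp32 sketch for a [seq_len, head_dim] input laid out row-major, with interleaved pairs (the repo's layout and base frequency may differ):

```cuda
// Rotary position embedding: the pair (2d, 2d+1) at position p is rotated by
// theta = p * 10000^(-2d / head_dim). One thread per feature pair.
__global__ void rope_f32_kernel(const float* x, float* out,
                                int seq_len, int head_dim) {
    int pairs = head_dim / 2;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= seq_len * pairs) return;
    int p = idx / pairs;                       // sequence position
    int d = idx % pairs;                       // pair index within the head
    float theta = p * powf(10000.0f, -2.0f * d / head_dim);
    float c = cosf(theta), s = sinf(theta);
    float x0 = x[p * head_dim + 2 * d];
    float x1 = x[p * head_dim + 2 * d + 1];
    out[p * head_dim + 2 * d]     = x0 * c - x1 * s;
    out[p * head_dim + 2 * d + 1] = x0 * s + x1 * c;
}
```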
- reduce: based on warp shuffle add (see the sketch below)
- reduce_sum fp32/fp16 versions
- reduce_sum_fp16x2 (fp16 vectorized)
- reduce_sum_fp16x8_packed (fp16 vectorized, packed r/w)
- reduce_sum int8 version
- reduce_sum_i8x16_packed (int8 vectorized, packed r/w)
- reduce_sum_i8x16_packed (int8 vectorized, packed r/w, dp4a; faster than the plain torch implementation)
- reduce_sum_i8x64_packed (int8 vectorized, packed r/w, dp4a)
- pytorch op bindings && diff check
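The core of these reductions is the warp shuffle add: lanes exchange partial sums with __shfl_down_sync, so no shared memory is needed within a warp. A hedged sketch of a block-level reduce_sum built on it, assuming blockDim.x is a multiple of 32 and out is zero-initialized (the grid-level combine uses atomicAdd here for brevity):

```cuda
// Warp-level sum: each step halves the number of active lanes;
// lane 0 ends up holding the warp's total.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

__global__ void reduce_sum_f32_kernel(const float* x, float* out, int n) {
    __shared__ float warp_sums[32];                      // one slot per warp
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? x[i] : 0.0f;
    v = warp_reduce_sum(v);
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
    if (lane == 0) warp_sums[warp] = v;
    __syncthreads();
    if (warp == 0) {                                     // first warp reduces the partials
        v = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        v = warp_reduce_sum(v);
        if (lane == 0) atomicAdd(out, v);
    }
}
```

The int8 variants additionally pack four int8 values per 32-bit word and accumulate them with the __dp4a intrinsic.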
- dot_product
- dot_product fp32/fp16 versions
- dot_product_fp32x4 (fp32 vectorized)
- dot_product_fp16x2 (fp16 vectorized)
- dot_product_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- softmax
- safe online softmax fp32/fp16 versions (update rule sketched below)
- safe online softmax fp32x4 version (fp32 vectorized)
- safe online softmax fp16x8 version (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
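Safe online softmax keeps a running maximum m and a running denominator d = sum(exp(x - m)) in a single pass, rescaling d whenever m grows; partial states can also be merged, which is how warp- and block-level partials are combined. A hedged sketch of just the update and merge rules (the full kernels wrap these in the reduction pattern above):

```cuda
// Online softmax state: running max m and running sum of exp(x - m).
struct MD { float m; float d; };

// Fold one new element into the state, rescaling d if the max grows.
__device__ MD md_update(MD s, float x) {
    float m_new = fmaxf(s.m, x);
    float d_new = s.d * expf(s.m - m_new) + expf(x - m_new);
    return {m_new, d_new};
}

// Merge two partial states (used when combining warp/block partials).
__device__ MD md_merge(MD a, MD b) {
    float m_new = fmaxf(a.m, b.m);
    float d_new = a.d * expf(a.m - m_new) + b.d * expf(b.m - m_new);
    return {m_new, d_new};
}

// Final value for element x_i of the row: softmax(x_i) = expf(x_i - m) / d.
```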
- rmsnorm
- naive torch rmsnorm
- rmsnorm fp32/fp16 versions (see the sketch below)
- rmsnorm fp32x4 version (fp32 vectorized)
- rmsnorm_fp32x4_smem
- rmsnorm fp16x8 version (fp16 vectorized, packed r/w)
- rmsnorm_fp16x8_smem version (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
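RMSNorm scales each row x by rsqrt(mean(x^2) + eps) and then by a learned gain g. A hedged fp32 sketch with one block per row and a simple shared-memory tree reduction, assuming blockDim.x is a power of two (names are illustrative):

```cuda
// One block per row: y[i] = x[i] * rsqrt(mean(x^2) + eps) * g[i].
__global__ void rmsnorm_f32_kernel(const float* x, const float* g, float* y,
                                   int cols, float eps) {
    extern __shared__ float sdata[];
    const float* row = x + blockIdx.x * cols;
    float* out = y + blockIdx.x * cols;

    // Per-thread partial sum of squares over a strided slice of the row.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < cols; i += blockDim.x)
        sum += row[i] * row[i];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    float inv_rms = rsqrtf(sdata[0] / cols + eps);

    for (int i = threadIdx.x; i < cols; i += blockDim.x)
        out[i] = row[i] * inv_rms * g[i];
}

// Example launch (dynamic shared memory sized to one float per thread):
// rmsnorm_f32_kernel<<<rows, 256, 256 * sizeof(float)>>>(d_x, d_g, d_y, cols, 1e-6f);
```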
- transpose
- transpose_coalesced_read (coalesced reads from the input's point of view)
- transpose_coalesced_write (coalesced writes from the output's point of view)
- transpose_smem (shared-memory tile so both reads and writes are coalesced; see the sketch below)
- transpose_smem_bcf (shared memory, bank-conflict free)
- transpose_smem_packed_bcf (shared memory, bank-conflict free, float4 vectorized r/w)
- transpose_smem_swizzled_packed (shared memory, bank-conflict free via swizzling, float4 vectorized r/w)
- pytorch op bindings && diff check
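The shared-memory transpose variants stage a tile in shared memory so that both the global read and the global write are coalesced; padding the tile by one column is the classic way to remove bank conflicts (the swizzled variant achieves the same with an index permutation instead of padding). A hedged sketch of the padded-tile version, launched with a 32x32 thread block per tile:

```cuda
#define TILE 32

// Coalesced read of a TILE x TILE tile, coalesced write of its transpose.
// The +1 padding shifts each shared-memory row into a different bank, so the
// column-wise read-back during the write phase is conflict free.
__global__ void transpose_smem_kernel(const float* in, float* out,
                                      int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;   // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;  // column in the output
    int ty = blockIdx.x * TILE + threadIdx.y;  // row in the output
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
}

// Example launch: dim3 block(TILE, TILE);
//                 dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
```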