One Kernel a Day Keeps High Latency Away.
Welcome to your daily dose of CUDA programming! Vitamin-CUDA is a curated collection of hands-on CUDA practices, designed to take you from Hello World to High Performance. Whether you are a beginner looking to understand the grid-stride loop or an enthusiast diving into warp-level primitives, there's a kernel here for you.
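If the grid-stride loop is new to you, here is a minimal sketch of the pattern (the kernel name `scale_kernel` and the launch configuration are illustrative only, not kernels from this repo):

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
// so a single fixed-size launch covers any problem size n.
__global__ void scale_kernel(const float* x, float* y, float alpha, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        y[i] = alpha * x[i];
    }
}

// Example launch: the grid size is capped, and the loop (not the grid) covers n.
// scale_kernel<<<256, 256>>>(d_x, d_y, 2.0f, n);
```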
Let's get started and happy coding!
- NVIDIA GPU (Compute Capability 6.0+)
- CUDA Toolkit 11.0+
- C++ Compiler (GCC/Clang/MSVC)
- CMake 3.18+ (Optional, but recommended)
- PyTorch (for the extension examples, Python bindings, and performance comparisons)
I recommend using the NVIDIA PyTorch NGC Docker images for a quick start! Refer to https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
All kernels were tested on an RTX 5060 GPU (unless otherwise specified) and benchmarked against PyTorch 2.9.
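Each kernel group below ends with "pytorch op bindings && diff check". As a rough illustration of what such a binding can look like (the function names and signatures here are illustrative, not the repo's actual API), a kernel is wrapped in a launcher that takes torch::Tensor arguments and exposed through pybind11:

```cuda
// add_binding.cu -- illustrative sketch of binding a CUDA kernel as a PyTorch op.
#include <torch/extension.h>

__global__ void add_f32_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Launcher: validates inputs, allocates the output, and launches the kernel.
torch::Tensor add_f32(torch::Tensor a, torch::Tensor b) {
    TORCH_CHECK(a.is_cuda() && b.is_cuda(), "inputs must be CUDA tensors");
    TORCH_CHECK(a.numel() == b.numel(), "size mismatch");
    auto c = torch::empty_like(a);
    int n = static_cast<int>(a.numel());
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_f32_kernel<<<blocks, threads>>>(
        a.data_ptr<float>(), b.data_ptr<float>(), c.data_ptr<float>(), n);
    return c;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("add_f32", &add_f32, "elementwise add (fp32, CUDA)");
}
```

On the Python side such a module can be built and loaded with torch.utils.cpp_extension.load, and the diff check is typically a torch.testing.assert_close against the corresponding built-in PyTorch op.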
- elementwise: elementwise add
- elementwise_add fp32/fp16 versions
- elementwise_add_fp16x2 (fp16 vectorized; see the half2 sketch below)
- elementwise_add_fp16x8 (fp16 vectorized)
- elementwise_add_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
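The "fp16 vectorized" and "packed r/w" variants in this group, and in the activation groups that follow (sigmoid, swish, relu, relu6, elu, gelu, hardswish), all rely on the same idea: pack several fp16 values into one wide load/store and use the paired half2 intrinsics. A minimal sketch of the fp16x2 case, assuming n is even and the pointers are half2-aligned (names are illustrative):

```cuda
#include <cuda_fp16.h>

// Each thread handles two fp16 elements: one half2 load per operand,
// one __hadd2 (two fp16 adds in a single instruction), one half2 store.
__global__ void elementwise_add_f16x2_kernel(const half* a, const half* b,
                                             half* c, int n) {
    int idx = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (idx + 1 < n) {
        half2 ra = *reinterpret_cast<const half2*>(a + idx);
        half2 rb = *reinterpret_cast<const half2*>(b + idx);
        *reinterpret_cast<half2*>(c + idx) = __hadd2(ra, rb);
    }
}
```

The fp16x8 variants extend this to eight elements per thread, typically through 128-bit loads/stores so that global-memory reads and writes stay fully packed.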
- sigmoid
- sigmoid fp32/fp16 versions
- sigmoid_fp16x2 (fp16 vectorized)
- sigmoid_fp16x8 (fp16 vectorized)
- sigmoid_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- swish
- swish fp32/fp16 versions
- swish_fp16x2 (fp16 vectorized)
- swish_fp16x8 (fp16 vectorized)
- swish_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- relu
- relu fp32/fp16 versions
- relu_fp16x2 (fp16 vectorized)
- relu_fp16x8 (fp16 vectorized)
- relu_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- relu6
- relu6 fp32/fp16 versions
- relu6_fp16x2 (fp16 vectorized)
- relu6_fp16x8 (fp16 vectorized)
- relu6_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- elu
- elu fp32/fp16 versions
- elu_fp16x2 (fp16 vectorized)
- elu_fp16x8 (fp16 vectorized)
- elu_fp16x8 (fp16 vectorized, packed r/w; roughly a 2x gain from half2)
- pytorch op bindings && diff check
- gelu
- gelu fp32/fp16 versions
- gelu_fp16x2 (fp16 vectorized)
- gelu_fp16x8 (fp16 vectorized)
- gelu_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- hardswish
- hardswish fp32/fp16 versions
- hardswish_fp16x2 (fp16 vectorized)
- hardswish_fp16x8 (fp16 vectorized)
- hardswish_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- embedding
- embedding fp32/fp16 versions
- embedding_fp32x4 (fp32 vectorized)
- embedding_fp32x4 (fp32 vectorized, packed r/w; see the gather sketch below)
- embedding_fp16x2 (fp16 vectorized)
- embedding_fp16x8 (fp16 vectorized)
- embedding_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
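The embedding kernels are row gathers: each output row copies the weight row selected by a token index. A hedged fp32x4 sketch, assuming emb_dim is a multiple of 4 and the pointers are 16-byte aligned (argument names are illustrative):

```cuda
// One block per token: the block's threads copy the selected embedding row
// with float4 (128-bit) loads/stores so global-memory traffic stays packed.
__global__ void embedding_f32x4_kernel(const int* indices, const float* weight,
                                       float* out, int emb_dim) {
    int token = blockIdx.x;           // which output row we are filling
    int row = indices[token];         // which weight row to gather
    const float4* src = reinterpret_cast<const float4*>(weight + (size_t)row * emb_dim);
    float4* dst = reinterpret_cast<float4*>(out + (size_t)token * emb_dim);
    for (int i = threadIdx.x; i < emb_dim / 4; i += blockDim.x) {
        dst[i] = src[i];
    }
}
```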
- rope
- pytorch naive rope
- pytorch rope with cos/sin table
- rope fp32 version (an order of magnitude faster than the naive pytorch implementation)
- rope fp32x4 version (fp32 vectorized; a further small gain at larger sizes; see the sketch below)
- pytorch op bindings && diff check
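RoPE rotates each consecutive feature pair by a position-dependent angle. A hedged fp32 sketch for a [seq_len, head_dim] input laid out row-major, with interleaved pairs (the repo's layout and base frequency may differ):

```cuda
// Rotary position embedding: the pair (2d, 2d+1) at position p is rotated by
// theta = p * 10000^(-2d / head_dim). One thread per feature pair.
__global__ void rope_f32_kernel(const float* x, float* out,
                                int seq_len, int head_dim) {
    int pairs = head_dim / 2;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= seq_len * pairs) return;
    int p = idx / pairs;                       // sequence position
    int d = idx % pairs;                       // pair index within the head
    float theta = p * powf(10000.0f, -2.0f * d / head_dim);
    float c = cosf(theta), s = sinf(theta);
    float x0 = x[p * head_dim + 2 * d];
    float x1 = x[p * head_dim + 2 * d + 1];
    out[p * head_dim + 2 * d]     = x0 * c - x1 * s;
    out[p * head_dim + 2 * d + 1] = x0 * s + x1 * c;
}
```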
- reduce: based on warp shuffle add (see the sketch below)
- reduce_sum fp32/fp16 versions
- reduce_sum_fp16x2 (fp16 vectorized)
- reduce_sum_fp16x8_packed (fp16 vectorized, packed r/w)
- reduce_sum int8 version
- reduce_sum_i8x16_packed (int8 vectorized, packed r/w)
- reduce_sum_i8x16_packed (int8 vectorized, packed r/w, dp4a; faster than the plain torch implementation)
- reduce_sum_i8x64_packed (int8 vectorized, packed r/w, dp4a)
- pytorch op bindings && diff check
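The core of these reductions is the warp shuffle add: lanes exchange partial sums with __shfl_down_sync, so no shared memory is needed within a warp. A hedged sketch of a block-level reduce_sum built on it, assuming blockDim.x is a multiple of 32 and out is zero-initialized (the grid-level combine uses atomicAdd here for brevity):

```cuda
// Warp-level sum: each step halves the number of active lanes;
// lane 0 ends up holding the warp's total.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

__global__ void reduce_sum_f32_kernel(const float* x, float* out, int n) {
    __shared__ float warp_sums[32];                      // one slot per warp
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? x[i] : 0.0f;
    v = warp_reduce_sum(v);
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
    if (lane == 0) warp_sums[warp] = v;
    __syncthreads();
    if (warp == 0) {                                     // first warp reduces the partials
        v = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        v = warp_reduce_sum(v);
        if (lane == 0) atomicAdd(out, v);
    }
}
```

The int8 variants additionally pack four int8 values per 32-bit word and accumulate them with the __dp4a intrinsic.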
- dot_product
- dot_product fp32/fp16 versions
- dot_product_fp32x4 (fp32 vectorized)
- dot_product_fp16x2 (fp16 vectorized)
- dot_product_fp16x8 (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
- softmax
- safe online softmax fp32/fp16 versions (update rule sketched below)
- safe online softmax fp32x4 version (fp32 vectorized)
- safe online softmax fp16x8 version (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
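Safe online softmax keeps a running maximum m and a running denominator d = sum(exp(x - m)) in a single pass, rescaling d whenever m grows; partial states can also be merged, which is how warp- and block-level partials are combined. A hedged sketch of just the update and merge rules (the full kernels wrap these in the reduction pattern above):

```cuda
// Online softmax state: running max m and running sum of exp(x - m).
struct MD { float m; float d; };

// Fold one new element into the state, rescaling d if the max grows.
__device__ MD md_update(MD s, float x) {
    float m_new = fmaxf(s.m, x);
    float d_new = s.d * expf(s.m - m_new) + expf(x - m_new);
    return {m_new, d_new};
}

// Merge two partial states (used when combining warp/block partials).
__device__ MD md_merge(MD a, MD b) {
    float m_new = fmaxf(a.m, b.m);
    float d_new = a.d * expf(a.m - m_new) + b.d * expf(b.m - m_new);
    return {m_new, d_new};
}

// Final value for element x_i of the row: softmax(x_i) = expf(x_i - m) / d.
```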
- rmsnorm
- naive torch rmsnorm
- rmsnorm fp32/fp16 versions (see the sketch below)
- rmsnorm fp32x4 version (fp32 vectorized)
- rmsnorm_fp32x4_smem
- rmsnorm fp16x8 version (fp16 vectorized, packed r/w)
- rmsnorm_fp16x8_smem version (fp16 vectorized, packed r/w)
- pytorch op bindings && diff check
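RMSNorm scales each row x by rsqrt(mean(x^2) + eps) and then by a learned gain g. A hedged fp32 sketch with one block per row and a simple shared-memory tree reduction, assuming blockDim.x is a power of two (names are illustrative):

```cuda
// One block per row: y[i] = x[i] * rsqrt(mean(x^2) + eps) * g[i].
__global__ void rmsnorm_f32_kernel(const float* x, const float* g, float* y,
                                   int cols, float eps) {
    extern __shared__ float sdata[];
    const float* row = x + blockIdx.x * cols;
    float* out = y + blockIdx.x * cols;

    // Per-thread partial sum of squares over a strided slice of the row.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < cols; i += blockDim.x)
        sum += row[i] * row[i];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    float inv_rms = rsqrtf(sdata[0] / cols + eps);

    for (int i = threadIdx.x; i < cols; i += blockDim.x)
        out[i] = row[i] * inv_rms * g[i];
}

// Example launch (dynamic shared memory sized to one float per thread):
// rmsnorm_f32_kernel<<<rows, 256, 256 * sizeof(float)>>>(d_x, d_g, d_y, cols, 1e-6f);
```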
- transpose
- transpose_coalesced_read (coalesced reads from the input's point of view)
- transpose_coalesced_write (coalesced writes from the output's point of view)
- transpose_smem (shared-memory tile so both reads and writes are coalesced; see the sketch below)
- transpose_smem_bcf (shared memory, bank-conflict free)
- transpose_smem_packed_bcf (shared memory, bank-conflict free, float4 vectorized r/w)
- transpose_smem_swizzled_packed (shared memory, bank-conflict free via swizzling, float4 vectorized r/w)
- pytorch op bindings && diff check
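The shared-memory transpose variants stage a tile in shared memory so that both the global read and the global write are coalesced; padding the tile by one column is the classic way to remove bank conflicts (the swizzled variant achieves the same with an index permutation instead of padding). A hedged sketch of the padded-tile version, launched with a 32x32 thread block per tile:

```cuda
#define TILE 32

// Coalesced read of a TILE x TILE tile, coalesced write of its transpose.
// The +1 padding shifts each shared-memory row into a different bank, so the
// column-wise read-back during the write phase is conflict free.
__global__ void transpose_smem_kernel(const float* in, float* out,
                                      int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;   // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;  // column in the output
    int ty = blockIdx.x * TILE + threadIdx.y;  // row in the output
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
}

// Example launch: dim3 block(TILE, TILE);
//                 dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
```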