Pinned repositories
- GEMM-FP16 (CUDA): Experimental FP16 GEMM kernels (mma, async loads, stages, block swizzling, etc.). Performance on par with cuBLAS; the block-swizzling idea is sketched in Triton after this list.
- triton-flashattn (Jupyter Notebook): Flash attention forward and backward kernels (with causal masking) in Triton. Performance comparable with Torch SDPA; the online-softmax core is sketched below.
- CUTE-GEMM-FP16 (CUDA): Optimized FP16 GEMM kernel built with CuTe. Outperforms cublasLt in specific cases.
- triton-fused-ops (Jupyter Notebook): Optimized kernels (layernorm, dropout) that fuse multiple operations into a single pass. Outperforms PyTorch; a minimal fused kernel is sketched below.
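
The GEMM repos above are written in CUDA/CuTe, but the block-swizzling trick they mention translates directly to Triton (the one language used for all sketches below). This is a minimal sketch of the idea, not code from either repo, and every name in it is made up: the linear program id is remapped so that groups of GROUP_M row-blocks sweep the N dimension together, improving L2 reuse of B tiles; tl.dot lowers to tensor-core mma instructions, and Triton's num_stages pipelining stands in for the hand-written async loads and stages.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gemm_fp16_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
    GROUP_M: tl.constexpr,
):
    # Block swizzling: remap the linear program id so GROUP_M row-blocks
    # sweep the N dimension together, improving L2 reuse of B tiles.
    pid = tl.program_id(0)
    num_pid_m = tl.cdiv(M, BLOCK_M)
    num_pid_n = tl.cdiv(N, BLOCK_N)
    num_pid_in_group = GROUP_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_M)
    pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m

    # Wrap row/col offsets with % so edge tiles never read out of bounds.
    offs_m = (pid_m * BLOCK_M + tl.arange(0, BLOCK_M)) % M
    offs_n = (pid_n * BLOCK_N + tl.arange(0, BLOCK_N)) % N
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    # Accumulate in fp32 over K tiles; only the K tail needs masking.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_K, other=0.0)
        acc += tl.dot(a, b)  # lowers to tensor-core mma instructions
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    offs_cm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_cn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    c_ptrs = c_ptr + offs_cm[:, None] * stride_cm + offs_cn[None, :] * stride_cn
    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
    tl.store(c_ptrs, acc.to(tl.float16), mask=c_mask)

def gemm_fp16(a, b):
    """a: (M, K), b: (K, N) fp16 CUDA tensors; returns fp16 (M, N)."""
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64) * triton.cdiv(N, 64),)
    # num_stages controls software pipelining -- the async-copy "stages"
    # that a hand-written CUDA kernel manages explicitly.
    gemm_fp16_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32, GROUP_M=8,
        num_stages=3, num_warps=4,
    )
    return c
```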
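For triton-flashattn, the core idea is the online-softmax recurrence: scores are processed one K/V block at a time with a running row max and a rescaled accumulator, so the full seq-by-seq attention matrix is never materialized. Below is an illustrative plain-PyTorch version of the causal forward pass, not the repo's Triton kernel; the function name is hypothetical.

```python
import torch

def flash_attn_forward_reference(q, k, v, block_size=128):
    """Single-head causal attention in O(seq) extra memory; q, k, v: (seq, d)."""
    seq, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq, 1), float("-inf"), device=q.device, dtype=q.dtype)
    row_sum = torch.zeros((seq, 1), device=q.device, dtype=q.dtype)
    rows = torch.arange(seq, device=q.device).unsqueeze(1)
    for start in range(0, seq, block_size):
        kb = k[start:start + block_size]               # (B, d) key block
        vb = v[start:start + block_size]               # (B, d) value block
        s = (q @ kb.T) * scale                         # (seq, B) partial scores
        cols = torch.arange(start, start + kb.shape[0], device=q.device)
        s = s.masked_fill(cols > rows, float("-inf"))  # causal mask
        # Online softmax: bump the running max, rescale prior accumulation.
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum  # normalize once at the end

# Sanity check against PyTorch's fused SDPA:
# q, k, v = (torch.randn(512, 64) for _ in range(3))
# ref = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
# assert torch.allclose(flash_attn_forward_reference(q, k, v), ref, atol=1e-5)
```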
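And for triton-fused-ops, fusing layernorm with dropout means the normalized activations never make a round trip through global memory, and the dropout mask is generated on the fly from a counter-based RNG rather than materialized as a tensor. A minimal sketch under those assumptions (names hypothetical, not the repo's kernel):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def layernorm_dropout_kernel(
    x_ptr, w_ptr, b_ptr, out_ptr,
    n_cols, eps, p_drop, seed,
    BLOCK_SIZE: tl.constexpr,
):
    # One program normalizes and applies dropout to one row, entirely in
    # registers -- no intermediate tensor is written between the two ops.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    offs = row * n_cols + cols
    x = tl.load(x_ptr + offs, mask=mask, other=0.0).to(tl.float32)

    # LayerNorm statistics, computed in fp32 for stability.
    mean = tl.sum(x, axis=0) / n_cols
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / n_cols
    x_hat = diff / tl.sqrt(var + eps)

    w = tl.load(w_ptr + cols, mask=mask, other=1.0).to(tl.float32)
    b = tl.load(b_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x_hat * w + b

    # Fused dropout: the keep-mask comes from an RNG keyed on
    # (seed, element offset), so it is never stored in memory.
    keep = tl.rand(seed, offs) > p_drop
    y = tl.where(keep, y / (1.0 - p_drop), 0.0)
    tl.store(out_ptr + offs, y, mask=mask)

def layernorm_dropout(x, weight, bias, eps=1e-5, p_drop=0.1, seed=42):
    """x: contiguous (n_rows, n_cols) CUDA tensor; one kernel launch total."""
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    layernorm_dropout_kernel[(n_rows,)](
        x, weight, bias, out, n_cols, eps, p_drop, seed, BLOCK_SIZE=BLOCK_SIZE
    )
    return out
```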
