Pinned repositories
- GEMM-FP16 (CUDA): Experimental FP16 GEMM kernels (mma, async loads, stages, block swizzling, etc.). Performance on par with cuBLAS; the block-swizzling idea is sketched in Triton after this list.
- triton-flashattn (Jupyter Notebook): Flash attention forward and backward kernels (with causal masking) in Triton. Performance comparable with Torch SDPA; the online-softmax core is sketched below.
- CUTE-GEMM-FP16 (CUDA): Optimized FP16 GEMM kernel built with CuTe. Outperforms cublasLt in specific cases.
- triton-fused-ops (Jupyter Notebook): Optimized kernels (layernorm, dropout) that fuse multiple operations into a single pass. Outperforms PyTorch; a minimal fused kernel is sketched below.
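
The GEMM repos above are written in CUDA/CuTe, but the block-swizzling trick they mention translates directly to Triton (the one language used for all sketches below). This is a minimal sketch of the idea, not code from either repo, and every name in it is made up: the linear program id is remapped so that groups of GROUP_M row-blocks sweep the N dimension together, improving L2 reuse of B tiles; tl.dot lowers to tensor-core mma instructions, and Triton's num_stages pipelining stands in for the hand-written async loads and stages.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gemm_fp16_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
    GROUP_M: tl.constexpr,
):
    # Block swizzling: remap the linear program id so GROUP_M row-blocks
    # sweep the N dimension together, improving L2 reuse of B tiles.
    pid = tl.program_id(0)
    num_pid_m = tl.cdiv(M, BLOCK_M)
    num_pid_n = tl.cdiv(N, BLOCK_N)
    num_pid_in_group = GROUP_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_M)
    pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m

    # Wrap row/col offsets with % so edge tiles never read out of bounds.
    offs_m = (pid_m * BLOCK_M + tl.arange(0, BLOCK_M)) % M
    offs_n = (pid_n * BLOCK_N + tl.arange(0, BLOCK_N)) % N
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    # Accumulate in fp32 over K tiles; only the K tail needs masking.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_K, other=0.0)
        acc += tl.dot(a, b)  # lowers to tensor-core mma instructions
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    offs_cm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_cn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    c_ptrs = c_ptr + offs_cm[:, None] * stride_cm + offs_cn[None, :] * stride_cn
    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
    tl.store(c_ptrs, acc.to(tl.float16), mask=c_mask)

def gemm_fp16(a, b):
    """a: (M, K), b: (K, N) fp16 CUDA tensors; returns fp16 (M, N)."""
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64) * triton.cdiv(N, 64),)
    # num_stages controls software pipelining -- the async-copy "stages"
    # that a hand-written CUDA kernel manages explicitly.
    gemm_fp16_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32, GROUP_M=8,
        num_stages=3, num_warps=4,
    )
    return c
```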
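For triton-flashattn, the core idea is the online-softmax recurrence: scores are processed one K/V block at a time with a running row max and a rescaled accumulator, so the full seq-by-seq attention matrix is never materialized. Below is an illustrative plain-PyTorch version of the causal forward pass, not the repo's Triton kernel; the function name is hypothetical.

```python
import torch

def flash_attn_forward_reference(q, k, v, block_size=128):
    """Single-head causal attention in O(seq) extra memory; q, k, v: (seq, d)."""
    seq, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq, 1), float("-inf"), device=q.device, dtype=q.dtype)
    row_sum = torch.zeros((seq, 1), device=q.device, dtype=q.dtype)
    rows = torch.arange(seq, device=q.device).unsqueeze(1)
    for start in range(0, seq, block_size):
        kb = k[start:start + block_size]               # (B, d) key block
        vb = v[start:start + block_size]               # (B, d) value block
        s = (q @ kb.T) * scale                         # (seq, B) partial scores
        cols = torch.arange(start, start + kb.shape[0], device=q.device)
        s = s.masked_fill(cols > rows, float("-inf"))  # causal mask
        # Online softmax: bump the running max, rescale prior accumulation.
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum  # normalize once at the end

# Sanity check against PyTorch's fused SDPA:
# q, k, v = (torch.randn(512, 64) for _ in range(3))
# ref = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
# assert torch.allclose(flash_attn_forward_reference(q, k, v), ref, atol=1e-5)
```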
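And for triton-fused-ops, fusing layernorm with dropout means the normalized activations never make a round trip through global memory, and the dropout mask is generated on the fly from a counter-based RNG rather than materialized as a tensor. A minimal sketch under those assumptions (names hypothetical, not the repo's kernel):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def layernorm_dropout_kernel(
    x_ptr, w_ptr, b_ptr, out_ptr,
    n_cols, eps, p_drop, seed,
    BLOCK_SIZE: tl.constexpr,
):
    # One program normalizes and applies dropout to one row, entirely in
    # registers -- no intermediate tensor is written between the two ops.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    offs = row * n_cols + cols
    x = tl.load(x_ptr + offs, mask=mask, other=0.0).to(tl.float32)

    # LayerNorm statistics, computed in fp32 for stability.
    mean = tl.sum(x, axis=0) / n_cols
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / n_cols
    x_hat = diff / tl.sqrt(var + eps)

    w = tl.load(w_ptr + cols, mask=mask, other=1.0).to(tl.float32)
    b = tl.load(b_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x_hat * w + b

    # Fused dropout: the keep-mask comes from an RNG keyed on
    # (seed, element offset), so it is never stored in memory.
    keep = tl.rand(seed, offs) > p_drop
    y = tl.where(keep, y / (1.0 - p_drop), 0.0)
    tl.store(out_ptr + offs, y, mask=mask)

def layernorm_dropout(x, weight, bias, eps=1e-5, p_drop=0.1, seed=42):
    """x: contiguous (n_rows, n_cols) CUDA tensor; one kernel launch total."""
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    layernorm_dropout_kernel[(n_rows,)](
        x, weight, bias, out, n_cols, eps, p_drop, seed, BLOCK_SIZE=BLOCK_SIZE
    )
    return out
```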
