annp0
Pinned

  1. GEMM-FP16

    Experimental FP16 GEMM kernels (mma, async loads, stages, block swizzling, etc.). Performance on par with cuBLAS.

    Cuda

  2. triton-flashattn

    Flash attention forward and backward kernels (with causal masking) in Triton. Performance comparable to PyTorch SDPA (see the baseline sketch after this list).

    Jupyter Notebook

  3. CUTE-GEMM-FP16

    Optimized FP16 GEMM kernel built with CuTe. Outperforms cuBLASLt in specific cases.

    Cuda

  4. triton-fused-ops

    Optimized kernels (LayerNorm, dropout) built by fusing operations. Outperforms PyTorch (a fused LayerNorm sketch follows this list).

    Jupyter Notebook

  5. Kernel98

    A small kernel with copy-on-write, demand paging, and buffer caches.

    C

  6. flash-attention

    Cuda
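
For the triton-flashattn comparison above, the PyTorch SDPA baseline it is measured against can be exercised roughly as follows. This is a minimal sketch: the tensor shapes, dtype, and device are illustrative assumptions, not values taken from the repository.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim), fp16 on GPU,
# matching a typical flash-attention benchmark setup.
q = torch.randn(4, 16, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch's fused scaled-dot-product attention (SDPA) baseline,
# with causal masking enabled as in the Triton kernel's forward pass.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```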
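
For triton-fused-ops, the sketch below shows what a fused LayerNorm forward kernel in Triton generally looks like: mean, variance, normalization, and the affine transform all happen in one kernel, so no intermediate tensors round-trip through global memory. Kernel and function names, the one-program-per-row layout, and the contiguous-input assumption are illustrative; the repository's actual kernels may be organized differently.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def layernorm_fwd(x_ptr, w_ptr, b_ptr, out_ptr, N, eps, BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row of a contiguous (M, N) input.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < N
    x = tl.load(x_ptr + row * N + cols, mask=mask, other=0.0).to(tl.float32)

    # Mean and variance are computed in registers; nothing intermediate is
    # written back to global memory, which is where the fusion win comes from.
    mean = tl.sum(x, axis=0) / N
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / N
    x_hat = (x - mean) / tl.sqrt(var + eps)

    # Affine transform fused into the same kernel.
    w = tl.load(w_ptr + cols, mask=mask, other=1.0).to(tl.float32)
    b = tl.load(b_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * N + cols, x_hat * w + b, mask=mask)


def layernorm(x, weight, bias, eps=1e-5):
    # Launch one program per row; BLOCK_SIZE must cover the row width.
    M, N = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(N)
    layernorm_fwd[(M,)](x, weight, bias, out, N, eps, BLOCK_SIZE=BLOCK_SIZE)
    return out
```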