Decoding Attention is specially optimized for multi-head attention (MHA), using CUDA cores for the decoding stage of LLM inference.
gpu cuda inference nvidia mha multi-head-attention llm large-language-model flash-attention cuda-core decoding-attention flashinfer
Updated Feb 26, 2025 · C++
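The sketch below illustrates what decoding-stage MHA computes: a single query token per head attending to a cached K/V of length `seq_len`. It is not the Decoding Attention kernel itself; the kernel name, memory layout, and launch parameters are illustrative assumptions only.

```cuda
// Minimal sketch of decoding-stage multi-head attention on CUDA cores.
// Assumed layouts: q[num_heads][head_dim], k/v[num_heads][seq_len][head_dim],
// out[num_heads][head_dim]. One block per head.
#include <cuda_runtime.h>
#include <math.h>

__global__ void single_query_mha_kernel(const float* __restrict__ q,
                                        const float* __restrict__ k,
                                        const float* __restrict__ v,
                                        float* __restrict__ out,
                                        int seq_len, int head_dim) {
    extern __shared__ float scores[];            // one score per key position
    const int head = blockIdx.x;
    const float scale = rsqrtf((float)head_dim);

    const float* qh = q + (size_t)head * head_dim;
    const float* kh = k + (size_t)head * seq_len * head_dim;
    const float* vh = v + (size_t)head * seq_len * head_dim;

    // 1) Scaled dot-product scores: each thread handles a strided set of keys.
    for (int pos = threadIdx.x; pos < seq_len; pos += blockDim.x) {
        float dot = 0.f;
        for (int d = 0; d < head_dim; ++d)
            dot += qh[d] * kh[pos * head_dim + d];
        scores[pos] = dot * scale;
    }
    __syncthreads();

    // 2) Softmax statistics (computed redundantly per thread for brevity).
    float max_s = -INFINITY;
    for (int pos = 0; pos < seq_len; ++pos) max_s = fmaxf(max_s, scores[pos]);
    float sum = 0.f;
    for (int pos = 0; pos < seq_len; ++pos) sum += expf(scores[pos] - max_s);

    // 3) Weighted sum of values: each thread handles a strided set of dims.
    for (int d = threadIdx.x; d < head_dim; d += blockDim.x) {
        float acc = 0.f;
        for (int pos = 0; pos < seq_len; ++pos)
            acc += (expf(scores[pos] - max_s) / sum) * vh[pos * head_dim + d];
        out[head * head_dim + d] = acc;
    }
}

// Example launch (assumed shapes): one block per head, dynamic shared memory
// sized to hold seq_len scores.
// single_query_mha_kernel<<<num_heads, 128, seq_len * sizeof(float)>>>(
//     d_q, d_k, d_v, d_out, seq_len, head_dim);
```

Because the decoding step processes only one query token, the work is dominated by reading the K/V cache rather than by matrix multiplication, which is why CUDA cores (rather than tensor cores) are a reasonable fit for this stage.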