Open
🚀 The feature, motivation and pitch
TRT-LLM has a SwapAB kernel for the KV projection (KV proj) in DeepSeek-R1 (DSR1). We should integrate it, in collaboration with the FlashInfer team.
Current situation: we run CUTLASS Block FP8 for KV proj because upstream DeepGEMM does not support it:
- the CUTLASS Block Fp8 kernels are ~1/2 the speed of DeepGEMM for other Linear layers
- the CUTLASS Block Fp8 kernels have padding overhead
Note: in this example, we are using DeepGEMM for the routed experts. In the final implementation, these will run with Triton / DeepGEMM Swap AB TEP.
Alternatives
- Improve the CUTLASS implementation. IIUC, recent versions support swap AB and have an option for no padding.
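For readers unfamiliar with the term, here is a minimal sketch of the identity behind "swap AB", not the TRT-LLM or CUTLASS kernel itself. For `C = A @ B` with a small number of rows in `A` (few tokens), the kernel can instead compute `C^T = B^T @ A^T`, so the small token dimension lands on the other axis of the matrix-multiply tile, which on recent GPUs typically allows finer-grained tile sizes. The shapes, values, and function names below are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Naive matrix product of nested lists: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a nested-list matrix."""
    return [list(r) for r in zip(*m)]


def matmul_swap_ab(a, b):
    """Compute a @ b as (b^T @ a^T)^T, i.e. with the operand roles swapped."""
    return transpose(matmul(transpose(b), transpose(a)))


# Hypothetical shapes: M=2 tokens (small), K=3 hidden size, N=4 output features.
x = [[1, 2, 3], [4, 5, 6]]                       # activations, M x K
w = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1]]   # weights, K x N

# Both orderings produce the same M x N result.
assert matmul_swap_ab(x, w) == matmul(x, w)
```

The result is unchanged; only the mapping of the problem dimensions onto the hardware tile differs, which is why a swap AB kernel can avoid padding a small token count up to a large tile.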
Additional context
cc @mgoin @yewentao256 @LucasWilkinson @MatthewBonanni @alexm-redhat
