[Feature][Kernel]: DeepSeek-R1 KV Proj is Too Slow for TP #28427

@robertgshaw2-redhat

🚀 The feature, motivation and pitch

TRT-LLM has a SwapAB kernel for the DeepSeek-R1 (DSR1) KV proj. We should integrate it by collaborating with the FlashInfer team.

Current situation: we run CUTLASS Block FP8 for the KV proj because upstream DeepGEMM does not support it.

  • the CUTLASS Block FP8 kernels run at roughly half the speed of DeepGEMM on other Linear layers
  • the CUTLASS Block FP8 kernels have padding overhead
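To give a rough feel for why padding overhead matters at small batch sizes, here is a minimal sketch of the arithmetic. It assumes the padding is applied along the token (M) dimension and uses a tile size of 128; both are illustrative assumptions, not values taken from the CUTLASS kernels or from vLLM. The helper names are hypothetical.

```python
def padded_rows(m: int, tile_m: int) -> int:
    """Round m up to the next multiple of tile_m (integer ceiling division)."""
    return -(-m // tile_m) * tile_m


def wasted_fraction(m: int, tile_m: int) -> float:
    """Fraction of the padded rows that are pure padding."""
    p = padded_rows(m, tile_m)
    return (p - m) / p


if __name__ == "__main__":
    TILE_M = 128  # assumed tile size, for illustration only
    for m in (1, 8, 32, 128):
        print(f"M={m:4d} padded={padded_rows(m, TILE_M):4d} "
              f"wasted={wasted_fraction(m, TILE_M):.3f}")
```

Under these assumptions, a decode step with 32 tokens would spend three quarters of the padded rows on padding, which is the kind of overhead a no-padding or swap AB path is meant to avoid.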

Example trace:
[Image: example trace]

Note: in this example, we are using DeepGEMM for the Routed Expert. In the final implementation, this will run with Triton / DeepGEMM Swap AB TEP.

Alternatives

  • Improve the CUTLASS implementation. IIUC, recent versions have swap AB and an option for no padding.
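For readers unfamiliar with the swap AB idea, the sketch below shows only the underlying identity, (A·B) = (Bᵀ·Aᵀ)ᵀ, in plain Python. It is not the TRT-LLM, CUTLASS, or DeepGEMM kernel; the function names are hypothetical. The point is that when M (the token count) is small, swapping the operands moves the small dimension from the rows of the output to its columns, so a kernel that tiles coarsely along rows no longer has to pad M.

```python
def matmul(a, b):
    """Naive matrix product of a (M x K) and b (K x N), as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(r) for r in zip(*m)]


def matmul_swap_ab(a, b):
    """Compute a @ b as (b^T @ a^T)^T, i.e. with the operands swapped."""
    return transpose(matmul(transpose(b), transpose(a)))


if __name__ == "__main__":
    a = [[1, 2, 3]]                 # M=1, K=3 (a single token)
    b = [[1, 0], [0, 1], [2, 2]]    # K=3, N=2 (weights)
    print(matmul(a, b))             # regular order: output is M x N
    print(matmul_swap_ab(a, b))     # swapped order: inner GEMM is N x M
```

Both calls return the same M x N result; only the shape of the inner GEMM differs.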

Additional context

cc @mgoin @yewentao256 @LucasWilkinson @MatthewBonanni @alexm-redhat
