[Feature][Kernel]: DeepSeek-R1 KV Proj is Too Slow for TP

### 🚀 The feature, motivation and pitch
TRT-LLM has a SwapAB kernel for the DeepSeek-R1 (DSR1) KV proj. We should integrate it by collaborating with the FlashInfer team.
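
For context, a rough sketch of the swap AB idea, assuming it means the usual trick of computing `A @ B` as `(B^T @ A^T)^T` so that the two operands trade roles inside the kernel. This is plain Python for illustration only, not the TRT-LLM or FlashInfer kernel; `matmul`, `transpose`, and `matmul_swap_ab` are hypothetical helper names, and the shapes are made up.

```python
def matmul(a, b):
    """Naive (M x K) @ (K x N) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a nested-list matrix."""
    return [list(r) for r in zip(*m)]


def matmul_swap_ab(a, b):
    """Compute a @ b as (b^T @ a^T)^T, i.e. with the operand roles swapped."""
    return transpose(matmul(transpose(b), transpose(a)))


# Hypothetical shapes: M=2 tokens, K=3 hidden, N=4 outputs.
x = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0, 2, 1], [0, 1, 1, 0], [3, 1, 0, 2]]

# Both orderings give the same result; only the layout seen by the kernel differs.
assert matmul(x, w) == matmul_swap_ab(x, w)
```

The result is identical either way. The potential win is that the small token dimension lands on whichever kernel dimension tolerates small sizes better, which depends on the hardware and the kernel.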

Current situation: we run CUTLASS Block FP8 for KV proj because upstream DeepGEMM does not support it. This has two costs:
- the CUTLASS Block FP8 kernels run at ~1/2 the speed of DeepGEMM on the other Linear layers
- the CUTLASS Block FP8 kernels have padding overhead
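
To give a feel for why padding can matter at small batch sizes, here is a back-of-the-envelope sketch. It assumes the overhead comes from rounding a dimension up to a fixed multiple; the issue does not say where the padding is applied or what the multiple is, so `padded_size`, `wasted_fraction`, and the multiples 64 and 8 are purely hypothetical.

```python
import math


def padded_size(m, multiple):
    """Round m up to the next multiple of `multiple`."""
    return math.ceil(m / multiple) * multiple


def wasted_fraction(m, multiple):
    """Fraction of the padded dimension that holds padding rather than real rows."""
    p = padded_size(m, multiple)
    return (p - m) / p


# Hypothetical row counts and padding multiples, for illustration only.
for m in (1, 8, 32, 64, 100):
    print(f"m={m:>3}  multiple=64: {wasted_fraction(m, 64):.3f}  "
          f"multiple=8: {wasted_fraction(m, 8):.3f}")
```

Under these assumptions, the wasted share is largest when the real dimension is much smaller than the padding multiple, and it shrinks as the multiple gets finer.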

Example Trace:
<img width="1303" height="352" alt="Image" src="https://github.com/user-attachments/assets/dd3baaae-662a-48f6-8add-b99e17abb936" />
> Note: this example uses DeepGEMM for the Routed Expert. The final implementation will run it with Triton / DeepGEMM Swap AB TEP.


### Alternatives

- Improve the CUTLASS implementation. IIUC, recent versions have swap AB and a no-padding option.

### Additional context

cc @mgoin @yewentao256 @LucasWilkinson @MatthewBonanni @alexm-redhat 

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Feature][Kernel]: DeepSeek-R1 KV Proj is Too Slow for TP #28427
