DeepSeek's FlashMLA kernel represents a significant leap forward in AI inference efficiency. Optimized for NVIDIA's Hopper GPU architecture, it achieves remarkable performance metrics (a back-of-the-envelope memory sketch follows the list):
- 3000 GB/s memory bandwidth
- 580 TFLOPS computational throughput on H800 GPUs
- 40-60% reduced memory consumption compared to traditional attention mechanisms
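To give a sense of where MLA's memory savings come from, here is a minimal back-of-the-envelope sketch comparing per-token KV-cache size under standard multi-head attention versus MLA's compressed latent cache. The dimensions below (128 heads, head dimension 128, latent rank 512, decoupled RoPE dimension 64) follow the published DeepSeek-V3 configuration and are assumptions for illustration only; the exact savings depend on the model and on what else occupies memory.

```python
# Back-of-the-envelope per-token KV-cache comparison:
# standard multi-head attention (MHA) vs. MLA's compressed latent cache.
# Dimensions follow the published DeepSeek-V3 configuration; adjust for
# other models. This is an illustration, not a measurement.

NUM_HEADS = 128      # attention heads
HEAD_DIM = 128       # per-head dimension
KV_LORA_RANK = 512   # MLA compressed latent dimension
ROPE_HEAD_DIM = 64   # decoupled RoPE key dimension
BYTES_PER_ELEM = 2   # BF16 storage

# MHA caches a full key and value vector for every head.
mha_bytes = 2 * NUM_HEADS * HEAD_DIM * BYTES_PER_ELEM

# MLA caches one shared compressed latent plus one RoPE key per token.
mla_bytes = (KV_LORA_RANK + ROPE_HEAD_DIM) * BYTES_PER_ELEM

print(f"MHA KV cache: {mha_bytes} bytes/token")   # 65536
print(f"MLA KV cache: {mla_bytes} bytes/token")   # 1152
print(f"KV-cache reduction: {1 - mla_bytes / mha_bytes:.1%}")
```

The KV-cache-specific reduction in this sketch is much larger than the overall 40-60% figure quoted above, which presumably also accounts for other memory consumers such as weights, activations, and workspace.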
Their FP8 innovations are equally impressive, featuring fine-grained quantization strategies that address common challenges such as overflow and underflow in the FP8 format (a minimal sketch follows the list):
- Activations grouped and scaled on a 1x128 tile basis
- Weights grouped and scaled on a 128x128 block basis
- Unified E4M3 format usage throughout the model
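To make the tiling concrete, below is a minimal NumPy sketch of the scaling scheme described above, not DeepSeek's actual kernels: activations receive one scale per 1x128 tile, weights one scale per 128x128 block, and the FP8 cast is only simulated by clipping values to the E4M3 dynamic range, since NumPy has no native FP8 dtype. The shapes, helper names, and epsilon guard are illustrative assumptions.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def quantize_activations_1x128(x: np.ndarray):
    """One scale per (token, 128-channel) tile; x is (tokens, channels)."""
    t, c = x.shape
    assert c % 128 == 0
    tiles = x.reshape(t, c // 128, 128)
    scales = np.abs(tiles).max(axis=-1, keepdims=True) / E4M3_MAX
    scales = np.maximum(scales, 1e-12)                 # guard against all-zero tiles
    q = np.clip(tiles / scales, -E4M3_MAX, E4M3_MAX)   # stand-in for the E4M3 cast
    return q.reshape(t, c), scales.reshape(t, c // 128)


def quantize_weights_128x128(w: np.ndarray):
    """One scale per 128x128 block; w is (rows, cols)."""
    r, c = w.shape
    assert r % 128 == 0 and c % 128 == 0
    blocks = w.reshape(r // 128, 128, c // 128, 128)
    scales = np.abs(blocks).max(axis=(1, 3), keepdims=True) / E4M3_MAX
    scales = np.maximum(scales, 1e-12)
    q = np.clip(blocks / scales, -E4M3_MAX, E4M3_MAX)
    return q.reshape(r, c), scales.reshape(r // 128, c // 128)


# The per-tile and per-block scales travel with the quantized tensors and are
# multiplied back in during or after the FP8 GEMM; keeping the scaling groups
# small prevents an outlier in one tile from crushing the dynamic range of
# every other tile.
x = np.random.randn(4, 256).astype(np.float32)
w = np.random.randn(256, 128).astype(np.float32)
xq, x_scales = quantize_activations_1x128(x)
wq, w_scales = quantize_weights_128x128(w)
```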
These optimizations have enabled DeepSeek to achieve up to 3x throughput and 10x memory capacity improvements in their models.
CUTLASS Integration: Expanding Access to Innovation
To help the community access these groundbreaking kernels, we've created a dedicated branch in CUTLASS called "Deepseek." Anyone who clones CUTLASS now has immediate access to DeepSeek's MLA and DeepGEMM code with these optimized kernels.
Additionally, we'll be releasing our own CUTLASS-native variants optimized for the Blackwell architecture, which will be integrated into vLLM and SGLang.
Community Benefits: More Options, Better Performance
Tri Dao's work on Flash Attention has been foundational in advancing efficient attention mechanisms, and this MLA implementation provides yet another valuable option for the AI community alongside DeepSeek's many other contributions. The growing ecosystem of optimized attention mechanisms, including both FlashMLA and Flash Attention, gives developers more choices to find the best performance for their specific workloads.
The availability of state-of-the-art attention kernels through multiple channels gives the AI community valuable options for optimizing its workloads, and DeepSeek's implementations have already shown transformative results across multiple domains.
Join Us in Development
We invite the community to explore both DeepSeek's original implementations and our CUTLASS-integrated versions. By providing feedback and contributing to these open-source projects, you'll help advance the state of the art in AI acceleration technology. The combination of DeepSeek's innovative approaches and CUTLASS's proven framework creates exciting possibilities for the future of AI model optimization.