
Conversation

@QiZhangNV
Contributor

Description

Introduces a CUTLASS-based grouped GEMM implementation that reads m_splits directly on the device.

This optimization removes the need for device-to-host data transfers and synchronization in MCore, while allowing the number of quantization kernels to be reduced to one.

The kernel is fully compatible with CUDA Graphs.

Key points:
• Does not break the existing API. The operator accepts m_splits as either a torch.Tensor (on CPU or GPU) or a Python list; see the usage sketch after this list.
• Reduces CPU overhead, especially for large expert counts, by using a single quantization kernel instead of one per GEMM.
• Currently supports only MXFP8.
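
A minimal usage sketch of the two accepted forms of m_splits (illustrative only: the module arguments, shapes, dtypes, and recipe below are assumptions, not code from this PR):

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import MXFP8BlockScaling

# Illustrative sketch, not code from this PR: module arguments, shapes, and the
# int32 dtype for m_splits are assumptions. MXFP8 needs Blackwell (SM 10.0).
num_gemms, in_features, out_features = 4, 1024, 4096
layer = te.GroupedLinear(num_gemms, in_features, out_features, bias=False,
                         params_dtype=torch.bfloat16, device="cuda")

tokens_per_expert = [256, 512, 128, 384]  # multiples of 128, sum = 1280
inp = torch.randn(sum(tokens_per_expert), in_features,
                  device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=MXFP8BlockScaling()):
    # Existing behavior: m_splits as a Python list (converted internally).
    out = layer(inp, tokens_per_expert)

    # New behavior: a CUDA tensor selects the device-initiated CUTLASS path,
    # so the split sizes never have to be copied back to the host.
    m_splits = torch.tensor(tokens_per_expert, dtype=torch.int32, device="cuda")
    out = layer(inp, m_splits)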

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change m_splits from List[int] to torch.Tensor; a List[int] is still accepted and is converted to a tensor internally (see the sketch after this list)
  • Add te_general_device_initiated_grouped_gemm
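
On the Python side, the backward-compatible handling could look roughly like the sketch below (the helper name, the int32 dtype, and the dtype assertion are illustrative, pieced together from the commit notes about pinned memory and the m_splits dtype check, not copied from this PR):

from typing import List, Union

import torch

def normalize_m_splits(m_splits: Union[List[int], torch.Tensor]) -> torch.Tensor:
    """Accept either form of m_splits and hand a tensor to the GEMM path."""
    if isinstance(m_splits, list):
        # Pinned host memory keeps a later H2D copy asynchronous
        # (see the "Use pinned memory instead of pageable memory" commit below).
        return torch.tensor(m_splits, dtype=torch.int32, pin_memory=True)
    assert m_splits.dtype in (torch.int32, torch.int64), "unexpected m_splits dtype"
    return m_splits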

Unit Test

pytest -v -s tests/pytorch/test_numerics.py::test_grouped_linear_accuracy_cutlass_device

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Code clean & Add Check & Fix Arch

Change cutlass submodule to https

Support BF16

Fix assertion

Fix when all local experts have no tokens

Fix

Support save_original_input for cutlass backend

Fix remove cudaMallocAsync & modify CUTLASS config

Pass nullptr if C is not needed

Tune kernel Performance

Add dtype check for m_split

Optimize setGroupedGemmWgradArguments when fuse_wgrad_accumulation=false

Support partial wgrad accumulate when using cutlass backend

Use torch.empty() instead of torch.zeros() for wgrad_list

Fix IMA when CUDA Graph is enabled

Use arg wgrad_accumulation_mask to handle partial wgrad accumulation

Use bitmap for partial wgrad accumulate to avoid cudaMemcpyAsync

Allow m_splits to be List, convert to torch tensor

Use pinned memory instead of pageable memory

Refactor and add dispatcher

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR introduces device-initiated grouped GEMM support that eliminates CPU-GPU synchronization overhead for MoE (Mixture of Experts) workloads by reading m_splits directly on the device.

Key Changes:

  • Modified m_splits parameter from List[int] to torch.Tensor, maintaining backward compatibility by auto-converting lists
  • Added new CUTLASS-based kernel path (nvte_device_cutlass_grouped_gemm and nvte_device_cutlass_grouped_gemm_wgrad) for Blackwell GPUs (SM 10.0)
  • Implemented device-side argument preparation that reads m_splits tensor on GPU, avoiding D2H transfer
  • Added support for partial weight gradient accumulation via wgrad_accumulation_mask parameter
  • Uses pinned host memory buffer for CUDA Graph compatibility when transferring weight/scale factor addresses

Critical Issues Found:

  1. Race condition in global buffer index (gemm.cpp:563): The pinned_host_buffer_index global variable lacks thread safety and never resets, causing buffer overflow after multiple calls
  2. Multiple typos: Variable name m_splits_on_devie should be m_splits_on_device in several locations
  3. Complex lambda expression (grouped_linear.py:265-267): Nested immediately-invoked lambda makes code maintenance difficult

Limitations:

  • Device-initiated path only supports MXFP8 format on Blackwell GPUs
  • No bias support when m_splits is on device
  • Requires m dimension alignment to 128 for MXFP8 (increased from 32)
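
As an illustration of the last limitation, every per-expert row count has to be a multiple of 128; a caller that pads the counts would also have to pad the corresponding activation rows upstream. A minimal sketch (not code from this PR):

def pad_m_splits(counts, alignment=128):
    """Round each per-expert row count up to the required alignment."""
    return [((c + alignment - 1) // alignment) * alignment for c in counts]

padded = pad_m_splits([100, 260, 0, 384])
assert padded == [128, 384, 0, 384]
assert all(c % 128 == 0 for c in padded)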

Confidence Score: 2/5

  • This PR has critical concurrency bugs that will cause failures in production
  • The global pinned_host_buffer_index variable creates a race condition and memory-corruption risk. Without a synchronization or reset mechanism, the buffer index grows unbounded and will overflow the workspace. In addition, multiple typos in variable names suggest insufficient review and testing.
  • transformer_engine/pytorch/csrc/extensions/gemm.cpp requires immediate attention to fix the global buffer index race condition before merging

Important Files Changed

File Analysis

Filename | Score | Overview
transformer_engine/pytorch/module/grouped_linear.py | 3/5 | Changed m_splits from List[int] to torch.Tensor; added device-initiated path with m_splits_on_device flag; complex lambda expression for conditional wgrad accumulation
transformer_engine/pytorch/csrc/extensions/gemm.cpp | 2/5 | Implements te_general_device_initiated_grouped_gemm with a global pinned_host_buffer_index for CUDA Graph support; manages H2D copies for weight/SF addresses
transformer_engine/common/gemm/cutlass_device_grouped_gemm.cu | 4/5 | New CUTLASS kernel implementation for device-initiated grouped GEMM; supports fprop/dgrad/wgrad with MXFP8; includes WgradAccumulatePolicy for partial accumulation

Sequence Diagram

sequenceDiagram
    participant User
    participant GroupedLinear
    participant _GroupedLinear
    participant gemm.py
    participant gemm.cpp
    participant CUTLASS_CUDA
    
    User->>GroupedLinear: forward(inp, m_splits)
    Note over GroupedLinear: Convert m_splits to tensor if list
    GroupedLinear->>_GroupedLinear: forward()
    
    alt m_splits on device
        Note over _GroupedLinear: Single quantize (no split)
        _GroupedLinear->>_GroupedLinear: tex.split_quantize(inp, [total_size], quantizers[:1])
    else m_splits on CPU
        Note over _GroupedLinear: Split quantize per expert
        _GroupedLinear->>_GroupedLinear: tex.split_quantize(inp, m_splits.tolist(), quantizers)
    end
    
    _GroupedLinear->>gemm.py: general_grouped_gemm(A, B, out, m_splits, m_splits_on_device)
    
    alt m_splits_on_device
        gemm.py->>gemm.cpp: te_general_device_initiated_grouped_gemm()
        gemm.cpp->>gemm.cpp: Prepare B/SF address arrays in pinned memory
        gemm.cpp->>gemm.cpp: H2D copy (async, non-blocking)
        gemm.cpp->>CUTLASS_CUDA: nvte_device_cutlass_grouped_gemm()
        CUTLASS_CUDA->>CUTLASS_CUDA: setGroupedGemmArguments kernel (reads m_splits on device)
        CUTLASS_CUDA->>CUTLASS_CUDA: CUTLASS grouped GEMM kernel
    else m_splits on CPU
        gemm.py->>gemm.cpp: te_general_grouped_gemm()
        gemm.cpp->>gemm.cpp: cuBLAS multi-stream grouped GEMM
    end
    
    gemm.cpp-->>gemm.py: result
    gemm.py-->>_GroupedLinear: out
    _GroupedLinear-->>GroupedLinear: out
    GroupedLinear-->>User: result

11 files reviewed, 8 comments


// reuse and overwrite the same buffer.
// A global variable is used because the function doesn't know how many instances there are and which
// instance is calling.
int pinned_host_buffer_index = 0;

logic: global variable without synchronization creates race condition

The global pinned_host_buffer_index is not thread-safe. If multiple operators call this function concurrently (even from different streams), they will increment the same global variable, leading to:

  1. Buffer corruption as operators overwrite each other's data
  2. Index overflow beyond workspace bounds

Even with CUDA Graph capture being single-threaded per graph, eager mode or multiple graphs could cause issues.

Suggested change
int pinned_host_buffer_index = 0;
// Use atomic operations or per-instance tracking
std::atomic<int> pinned_host_buffer_index = 0;

Consider using thread-local storage or passing buffer offset through the call chain instead.

Comment on lines 265 to +267
  ctx.main_grad_funcs = [
-     lambda j=i: weights[j].main_grad for i in range(num_gemms)
+     (lambda idx=i: (lambda: weights[idx].main_grad) if (wgrad_accumulation_mask is None or wgrad_accumulation_mask[idx]) else (lambda: None))()
+     for i in range(num_gemms)

style: overly complex lambda creates hard-to-debug closure

This nested lambda with conditional execution (lambda idx=i: (lambda: ...) if condition else (lambda: None))() is executed immediately and difficult to understand. The inner lambdas capture weights[idx] but the outer lambda is called immediately with ().

Suggested change (replace the immediately-invoked nested lambda with a named helper):

def make_grad_func(idx):
    if wgrad_accumulation_mask is None or wgrad_accumulation_mask[idx]:
        return lambda: weights[idx].main_grad
    else:
        return lambda: None

ctx.main_grad_funcs = [make_grad_func(i) for i in range(num_gemms)]

save_original_input,
*weights_and_biases,
) -> torch.Tensor:
# pylint: disable=missing-function-docstring

syntax: typo: m_splits_on_devie should be m_splits_on_device

Suggested change
# pylint: disable=missing-function-docstring
m_splits_on_device = m_splits.is_cuda

Comment on lines +138 to +139
assert fp8 and FP8GlobalStateManager.get_fp8_recipe().mxfp8(), "Only MXFP8 is supported when m_splits is on devie"
# Cannot split because the m_splits is not available on host.

syntax: assertion message contains typo: devie should be device

Suggested change
assert fp8 and FP8GlobalStateManager.get_fp8_recipe().mxfp8(), "Only MXFP8 is supported when m_splits is on devie"
# Cannot split because the m_splits is not available on host.
assert fp8 and FP8GlobalStateManager.get_fp8_recipe().mxfp8(), "Only MXFP8 is supported when m_splits is on device"

Comment on lines 334 to +335
if ctx.use_bias:
grad_output_mats = torch.split(grad_output_view, ctx.m_splits)
assert not ctx.m_splits_on_devie, "bias is not supported when m_splits is on devie"

syntax: assertion message contains typo: devie should be device

Suggested change
if ctx.use_bias:
grad_output_mats = torch.split(grad_output_view, ctx.m_splits)
assert not ctx.m_splits_on_devie, "bias is not supported when m_splits is on devie"
assert not ctx.m_splits_on_device, "bias is not supported when m_splits is on device"

cast_if_needed(inp_view, ctx.activation_dtype), ctx.m_splits
)

assert ctx.fp8 and ctx.fp8_recipe.mxfp8(), "Only MXFP8 is supported when m_splits is on devie"

syntax: assertion message contains typo: devie should be device

Suggested change
assert ctx.fp8 and ctx.fp8_recipe.mxfp8(), "Only MXFP8 is supported when m_splits is on devie"
assert ctx.fp8 and ctx.fp8_recipe.mxfp8(), "Only MXFP8 is supported when m_splits is on device"

# Device buffer for cutlass arguments and kernel
torch.empty(get_cutlass_grouped_gemm_workspace_size_bytes(), dtype=torch.uint8, device="cuda"),
# TODO: Only allocate pinned buffer when cuda graph is enabled
# Host pinned buffer for the source of H2D copy of cutlass arguments

style: pinned memory buffer allocated unconditionally wastes memory

The comment says "TODO: Only allocate pinned buffer when cuda graph is enabled" but the buffer is always allocated. Pinned memory is a limited resource (typically capped at a fraction of system RAM).

Consider lazy allocation or checking torch.cuda.is_current_stream_capturing() state.
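
One possible shape for that lazy allocation, sketched in Python (the function name, default capacity, and int64 dtype are illustrative; only the NVTE_CUTLASS_HOST_PINNED_U64_CAPACITY variable comes from this PR):

import os

import torch

_pinned_host_buffer = None

def get_pinned_host_buffer() -> torch.Tensor:
    """Allocate the pinned host staging buffer on first use instead of unconditionally."""
    global _pinned_host_buffer
    if _pinned_host_buffer is None:
        # The default capacity is a placeholder; the env var is the one the
        # bounds check in the next comment refers to.
        capacity = int(os.getenv("NVTE_CUTLASS_HOST_PINNED_U64_CAPACITY", "1024"))
        _pinned_host_buffer = torch.empty(capacity, dtype=torch.int64, pin_memory=True)
    return _pinned_host_buffer

Gating on torch.cuda.is_current_stream_capturing(), as suggested, would need care, since the first call to this path may happen outside graph capture.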

Comment on lines +628 to +634
NVTE_CHECK(pinned_host_buffer_index + num_gemms * 2 <= workspace[1].size(0),
           "Pinned host buffer out of bounds, please increase the capacity by setting "
           "NVTE_CUTLASS_HOST_PINNED_U64_CAPACITY. "
           "Current buffer size: ",
           workspace[1].size(0));
inputB_and_SF_addrs = workspace[1].narrow(0, pinned_host_buffer_index, num_gemms * 2);
pinned_host_buffer_index += num_gemms * 2;

logic: buffer index never resets causing accumulation across calls

The pinned_host_buffer_index increments on every call but is never reset to 0. After multiple forward/backward passes, it will:

  1. Exceed workspace bounds
  2. Trigger the error check on line 628

For CUDA Graph, the index should reset at graph capture start or be managed per-graph instance.
