Feat/blackwell sm100 support#2670
Merged
Jiang-Jia-Jun merged 5 commits intoPaddlePaddle:developfrom Jul 9, 2025
Merged
Conversation
This change introduces initial support for the NVIDIA Blackwell GPU
architecture, specifically targeting SM100 (Compute Capability 10.x)
with '100a' architecture-specific features (e.g., for CUTLASS).
Key changes:
- Updated custom_ops/setup_ops.py to generate appropriate gencode
flags (arch=compute_100a,code=sm_100a) when '100' is specified
in FD_BUILDING_ARCS. Requires CUDA 12.9+.
- Updated custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h:
- Added CutlassTileConfigSM100 enum (with placeholder tile shapes).
- Added BLACKWELL to CandidateConfigTypeParam.
- Updated CutlassGemmConfig struct with is_sm100 flag,
tile_config_sm100, and new constructor for SM100.
- Modified toString() and fromString() for SM100 support.
- Updated custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu:
- Added get_candidate_tiles_sm100() (with placeholder tiles).
- Added placeholder mcast support functions for SM100.
- Updated get_candidate_configs() to include SM100 paths using
the BLACKWELL flag and new SM100 config types.
- Updated build.sh with comments to guide users on specifying '100'
for Blackwell in FD_BUILDING_ARCS.
Further work:
- Optimal CUTLASS tile configurations for SM100 need to be researched
and updated in cutlass_heuristic.cu.
- Kernel auto-generation scripts in custom_ops/utils/ may need
SM100-specific versions if Blackwell's hardware features for FP8/TMA
differ significantly from SM90.
- Compatibility of third-party libraries (CUTLASS v3.8.0, DeepGEMM)
with Blackwell should be fully verified.
This change integrates specific, expert-provided CUTLASS heuristic
configurations for the NVIDIA Blackwell (SM100) GPU architecture,
replacing previous placeholders. This includes:
- Updated `custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h`:
- Populated `CutlassTileConfigSM100` enum with specific tile shapes
(e.g., CtaShape64x64x128B, CtaShape128x128x128B) suitable for SM100.
- Added `FP4_ONLY` to `CandidateConfigTypeParam` for new FP4 paths.
- Updated `custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu`:
- Implemented `get_candidate_tiles_sm100` with detailed logic for
selecting tile configurations based on GROUPED_GEMM and FP4_ONLY flags,
using the new SM100 tile enums.
- Implemented `supports_mcast_along_m_sm100` and
`supports_mcast_along_n_sm100` with specific tile checks for Blackwell.
- Updated the `sm == 100` (Blackwell) block in `get_candidate_configs`
to use these new helper functions and accurately populate candidate
kernel configurations for various cluster shapes.
- `custom_ops/setup_ops.py` remains configured to compile for
`arch=compute_100a,code=sm_100a` with CUDA 12.9+ for these features.
This aligns the codebase with heuristic configurations similar to those
in upstream TensorRT-LLM / CUTLASS for Blackwell, enabling more
performant kernel selection on this new architecture.
|
|
|
Thanks for your contribution! |
vivienfanghuagood
approved these changes
Jul 3, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This change integrates specific, expert-provided CUTLASS heuristic
configurations for the NVIDIA Blackwell (SM100) GPU architecture,
replacing previous placeholders. This includes:
Updated
custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h:CutlassTileConfigSM100enum with specific tile shapes(e.g., CtaShape64x64x128B, CtaShape128x128x128B) suitable for SM100.
FP4_ONLYtoCandidateConfigTypeParamfor new FP4 paths.Updated
custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu:get_candidate_tiles_sm100with detailed logic forselecting tile configurations based on GROUPED_GEMM and FP4_ONLY flags,
using the new SM100 tile enums.
supports_mcast_along_m_sm100andsupports_mcast_along_n_sm100with specific tile checks for Blackwell.sm == 100(Blackwell) block inget_candidate_configsto use these new helper functions and accurately populate candidate
kernel configurations for various cluster shapes.
custom_ops/setup_ops.pyremains configured to compile forarch=compute_100a,code=sm_100awith CUDA 12.9+ for these features.This aligns the codebase with heuristic configurations similar to those
in upstream TensorRT-LLM / CUTLASS for Blackwell, enabling more
performant kernel selection on this new architecture.