fix: auto-fallback to flash_attn for Qwen3.5 on pre-Hopper GPUs (head_dim=256)#1808
Open
dadiaomengmeimei wants to merge 1 commit into THUDM:main from
Conversation
Contributor
I wonder if setting `--attention-backend flash` would be enough here?
Author
`--attention-backend flash` alone doesn't work because `get_qwen3_5_spec()` calls `get_gpt_decoder_block_spec(config, use_transformer_engine=True)`, which hardcodes TE's `DotProductAttention` as `core_attention` in the layer spec for full-attention layers. The `--attention-backend` CLI flag doesn't override spec-level `core_attention` bindings. On pre-Hopper GPUs (sm < 90), TE's kernels don't support `head_dim=256` (Qwen3.5's `kv_channels`), so we need to explicitly replace `core_attention` with `FlashDotProductAttention` in the spec — which is exactly what this PR does, with a runtime GPU capability check so H100/H200 users are unaffected.
Problem

Qwen3.5 uses `head_dim=256` for its full-attention layers. Transformer Engine's `DotProductAttention` only supports `head_dim <= 128` on GPUs with compute capability < 9.0 (pre-Hopper architectures such as L20, A100, RTX 4090, etc.), causing runtime errors.

Solution
Add runtime GPU capability detection in `get_qwen3_5_spec()`:

- `_te_supports_head_dim()` – checks whether the current GPU's compute capability supports the model's `head_dim` in TE.
- `_replace_core_attention_in_spec()` – recursively replaces `core_attention` in the decoder block spec with `FlashDotProductAttention` (which uses `flash_attn` directly and bypasses TE).

This change is transparent to Hopper/Blackwell (sm_90+) users – on those GPUs, TE is used as before with no changes.
Changes

- `slime_plugins/models/qwen3_5.py`: added `_te_supports_head_dim()`, `_replace_core_attention_in_spec()`, and the fallback logic in `get_qwen3_5_spec()`.

Testing

Verified on L20 (sm_89, compute capability 8.9) with Qwen3.5-2B – training runs successfully with the fallback to `FlashDotProductAttention`.