
fix: auto-fallback to flash_attn for Qwen3.5 on pre-Hopper GPUs (head_dim=256) #1808

Open
dadiaomengmeimei wants to merge 1 commit into THUDM:main from dadiaomengmeimei:fix/qwen3_5-pre-hopper-gpu-fallback

Conversation

@dadiaomengmeimei

Problem

Qwen3.5 uses head_dim=256 for its full-attention layers, but Transformer Engine's DotProductAttention only supports head_dim <= 128 on GPUs with compute capability below 9.0 (pre-Hopper architectures such as the L20, A100, and RTX 4090). Running Qwen3.5 on these GPUs therefore fails with a runtime error.

Solution

Add runtime GPU capability detection in get_qwen3_5_spec():

  1. _te_supports_head_dim() – checks whether the current GPU's compute capability supports the model's head_dim in TE.
  2. _replace_core_attention_in_spec() – recursively replaces core_attention in the decoder block spec with FlashDotProductAttention (which uses flash_attn directly and bypasses TE).
  3. A warning is logged when the fallback is activated.

This change is transparent to Hopper/Blackwell (sm_90+) users – on those GPUs, TE is used as before with no changes.
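The capability check in step 1 might look like the sketch below. The real helper, _te_supports_head_dim() in slime_plugins/models/qwen3_5.py, presumably queries torch.cuda.get_device_capability(); this version takes the capability as an explicit argument so it can run without a GPU. The 128-dim limit and the sm_90 cutoff are taken from the problem description above:

```python
def te_supports_head_dim(head_dim: int, compute_capability: tuple) -> bool:
    """Return True if Transformer Engine's DotProductAttention can handle
    this head_dim on a GPU with the given (major, minor) compute capability.

    Illustrative sketch only -- the real helper reads the capability of the
    current device instead of taking it as a parameter.
    """
    major, _minor = compute_capability
    if major >= 9:
        # Hopper (sm_90) and newer: head_dim=256 is supported, use TE as before.
        return True
    # Pre-Hopper kernels cap head_dim at 128.
    return head_dim <= 128


# Qwen3.5's head_dim=256 fails the check on an L20 (sm_89) but passes on H100 (sm_90):
te_supports_head_dim(256, (8, 9))  # False -> fall back to FlashDotProductAttention
te_supports_head_dim(256, (9, 0))  # True  -> keep TE
```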

Changes

  • slime_plugins/models/qwen3_5.py: Added _te_supports_head_dim(), _replace_core_attention_in_spec(), and the fallback logic in get_qwen3_5_spec().

Testing

Verified on L20 (sm_89, compute capability 8.9) with Qwen3.5-2B – training runs successfully with the fallback to FlashDotProductAttention.

@zhuzilin
Contributor

zhuzilin commented Apr 7, 2026

I wonder if setting --attention-backend flash works.

@dadiaomengmeimei
Author

> I wonder if setting --attention-backend flash works.

--attention-backend flash alone doesn't work. get_qwen3_5_spec() calls get_gpt_decoder_block_spec(config, use_transformer_engine=True), which hardcodes TE's DotProductAttention as core_attention in the layer spec for full-attention layers, and the --attention-backend CLI flag doesn't override spec-level core_attention bindings. On pre-Hopper GPUs (sm < 90), TE's kernels don't support head_dim=256 (Qwen3.5's kv_channels), so core_attention has to be replaced with FlashDotProductAttention in the spec itself. That is exactly what this PR does, guarded by a runtime GPU capability check so H100/H200 users are unaffected.
