fix: auto-fallback to flash_attn for Qwen3.5 on pre-Hopper GPUs (head_dim=256)#1808
Open
dadiaomengmeimei wants to merge 1 commit into THUDM:main from
Conversation
Contributor
I wonder if setting `--attention-backend flash` would be enough here?
Author
`--attention-backend flash` alone doesn't work because `get_qwen3_5_spec()` calls `get_gpt_decoder_block_spec(config, use_transformer_engine=True)`, which hardcodes TE's `DotProductAttention` as `core_attention` in the layer spec for full-attention layers. The `--attention-backend` CLI flag doesn't override spec-level `core_attention` bindings. On pre-Hopper GPUs (sm < 90), TE's kernels don't support `head_dim=256` (Qwen3.5's `kv_channels`), so we need to explicitly replace `core_attention` with `FlashDotProductAttention` in the spec — which is exactly what this PR does, with a runtime GPU capability check so H100/H200 users are unaffected.
Problem

Qwen3.5 uses `head_dim=256` for its full-attention layers. Transformer Engine's `DotProductAttention` only supports `head_dim <= 128` on GPUs with compute capability < 9.0 (pre-Hopper architectures such as L20, A100, RTX 4090, etc.), causing runtime errors.

Solution
Add runtime GPU capability detection in `get_qwen3_5_spec()`:

- `_te_supports_head_dim()` – checks whether the current GPU's compute capability supports the model's `head_dim` in TE.
- `_replace_core_attention_in_spec()` – recursively replaces `core_attention` in the decoder block spec with `FlashDotProductAttention` (which uses `flash_attn` directly and bypasses TE).

This change is transparent to Hopper/Blackwell (sm_90+) users – on those GPUs, TE is used as before with no changes.
Changes

- `slime_plugins/models/qwen3_5.py`: added `_te_supports_head_dim()`, `_replace_core_attention_in_spec()`, and the fallback logic in `get_qwen3_5_spec()`.

Testing

Verified on L20 (sm_89, compute capability 8.9) with Qwen3.5-2B – training runs successfully with the fallback to `FlashDotProductAttention`.