hi @cyr0930, in general, there are three main attention implementations supported by models in transformers:
- `eager`: the attention algorithm written by hand in the model code using basic linear algebra operations
- `sdpa`: uses `torch.nn.functional.scaled_dot_product_attention()` from torch
- `flash_attention_2`: uses the `flash-attn` library
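For reference, here is a minimal sketch of how a user picks one of these implementations when loading a model; the checkpoint name is just an example, any causal-LM checkpoint works the same way:

```python
from transformers import AutoModelForCausalLM

# "Qwen/Qwen2-0.5B" is only used for illustration here.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B",
    attn_implementation="sdpa",  # alternatives: "eager", "flash_attention_2"
)
```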
Qwen2 already supports sliding window attention with `flash_attention_2`, but I believe it should be possible to add sliding window support to Qwen2 with either SDPA or eager as well, based on the implementations in other models.
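To illustrate the idea for the SDPA path (this is a rough sketch, not the actual transformers code, and the exact off-by-one convention of the window may differ from the real implementation): since `scaled_dot_product_attention()` has no sliding-window argument, the window has to be carried by the attention mask itself.

```python
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len, window, dtype=torch.float32, device=None):
    # Additive mask: 0 where attention is allowed, dtype-min (~ -inf) elsewhere.
    # A query at position i may attend to keys j with i - window < j <= i.
    idx = torch.arange(seq_len, device=device)
    diff = idx[:, None] - idx[None, :]        # query_pos - key_pos
    allowed = (diff >= 0) & (diff < window)   # causal and within the window
    mask = torch.zeros(seq_len, seq_len, dtype=dtype, device=device)
    return mask.masked_fill(~allowed, torch.finfo(dtype).min)

# SDPA itself has no sliding-window parameter, so the mask does the work:
q = k = v = torch.randn(1, 8, 16, 64)         # (batch, heads, seq, head_dim)
attn_mask = sliding_window_causal_mask(16, window=4)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```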
Feature request
Can we implement `sliding_window` support here (https://github.com/huggingface/transformers/blob/v4.49.0/src/transformers/models/qwen2/modeling_qwen2.py#L237), similar to the Mistral implementation here (https://github.com/fxmarty/transformers/blob/383df6ced45be4c4ffc4c3b7616519b67369b00e/src/transformers/models/mistral/modeling_mistral.py#L1004)?
Or should this be torch's responsibility, via scaled_dot_product_attention itself (https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)?
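For comparison, the eager path could apply the same kind of windowed causal mask directly to the attention scores before the softmax. This is a naive sketch under the same masking assumption as above, not the actual Mistral or Qwen2 code (no KV cache, no GQA key/value repetition, no dropout):

```python
import math
import torch

def eager_sliding_window_attention(q, k, v, window):
    # q, k, v: (batch, heads, seq, head_dim)
    seq_len = q.shape[-2]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    idx = torch.arange(seq_len, device=q.device)
    diff = idx[:, None] - idx[None, :]        # query_pos - key_pos
    allowed = (diff >= 0) & (diff < window)   # causal and within the window
    scores = scores.masked_fill(~allowed, torch.finfo(scores.dtype).min)
    return torch.softmax(scores, dim=-1) @ v
```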
Motivation
Sliding window attention with SDPA is supported for Mistral, but not for Qwen2.
Your contribution
Maybe I can submit a PR?