hi @cyr0930, in general, there are three main attention implementations supported by models in transformers:
- `eager`: the attention algorithm written by hand in the model code using basic linear algebra operations
- `sdpa`: uses `torch.nn.functional.scaled_dot_product_attention()` from torch
- `flash_attention_2`: uses the `flash-attn` library
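For reference, here is a minimal sketch of how a user picks one of these implementations when loading a model; the checkpoint name is just an example, any causal-LM checkpoint works the same way:

```python
from transformers import AutoModelForCausalLM

# "Qwen/Qwen2-0.5B" is only used for illustration here.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B",
    attn_implementation="sdpa",  # alternatives: "eager", "flash_attention_2"
)
```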
Qwen2 already supports sliding window attention with `flash_attention_2`, but I believe it should be possible to add sliding window support to Qwen2 with either SDPA or eager as well, based on the implementations in other models.
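To illustrate the idea for the SDPA path (this is a rough sketch, not the actual transformers code, and the exact off-by-one convention of the window may differ from the real implementation): since `scaled_dot_product_attention()` has no sliding-window argument, the window has to be carried by the attention mask itself.

```python
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len, window, dtype=torch.float32, device=None):
    # Additive mask: 0 where attention is allowed, dtype-min (~ -inf) elsewhere.
    # A query at position i may attend to keys j with i - window < j <= i.
    idx = torch.arange(seq_len, device=device)
    diff = idx[:, None] - idx[None, :]        # query_pos - key_pos
    allowed = (diff >= 0) & (diff < window)   # causal and within the window
    mask = torch.zeros(seq_len, seq_len, dtype=dtype, device=device)
    return mask.masked_fill(~allowed, torch.finfo(dtype).min)

# SDPA itself has no sliding-window parameter, so the mask does the work:
q = k = v = torch.randn(1, 8, 16, 64)         # (batch, heads, seq, head_dim)
attn_mask = sliding_window_causal_mask(16, window=4)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```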
Feature request
Can we implement `sliding_window` support here (https://github.com/huggingface/transformers/blob/v4.49.0/src/transformers/models/qwen2/modeling_qwen2.py#L237), similar to the Mistral implementation here (https://github.com/fxmarty/transformers/blob/383df6ced45be4c4ffc4c3b7616519b67369b00e/src/transformers/models/mistral/modeling_mistral.py#L1004)?
Or should this be torch's responsibility, via scaled_dot_product_attention itself (https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)?
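For comparison, the eager path could apply the same kind of windowed causal mask directly to the attention scores before the softmax. This is a naive sketch under the same masking assumption as above, not the actual Mistral or Qwen2 code (no KV cache, no GQA key/value repetition, no dropout):

```python
import math
import torch

def eager_sliding_window_attention(q, k, v, window):
    # q, k, v: (batch, heads, seq, head_dim)
    seq_len = q.shape[-2]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    idx = torch.arange(seq_len, device=q.device)
    diff = idx[:, None] - idx[None, :]        # query_pos - key_pos
    allowed = (diff >= 0) & (diff < window)   # causal and within the window
    scores = scores.masked_fill(~allowed, torch.finfo(scores.dtype).min)
    return torch.softmax(scores, dim=-1) @ v
```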
Motivation
Sliding window attention with SDPA is supported for Mistral, but not for Qwen2.
Your contribution
Maybe I can submit a PR?