Since PyTorch 2.0, dispatch to FlashAttention happens dynamically inside `F.scaled_dot_product_attention` when the required conditions are met, so I don't see a way to verify whether FlashAttention is actually being used by default. Also, the general GPT recipes depend on HuggingFace models, which do not seem to call PyTorch's `F.scaled_dot_product_attention` at all, so I'm wondering whether FlashAttention will really be used when training with Composer. Any ideas on how to easily enable FlashAttention when using an HF model together with Composer?
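For reference, the only mechanism I'm aware of for pinning SDPA to the FlashAttention path is something like this minimal sketch (assuming PyTorch >= 2.0, a CUDA device, and fp16/bf16 tensors; the shapes below are just placeholders):

```python
import torch
import torch.nn.functional as F

# Placeholder tensors: FlashAttention needs fp16/bf16 inputs on a CUDA device.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the FlashAttention backend only; if the constraints aren't
# met, the call raises instead of silently falling back to another backend.
# (Later PyTorch versions deprecate this in favor of torch.nn.attention.sdpa_kernel.)
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

But that only helps for code paths that actually go through `F.scaled_dot_product_attention`, which is exactly what I can't confirm for the HF models.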
Hey, we'd recommend that you use our llm-foundry repo, which uses composer extensively and also supports using HF models. Check it out here!
Hi @snarayan21, thanks for the response. This however doesn't answer my original question. Even in llm-foundry, if we use HuggingFace model recipes, I don't see anything that guarantees the attention computation goes through PyTorch's `F.scaled_dot_product_attention`, which is what dispatches to FlashAttention or memory-efficient attention when the current model parameters allow it. Any insights into this?
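One way I can think of to check at runtime which backend actually gets dispatched is to profile a single SDPA call and inspect the kernel names; a rough sketch (assuming PyTorch >= 2.0 and a CUDA device; exact kernel names vary across versions, so this is only an indicator):

```python
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

# Placeholder tensors in fp16 so the FlashAttention backend is eligible.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Profile one SDPA call and look for FlashAttention kernels in the trace.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    F.scaled_dot_product_attention(q, k, v, is_causal=True)

# "flash" in a kernel name is a reasonable (but not guaranteed) sign that
# the FlashAttention path was dispatched.
flash_events = [e.key for e in prof.key_averages() if "flash" in e.key.lower()]
print("FlashAttention kernels seen:", flash_events or "none")
```

But that's an after-the-fact check, not something that ensures the HF attention layers use SDPA in the first place.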
Hey, so there are three cases you'll have when using llm-foundry:

1. Using an MPT model. This has configurable attention, and supports flash attention.
2. Using a Llama model. There is an option to patch in flash attention as configured in llm-foundry.
3. Using a HuggingFace model. Foundry will use whatever attention implementation the underlying HuggingFace model uses.
You can see our attention implementations in foundry in this folder. Hope this helps!
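For the third case, whether FlashAttention gets used comes down to the HF model itself. As a rough sketch, recent transformers releases let you request an attention implementation at load time (the exact keyword and per-architecture support vary by version, and `"gpt2"` below is just a placeholder model id):

```python
import torch
from transformers import AutoModelForCausalLM

# Ask transformers for a specific attention implementation; older versions
# may not accept this argument, and not every architecture supports every option.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",  # placeholder model id
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "flash_attention_2" if flash-attn is installed
)
```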