
sdpa: support attn_mask.requires_grad, support expanded number of heads in attn_mask #1563

Merged
1 commit merged into main on Dec 17, 2024

Conversation

@kiya00 (Collaborator) commented on Dec 17, 2024

Before submitting
  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #1482.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@kiya00 kiya00 requested a review from IvanYashchuk December 17, 2024 12:20
@kiya00 kiya00 marked this pull request as ready for review December 17, 2024 14:18
@t-vi (Collaborator) left a comment:


Thank you @kiya00

@t-vi t-vi enabled auto-merge (squash) December 17, 2024 14:21
@t-vi t-vi merged commit d3b2276 into main Dec 17, 2024
44 checks passed
@t-vi t-vi deleted the fixsdpa branch December 17, 2024 14:21
@@ -301,10 +301,12 @@ def _scaled_dot_product_efficient_attention_backward_impl(
    if attn_mask is None:
        grad_input_mask.append(False)
    else:
        grad_input_mask.append(attn_mask.requires_grad)
    # Cannot rely on the requires_grad in the meta function,
Collaborator review comment on this change:
The requires_grad of intermediate TensorProxies is ignored in our automatic differentiation code because we haven't done the work of properly threading this property through all computations.
We should remove the ability to query .requires_grad from intermediate TensorProxies entirely to avoid similar bugs in the future. This can be achieved by introducing a separate "InputTensorProxy" that carries this attribute and removing it from the regular TensorProxy.

Development

Successfully merging this pull request may close these issues.

[ThunderFX][HF] ValueError: unrecognized type in arguments: <class 'NoneType'>
3 participants