
sdpa: support attn_mask.requires_grad, support expanded number of heads in attn_mask #1563

Merged
1 commit merged into main on Dec 17, 2024

Conversation

@kiya00 (Collaborator) commented on Dec 17, 2024

Before submitting
  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #1482.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@kiya00 kiya00 requested a review from IvanYashchuk December 17, 2024 12:20
@kiya00 kiya00 marked this pull request as ready for review December 17, 2024 14:18
@t-vi (Collaborator) left a comment:


Thank you @kiya00

@t-vi t-vi enabled auto-merge (squash) December 17, 2024 14:21
@t-vi t-vi merged commit d3b2276 into main Dec 17, 2024
44 checks passed
@t-vi t-vi deleted the fixsdpa branch December 17, 2024 14:21
@@ -301,10 +301,12 @@ def _scaled_dot_product_efficient_attention_backward_impl(
    if attn_mask is None:
        grad_input_mask.append(False)
    else:
        grad_input_mask.append(attn_mask.requires_grad)
    # Cannot rely on the requires_grad in the meta function,
Collaborator review comment on this change:
The requires_grad of intermediate TensorProxies is ignored in our automatic differentiation code because we haven't done the work of properly threading this property through all computations.
We should remove the ability to query .requires_grad from intermediate TensorProxies entirely to avoid similar bugs in the future. This can be achieved by introducing a separate "InputTensorProxy" that carries this attribute and removing it from the regular TensorProxy.

Development

Successfully merging this pull request may close these issues.

[ThunderFX][HF] ValueError: unrecognized type in arguments: <class 'NoneType'>
3 participants