Improve batch_norm backward performance for automatically generated backward #182
Conversation
for more information, see https://pre-commit.ci
thunder/core/transforms.py
Outdated
# Inserting a conversion to the same dtype to disable nvFuser's bookend
# optimization, which can cause the backward pass to generate two kernels
mean_mdtype = prims.convert_element_type(m, m.dtype)
restored_mean = restore_reduced_dims(mean_mdtype, dims, a.shape)
This change wouldn't be needed if bookend optimization was disabled by default.
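For reference, a minimal sketch of how the bookend optimization could instead be switched off per compilation (this assumes nv_enable_bookend is forwarded to the nvFuser executor as a thunder.jit compile option; the exact entry point and plumbing are my assumption, not something this PR adds):

```python
import torch
import thunder

def fn(a):
    # var_mean is the reduction whose backward needs the redundant
    # same-dtype conversion when the bookend optimization is left enabled
    return torch.var_mean(a, dim=(0, 2, 3), correction=0)

a = torch.randn(8, 4, 16, 16, device="cuda", requires_grad=True)

# Hypothetical usage: disable the bookend optimization for this compilation
# instead of inserting convert_element_type in the var_mean backward rule.
jfn = thunder.jit(fn, nv_enable_bookend=False)
var, mean = jfn(a)
(var + mean).sum().backward()
```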
thunder/torch/__init__.py
Outdated
# Converting weight and bias in the computation_dtype so that nvFuser
# can't push out the reshape outside of the fusion region
weight = to(weight, computation_dtype)
weight = reshape(weight, params_shape)
out = out * weight
if bias is not None:
    bias = to(bias, computation_dtype)
This change wouldn't be needed if bookend optimization was disabled by default.
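In case the intent of the reordering is unclear, here is a minimal sketch of the same pattern written with plain PyTorch ops standing in for the thunder symbols (out, weight, bias, computation_dtype, and params_shape mirror the surrounding function; the bias branch is my continuation of the hunk shown above, not part of it):

```python
import torch

def apply_affine(out, weight, bias, computation_dtype, params_shape):
    # Cast the affine parameters to the computation dtype *before* reshaping,
    # so the reshape sits next to the arithmetic it feeds and is less likely
    # to be hoisted out of the fusion region as a bookend op.
    if weight is not None:
        weight = weight.to(computation_dtype).reshape(params_shape)
        out = out * weight
    if bias is not None:
        bias = bias.to(computation_dtype).reshape(params_shape)
        out = out + bias
    return out
```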
👏
We should embrace small diffs and stacked commits! But we can also merge this PR into the preceding one first if it makes you worry less.
GitHub behaves weirdly with automerge enabled and stacked PRs.
Thank you very much @IvanYashchuk
Thank you @IvanYashchuk @kiya00 @jjsjann123
Base PR: #139 (this PR will remain in draft mode until #139 is merged).
This PR reorders some operations to make disabling bookend optimization feasible without hitting a bug in nvFuser (NVIDIA/Fuser#1964).
nv_enable_bookend is set to True by default. nvFuser generates a single kernel for batch norm backward when the bookend optimization is turned off. To force the nvFuser executor to skip this optimization, I inserted a redundant dtype conversion in the var_mean backward before expanding the mean.
Before this change (on #139), there were 3 nvFuser kernels:
Current PR (2 nvFuser kernels, one for forward, one for backward):
Main before #139 has the same 2 nvFuser kernels, one for forward, one for backward.
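For anyone who wants to reproduce the kernel-count comparison, a rough sketch of such a script (the shapes and dtypes are placeholders I picked, not the configuration actually measured, and thunder.jit as the entry point is likewise an assumption):

```python
import torch
import thunder

def bn(a, weight, bias, running_mean, running_var):
    return torch.nn.functional.batch_norm(
        a, running_mean, running_var, weight, bias, training=True
    )

a = torch.randn(32, 64, 56, 56, device="cuda", requires_grad=True)
weight = torch.randn(64, device="cuda", requires_grad=True)
bias = torch.randn(64, device="cuda", requires_grad=True)
running_mean = torch.zeros(64, device="cuda")
running_var = torch.ones(64, device="cuda")

jbn = thunder.jit(bn)  # nv_enable_bookend is True by default
out = jbn(a, weight, bias, running_mean, running_var)
out.backward(torch.ones_like(out))
# Kernel counts can then be checked with a profiler such as nsys, e.g.
#   nsys profile --stats=true python repro.py
```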