Improve batch_norm backward performance for automatically generated backward #182


Merged

Conversation


@IvanYashchuk (Collaborator) commented Apr 15, 2024

Base PR: #139 (this PR will remain in draft until #139 is merged)

This PR reorders some operations to make disabling the bookend optimization feasible without hitting a bug in nvFuser (NVIDIA/Fuser#1964).

`nv_enable_bookend` is set to `True` by default. With the bookend optimization turned off, nvFuser generates a single kernel for batch norm backward. To force the nvFuser executor to skip this optimization, I inserted a redundant dtype conversion in the var_mean backward before expanding the mean.
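The idea can be illustrated with a small NumPy sketch (the function name and shapes here are hypothetical; the real change lives in Thunder's var_mean backward rule): a same-dtype cast is numerically a no-op, but it places an extra op between the reduction and the broadcast, so the bookend pass no longer sees the broadcast as a boundary op it can push out of the fusion region.

```python
import numpy as np

def mean_backward(g_mean, a_shape, dims):
    """Hypothetical sketch of the mean part of var_mean backward."""
    n = 1
    for d in dims:
        n *= a_shape[d]
    # Redundant same-dtype conversion, mirroring the PR's change: it is
    # numerically a no-op, but in the Thunder trace it sits between the
    # reduction output and the broadcast, so nvFuser's bookend
    # optimization cannot hoist the broadcast out of the fusion region.
    g_cast = g_mean.astype(g_mean.dtype)
    # Broadcast the (kept-dims) gradient of the mean back to a's shape.
    return np.broadcast_to(g_cast / n, a_shape)
```

This is only a sketch of the numerics; in Thunder the cast is expressed with `prims.convert_element_type`, as shown in the review snippet below.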
Before this change (on #139), 3 nvFuser kernels:

nsys nvprof pytest thunder/benchmarks/targets.py -k "test_batch_norm_grad[thunder]"

Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     28.6        2,655,400        220  12,070.0  12,064.0    11,905    13,952        146.0  <unnamed>::nvfuser_reduction_f1_c1_r0_g1(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int)3>, <u…
     21.7        2,015,528      1,100   1,832.3   1,248.0     1,184     4,257      1,186.2  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
     19.0        1,767,058        220   8,032.1   8,032.0     7,936     8,192         40.6  <unnamed>::nvfuser_pointwise_f1_c1_r0_g0(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int)3>, <u…
     16.6        1,540,891        220   7,004.1   7,008.0     6,817     7,520         66.1  <unnamed>::nvfuser_inner_persistent_f0_c1_r0_g0(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int…
      8.1          749,127        880     851.3     864.0       800       896         17.9  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<long>, at::detail::A…
      6.0          560,484        220   2,547.7   2,560.0     2,527     2,592         17.1  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<c10::BFloat16>, at::…

Current PR (2 nvFuser kernels, one for forward, one for backward):

nsys nvprof pytest thunder/benchmarks/targets.py -k "test_batch_norm_grad[thunder]"

Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     40.5        2,008,029      1,100   1,825.5   1,248.0     1,215     4,256      1,174.6  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
     26.9        1,334,931        220   6,067.9   6,049.0     5,984     6,368         33.7  <unnamed>::nvfuser_inner_persistent_f1_c1_r0_g0(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int…
     21.4        1,063,366        220   4,833.5   4,832.0     4,735     5,216         66.2  <unnamed>::nvfuser_inner_persistent_f0_c1_r0_g0(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int…
     11.2          557,249        220   2,533.0   2,528.0     2,496     2,592         12.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<c10::BFloat16>, at::…

Main before #139 has the same 2 nvFuser kernels, one for forward, one for backward.

Comment on lines 1154 to 1157
# Inserting a conversion to the same dtype to disable nvFuser's bookend
# optimization, which otherwise causes the backward pass to generate two kernels
mean_mdtype = prims.convert_element_type(m, m.dtype)
restored_mean = restore_reduced_dims(mean_mdtype, dims, a.shape)

This change wouldn't be needed if the bookend optimization were disabled by default.

Comment on lines 2543 to 2549
# Converting weight and bias to the computation_dtype so that nvFuser
# can't push the reshape out of the fusion region
weight = to(weight, computation_dtype)
weight = reshape(weight, params_shape)
out = out * weight
if bias is not None:
bias = to(bias, computation_dtype)
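For context, here is a minimal NumPy sketch of the affine step this snippet comes from (hypothetical helper name, assuming an NCHW-style layout with channels in dim 1): casting the per-channel parameters to the computation dtype before reshaping keeps the reshape attached to the surrounding arithmetic instead of sitting at the fusion boundary.

```python
import numpy as np

def apply_affine(out, weight, bias, computation_dtype=np.float32):
    """Hypothetical sketch of batch norm's affine step (out * w + b)."""
    # Per-channel params are reshaped to (1, C, 1, ..., 1) for broadcasting.
    params_shape = (1, weight.shape[0]) + (1,) * (out.ndim - 2)
    # Cast *before* reshape, mirroring the PR: with the cast in front of it,
    # the reshape is no longer a leading boundary op, so nvFuser's bookend
    # optimization cannot push it out of the fusion region.
    w = weight.astype(computation_dtype).reshape(params_shape)
    out = out * w
    if bias is not None:
        b = bias.astype(computation_dtype).reshape(params_shape)
        out = out + b
    return out
```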

This change wouldn't be needed if the bookend optimization were disabled by default.


@jjsjann123 (Collaborator) left a comment


👏

@jjsjann123

> Base PR #139 (this PR will remain in draft mode before merging 139)

Should this PR be merged into #139 first, before we land that one? Just so we don't end up with a bad commit that has a perf regression.

@IvanYashchuk

We should embrace small diffs and stacked commits!

But we can also merge this PR into the preceding one first if it makes you worry less.

@IvanYashchuk deleted the branch Lightning-AI:main April 16, 2024 09:06
@IvanYashchuk reopened this Apr 16, 2024
@IvanYashchuk changed the base branch from bn_decompose_fwd to main April 16, 2024 11:25
@IvanYashchuk

GitHub behaves weirdly with automerge enabled and stacked PRs.

@IvanYashchuk marked this pull request as ready for review April 16, 2024 11:27

@kiya00 (Collaborator) left a comment


Thank you very much @IvanYashchuk


@t-vi (Collaborator) left a comment


@t-vi enabled auto-merge (squash) April 16, 2024 13:51
@t-vi disabled auto-merge April 16, 2024 13:51
@t-vi merged commit e054d43 into Lightning-AI:main Apr 16, 2024
The github-actions bot deleted the bn_decompose_fwd-reorder-for-bookend branch July 17, 2024 00:33