Improve batch_norm backward performance for automatically generated backward #182


Merged

Conversation


@IvanYashchuk (Collaborator) commented Apr 15, 2024

Base PR: #139 (this PR will remain in draft until #139 is merged)

This PR reorders some operations to make disabling the bookend optimization feasible without hitting a bug in nvFuser (NVIDIA/Fuser#1964).

`nv_enable_bookend` is set to `True` by default. With the bookend optimization turned off, nvFuser generates a single kernel for batch norm backward. To force the nvFuser executor to skip this optimization, I inserted a redundant dtype conversion in the var_mean backward before expanding the mean.
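The idea can be illustrated with a small NumPy sketch (the function name and shapes here are hypothetical; the real change lives in Thunder's var_mean backward rule): a same-dtype cast is numerically a no-op, but it places an extra op between the reduction and the broadcast, so the bookend pass no longer sees the broadcast as a boundary op it can push out of the fusion region.

```python
import numpy as np

def mean_backward(g_mean, a_shape, dims):
    """Hypothetical sketch of the mean part of var_mean backward."""
    n = 1
    for d in dims:
        n *= a_shape[d]
    # Redundant same-dtype conversion, mirroring the PR's change: it is
    # numerically a no-op, but in the Thunder trace it sits between the
    # reduction output and the broadcast, so nvFuser's bookend
    # optimization cannot hoist the broadcast out of the fusion region.
    g_cast = g_mean.astype(g_mean.dtype)
    # Broadcast the (kept-dims) gradient of the mean back to a's shape.
    return np.broadcast_to(g_cast / n, a_shape)
```

This is only a sketch of the numerics; in Thunder the cast is expressed with `prims.convert_element_type`, as shown in the review snippet below.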
Before this change (on #139), 3 nvFuser kernels:

nsys nvprof pytest thunder/benchmarks/targets.py -k "test_batch_norm_grad[thunder]"

Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     28.6        2,655,400        220  12,070.0  12,064.0    11,905    13,952        146.0  <unnamed>::nvfuser_reduction_f1_c1_r0_g1(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int)3>, <u…
     21.7        2,015,528      1,100   1,832.3   1,248.0     1,184     4,257      1,186.2  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
     19.0        1,767,058        220   8,032.1   8,032.0     7,936     8,192         40.6  <unnamed>::nvfuser_pointwise_f1_c1_r0_g0(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int)3>, <u…
     16.6        1,540,891        220   7,004.1   7,008.0     6,817     7,520         66.1  <unnamed>::nvfuser_inner_persistent_f0_c1_r0_g0(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int…
      8.1          749,127        880     851.3     864.0       800       896         17.9  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<long>, at::detail::A…
      6.0          560,484        220   2,547.7   2,560.0     2,527     2,592         17.1  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<c10::BFloat16>, at::…

Current PR (2 nvFuser kernels, one for forward, one for backward):

nsys nvprof pytest thunder/benchmarks/targets.py -k "test_batch_norm_grad[thunder]"

Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     40.5        2,008,029      1,100   1,825.5   1,248.0     1,215     4,256      1,174.6  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
     26.9        1,334,931        220   6,067.9   6,049.0     5,984     6,368         33.7  <unnamed>::nvfuser_inner_persistent_f1_c1_r0_g0(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int…
     21.4        1,063,366        220   4,833.5   4,832.0     4,735     5,216         66.2  <unnamed>::nvfuser_inner_persistent_f0_c1_r0_g0(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int…
     11.2          557,249        220   2,533.0   2,528.0     2,496     2,592         12.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<c10::BFloat16>, at::…

Main before #139 has the same 2 nvFuser kernels, one for forward, one for backward.

Comment on lines 1154 to 1157
# Inserting a conversion to the same dtype to disable nvFuser's bookend
# optimization, which otherwise causes the backward pass to generate two kernels
mean_mdtype = prims.convert_element_type(m, m.dtype)
restored_mean = restore_reduced_dims(mean_mdtype, dims, a.shape)

This change wouldn't be needed if the bookend optimization were disabled by default.

Comment on lines 2543 to 2549
# Converting weight and bias to the computation_dtype so that nvFuser
# can't push the reshape out of the fusion region
weight = to(weight, computation_dtype)
weight = reshape(weight, params_shape)
out = out * weight
if bias is not None:
bias = to(bias, computation_dtype)
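For context, here is a minimal NumPy sketch of the affine step this snippet comes from (hypothetical helper name, assuming an NCHW-style layout with channels in dim 1): casting the per-channel parameters to the computation dtype before reshaping keeps the reshape attached to the surrounding arithmetic instead of sitting at the fusion boundary.

```python
import numpy as np

def apply_affine(out, weight, bias, computation_dtype=np.float32):
    """Hypothetical sketch of batch norm's affine step (out * w + b)."""
    # Per-channel params are reshaped to (1, C, 1, ..., 1) for broadcasting.
    params_shape = (1, weight.shape[0]) + (1,) * (out.ndim - 2)
    # Cast *before* reshape, mirroring the PR: with the cast in front of it,
    # the reshape is no longer a leading boundary op, so nvFuser's bookend
    # optimization cannot push it out of the fusion region.
    w = weight.astype(computation_dtype).reshape(params_shape)
    out = out * w
    if bias is not None:
        b = bias.astype(computation_dtype).reshape(params_shape)
        out = out + b
    return out
```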

This change wouldn't be needed if the bookend optimization were disabled by default.


@jjsjann123 (Collaborator) left a comment


👏

@jjsjann123

> Base PR #139 (this PR will remain in draft mode before merging 139)

Should this PR be merged into #139 first, before we land that one? Just so we don't end up with a bad commit that has a perf regression.

@IvanYashchuk

We should embrace small diffs and stacked commits!

But we can also merge this PR into the preceding one first if it makes you worry less.

@IvanYashchuk deleted the branch Lightning-AI:main April 16, 2024 09:06
@IvanYashchuk reopened this Apr 16, 2024
@IvanYashchuk changed the base branch from bn_decompose_fwd to main April 16, 2024 11:25
@IvanYashchuk

GitHub behaves weirdly with automerge enabled and stacked PRs.

@IvanYashchuk marked this pull request as ready for review April 16, 2024 11:27

@kiya00 (Collaborator) left a comment


Thank you very much @IvanYashchuk


@t-vi (Collaborator) left a comment


@t-vi enabled auto-merge (squash) April 16, 2024 13:51
@t-vi disabled auto-merge April 16, 2024 13:51
@t-vi merged commit e054d43 into Lightning-AI:main Apr 16, 2024
The github-actions bot deleted the bn_decompose_fwd-reorder-for-bookend branch July 17, 2024 00:33