
Support cuDNN frontend scaled dot product attention for FP8. Part- 2(backward) #15331

Closed
Wants to merge 5 commits.

Conversation

wenscarl
Contributor

@wenscarl wenscarl commented Jul 25, 2024

As the 2nd part of #15092.
NOTE: this feature relies on cudnn-frontend v1.6.1 which is not in XLA yet.

@wenscarl wenscarl changed the title Support cuDNN frontend scaled dot product attention for FP8. Part- 2(backward) [draft]Support cuDNN frontend scaled dot product attention for FP8. Part- 2(backward) Jul 25, 2024
@wenscarl wenscarl marked this pull request as draft July 25, 2024 19:38
@wenscarl wenscarl marked this pull request as ready for review August 16, 2024 03:58
@wenscarl wenscarl changed the title [draft]Support cuDNN frontend scaled dot product attention for FP8. Part- 2(backward) Support cuDNN frontend scaled dot product attention for FP8. Part- 2(backward) Aug 16, 2024
@bchetioui
Member

Given that cuDNN's FlashAttention is meant to remain behind a flag (as discussed previously), I wonder whether it still makes sense to integrate this within XLA.

I believe that we should already support calls to scaled dot product attention through JAX directly, is that correct?

@wenscarl
Contributor Author

> Given that cuDNN's FlashAttention is meant to remain behind a flag (as discussed previously), I wonder whether it still makes sense to integrate this within XLA.
>
> I believe that we should already support calls to scaled dot product attention through JAX directly, is that correct?

Do you refer to jax-ml/jax#22670? If so, JAX's SDPA API still calls cuDNN SDPA from XLA behind the scenes. Also, the forward-pass PR is already merged.

@bchetioui
Member

> Do you refer to jax-ml/jax#22670? If so, jax's sdpa api still calls cudnn sdpa from XLA behind the scene. Plus, the forward pass PR is already merged.

You're right, that seems reasonable, thanks for the clarification.

xla/service/gpu/cublas_cudnn.cc (outdated; resolved)
XlaBuilder builder(TestName());
std::string hlo_string_ref =
R"(
HloModule jit__unnamed_wrapped_function_, entry_computation_layout={(bf16[4,16,4,16]{3,2,1,0}, bf16[4,16,4,16]{3,2,1,0}, bf16[4,16,4,16]{3,2,1,0})->bf16[4,4,16,16]{3,1,2,0}}, allow_spmd_sharding_propagation_to_parameters={true,true,true}, allow_spmd_sharding_propagation_to_output={true}
Member

Can you explain what is the purpose of all the HLO leading to the custom call? Does it actually provide value? If not, should we just compare the upcasted custom call vs the reference one?

Contributor Author

The HLO leading to the custom-call performs a "cast-to-representable" operation, which adjusts the input to fit within the range that the FP8 data type can represent. Therefore, it's also necessary for the reference implementation to include this step in order to maintain numerical equivalence.
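For illustration, the clipping half of that cast-to-representable step can be sketched in plain Python. This is a sketch, not the actual HLO: the helper name `cast_to_representable` is hypothetical, and the 448.0 bound assumes the FP8 E4M3 ("e4m3fn") format.

```python
# Illustrative sketch only, not the XLA HLO: clamp higher-precision inputs
# to the largest finite FP8 E4M3 (e4m3fn) magnitude, 448.0, so that both the
# FP8 path and the reference path see the same representable values.
E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 (fn) format

def cast_to_representable(values, max_abs=E4M3_MAX):
    """Clamp each value into [-max_abs, max_abs] before the FP8 convert."""
    return [max(-max_abs, min(max_abs, v)) for v in values]

print(cast_to_representable([1e4, -1e4, 0.5]))  # [448.0, -448.0, 0.5]
```

Since both the FP8 module and the bf16 reference module apply the same clamp, the two stay numerically comparable; that is why the extra HLO ahead of the custom call is there.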

ROOT out = bf16[4,4,16,16]{3,1,2,0} convert(get-tuple-element.5.0)
} // main.106
)"; // NOLINT
EXPECT_TRUE(RunAndCompareTwoModules(hlo_string, hlo_string_ref,
Member

Do we need to run HLO passes here? If not, let's disable them in this call. (/*run_hlo_passes=*/false).

Contributor Author

Yes. The input HLOs are pre-optimization ones.

Member

The purpose of the test seems to be to test that the emitted custom call is correct, right? Presumably, we are not trying to test the end-to-end compilation pipeline---or are we testing anything particularly useful here by running it?

Let's make sure to use already-optimized HLO here and in the other cases where we try to ensure correctness of the custom call, and only use before-optimizations HLO where it is necessary.

GTEST_SKIP() << "Flash Attention requires cuDNN >= 9.1.0.";
}
XlaBuilder builder(TestName());
// generate padding mask in cuDNN directly
Member

Are we actually doing any pattern matching here?

Contributor Author

No pattern matching here.

// XLA pattern matching does not support matching the padding mask,
// so we lower directly to the custom call instead for the reference.
std::string hlo_string_ref = R"(
HloModule jit__unnamed_wrapped_function_, entry_computation_layout={(bf16[1,1,256,128]{3,2,1,0}, bf16[1,1,256,128]{3,2,1,0}, bf16[1,1,256,128]{3,2,1,0}, bf16[1,1,256,128]{3,2,1,0}, bf16[1,1,256,128]{3,2,1,0})->bf16[1,1,256,128]{3,1,2,0}}, allow_spmd_sharding_propagation_to_parameters={true,true,true,true,true}, allow_spmd_sharding_propagation_to_output={true}
Member

Same questions as above for the HLOs.

xla/service/gpu/transforms/cudnn_custom_call_compiler.cc (outdated; resolved)
xla/service/gpu/transforms/cudnn_custom_call_compiler.cc (outdated; resolved)
@bchetioui
Member

Gentle ping @wenscarl :)

Member

@bchetioui bchetioui left a comment

@wenscarl thank you for thoughtfully addressing the comments! Love the new tests, they look great!

I'd prefer if we ran without HLO passes here, but I suppose it doesn't hurt the test that much and improves readability to leave it as is---since the clipping logic can be called several times without duplication.

copybara-service bot pushed a commit that referenced this pull request Oct 9, 2024
…8. Part- 2(backward)

Imported from GitHub PR #15331

As the 2nd part of #15092.
NOTE: this feature relies on cudnn-frontend v1.6.1 which is not in XLA yet.
Copybara import of the project:

--
06db3c8 by shuw <shuw@nvidia.com>:

Scaled dot product attention implementation by cudnn.

--
937b0e2 by shuw <shuw@nvidia.com>:

Improve after review 1

--
398b2ba by shuw <shuw@nvidia.com>:

clang-format

--
0825789 by Shu Wang <shuw@nvidia.com>:

fix typo.
--
d0ae3cf by shuw <shuw@nvidia.com>:

Refactor test

Merging this change closes #15331

FUTURE_COPYBARA_INTEGRATE_REVIEW=#15331 from wenscarl:sdpa_fp8_bwd d0ae3cf
PiperOrigin-RevId: 683501409
@copybara-service copybara-service bot closed this in 467563e Oct 9, 2024
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Oct 9, 2024
…8. Part- 2(backward)

Imported from GitHub PR openxla/xla#15331

As the 2nd part of #15092.
NOTE: this feature relies on cudnn-frontend v1.6.1 which is not in XLA yet.
Copybara import of the project:

--
06db3c8349ca017440a2b9c4f4a7c41e557f03af by shuw <shuw@nvidia.com>:

Scaled dot product attention implementation by cudnn.

--
937b0e26ebcf5d48fce15fed8573d7c58b47e689 by shuw <shuw@nvidia.com>:

Improve after review 1

--
398b2ba2cef82f701a0ddecb7553423d92b1f902 by shuw <shuw@nvidia.com>:

clang-format

--
08257899ea899f66799bc701d81aad6ea94af6a0 by Shu Wang <shuw@nvidia.com>:

fix typo.
--
d0ae3cf52b7483c254137d8300f4c00aa963a7c6 by shuw <shuw@nvidia.com>:

Refactor test

Merging this change closes #15331

PiperOrigin-RevId: 684062495
3 participants