[SYCL][CUDA] Improve kernel launch error handling for out-of-registers #12604

Merged

sommerlukas merged 16 commits into intel:sycl from GeorgeWeb:georgi/sycl-cuda-out-of-resources-registers-error

Jun 3, 2024

Contributor

GeorgeWeb commented Feb 5, 2024

This PR improves error handling by adding specialized handling for PI_ERROR_OUT_OF_RESOURCES.

Previously, the CUDA backend reported the out-of-resources launch error (caused by exceeding the available registers) as an invalid work-group size error. With the new specialized handling, paired with the UR adapter change oneapi-src/unified-runtime#1318 that returns the correct error code, users no longer see a misleading error message.
This PR also adds a fallback message for the generic out-of-resources error codes returned from APIs (e.g. for kernel launch).

Fixes issue: oneapi-src/unified-runtime#1308
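
To illustrate the kind of specialization described above, here is a minimal sketch in plain C++. It is not the code from this PR: the `LaunchError` and `ExhaustedResource` enums and the `describeLaunchError` function are hypothetical stand-ins, and the fallback wording is invented for illustration. Only the out-of-registers message is taken from the test touched by this PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for plugin result codes. The names echo the codes
// discussed in this PR, but this enum is illustrative only.
enum class LaunchError {
  Success,
  InvalidWorkGroupSize,
  OutOfResources,
};

// Hypothetical hint describing which resource the backend ran out of.
enum class ExhaustedResource { Unknown, Registers };

// Maps a launch error to a user-facing message. Out-of-resources gets its
// own branch instead of being folded into the work-group size error, and
// falls back to a generic message when the exhausted resource is unknown.
std::string describeLaunchError(LaunchError Err, ExhaustedResource Res) {
  switch (Err) {
  case LaunchError::Success:
    return "";
  case LaunchError::InvalidWorkGroupSize:
    return "The work-group size is invalid for this kernel or device.";
  case LaunchError::OutOfResources:
    if (Res == ExhaustedResource::Registers)
      return "Exceeded the number of registers available on the hardware.";
    // Assumed fallback wording, not the message used by the runtime.
    return "The kernel launch failed because the device ran out of resources.";
  }
  return "Unknown kernel launch error.";
}
```

The point of the sketch is only the shape of the dispatch: a dedicated branch for the out-of-resources code, with a generic fallback so that an unrecognized cause still produces a sensible message.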

GeorgeWeb added 2 commits

February 2, 2024 16:26


          [SYCL][CUDA] Improve kernel launch error handling for out-of-registers

c9d329b


          Add default fallback error msg for out-of-resources

5b7cbec

GeorgeWeb mentioned this pull request

[CUDA] Use appropriate return code for out of registers kernel launch oneapi-src/unified-runtime#1318

Merged


          Cleanup test

cd4a8f8

GeorgeWeb force-pushed the georgi/sycl-cuda-out-of-resources-registers-error branch from 0bd1d10 to 11f09c1 Compare

February 5, 2024 15:32


          Override UR tag to fetch for testing

GeorgeWeb force-pushed the georgi/sycl-cuda-out-of-resources-registers-error branch from 11f09c1 to 7606868 Compare

February 5, 2024 15:33


          Merge remote-tracking branch 'upstream/sycl' into georgi/sycl-cuda-out-of-resources-registers-error

95d8ced

GeorgeWeb marked this pull request as ready for review

February 6, 2024 12:42

GeorgeWeb requested review from a team as code owners

February 6, 2024 12:42

GeorgeWeb requested a review from sergey-semenov

February 6, 2024 12:42

GeorgeWeb added 2 commits

February 6, 2024 13:26


          Update UR tag

4d6eb99


          Merge remote-tracking branch 'upstream/sycl' into georgi/sycl-cuda-out-of-resources-registers-error

7e258e9

GeorgeWeb force-pushed the georgi/sycl-cuda-out-of-resources-registers-error branch from 4b28554 to 7e258e9 Compare

February 6, 2024 13:27

GeorgeWeb commented

sycl/test-e2e/OptionalKernelFeatures/throw-exception-for-out-of-registers-on-kernel-launch.cpp (outdated)

rafbiels mentioned this pull request

[CUDA] Max local mem size check should return OUT_OF_RESOURCES oneapi-src/unified-runtime#1322

Open

GeorgeWeb marked this pull request as draft

February 29, 2024 15:09

Contributor Author

GeorgeWeb commented Feb 29, 2024

Heyo @intel/llvm-reviewers-runtime. May I get a look at this one? Thank you!

steffenlarsen reviewed

sycl/source/detail/error_handling/error_handling.cpp (outdated)

GeorgeWeb added 2 commits

March 4, 2024 15:37


          Address review comments

90b4ab3


          Merge remote-tracking branch 'upstream/sycl' into georgi/sycl-cuda-out-of-resources-registers-error

6fcadf4

steffenlarsen reviewed

sycl/source/detail/error_handling/error_handling.cpp (outdated)

GeorgeWeb added 2 commits

March 7, 2024 12:55


          Update use of sycl1.2.1 runtime_error to sycl2020 exception

1837e8f


          Merge remote-tracking branch 'upstream/sycl' into georgi/sycl-cuda-out-of-resources-registers-error

6f63a73

steffenlarsen approved these changes

Contributor

steffenlarsen left a comment

LGTM!

GeorgeWeb added 2 commits

March 28, 2024 10:28


          Merge remote-tracking branch 'upstream/sycl' into georgi/sycl-cuda-out-of-resources-registers-error

179f750


          Update UR commit tag

d7453dd

kbenzie reviewed

Contributor

kbenzie left a comment

Please pull in the latest sycl branch changes then update the UNIFIED_RUNTIME_REPO and UNIFIED_RUNTIME_TAG variables as shown in this diff:

diff --git a/sycl/plugins/unified_runtime/CMakeLists.txt b/sycl/plugins/unified_runtime/CMakeLists.txt
index a081bdb010d8..536358cb36af 100644
--- a/sycl/plugins/unified_runtime/CMakeLists.txt
+++ b/sycl/plugins/unified_runtime/CMakeLists.txt
@@ -100,13 +100,13 @@ if(SYCL_PI_UR_USE_FETCH_CONTENT)
   endfunction()

   set(UNIFIED_RUNTIME_REPO "https://github.com/oneapi-src/unified-runtime.git")
-  # commit 9f783837089c970a22cda08f768aa3dbed38f0d3
-  # Merge: c015f892 b9442104
+  # commit 5083f4f96557672b7b6a55ea53347896d40549d7
+  # Merge: a97eed15 4c3f9abe
   # Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
-  # Date:   Fri May 31 10:20:23 2024 +0100
-  #     Merge pull request #1533 from AllanZyne/sanitizer-buffer
-  #     [DeviceSanitizer] Support detecting out-of-bounds errors on sycl::buffer
-  set(UNIFIED_RUNTIME_TAG 9f783837089c970a22cda08f768aa3dbed38f0d3)
+  # Date:   Fri May 31 17:20:01 2024 +0100
+  #     Merge pull request #1397 from GeorgeWeb/georgi/check-allocation-error-on-event-from-native-handle
+  #     [CUDA][HIP] Catch and report bad_alloc errors for event object creation
+  set(UNIFIED_RUNTIME_TAG 5083f4f96557672b7b6a55ea53347896d40549d7)

   fetch_adapter_source(level_zero
     ${UNIFIED_RUNTIME_REPO}

sycl/plugins/unified_runtime/CMakeLists.txt (outdated)

GeorgeWeb added 2 commits

May 31, 2024 17:53


          Merge remote-tracking branch 'upstream/sycl' into georgi/sycl-cuda-out-of-resources-registers-error

9bcbe59


          Update UR tag

fbae8ea

kbenzie reviewed

sycl/plugins/unified_runtime/CMakeLists.txt (outdated)


          Revert overriding of FetchContent for UR

a868e0f

GeorgeWeb force-pushed the georgi/sycl-cuda-out-of-resources-registers-error branch from deb7c3e to a868e0f Compare

May 31, 2024 17:02

GeorgeWeb marked this pull request as ready for review

May 31, 2024 17:03

kbenzie approved these changes

Contributor

kbenzie commented Jun 3, 2024

@intel/llvm-gatekeepers please merge

sommerlukas merged commit 9f1cee5 into intel:sycl

14 checks passed

AlexeySachkov reviewed

sycl/test-e2e/OptionalKernelFeatures/throw-exception-for-out-of-registers-on-kernel-launch.cpp

                   using std::string_view_literals::operator""sv;
                   auto Msg = "Exceeded the number of registers available on the hardware."sv;
-                  if (std::string(e.what()).find(Msg) != std::string::npos) {
+                  auto Errc = sycl::make_error_code(sycl::errc::nd_range);

Contributor

AlexeySachkov Sep 10, 2024

@GeorgeWeb, I was looking at this PR and got confused by it. Could you please clarify what the key change is here? The PR description says that we used to display a confusing error to users, but the check for the exception message wasn't changed by the PR.

What was the confusing part then? I see that in #12363 we had a bug report which contains an invalid message, but this test was introduced in #9106, a year before that issue was submitted. What am I missing here?

Contributor Author

GeorgeWeb Sep 11, 2024

In summary, the wrong part of the message, as per the PR's description, was just the plugin error code description: PI_ERROR_INVALID_WORK_GROUP_SIZE should have been PI_ERROR_OUT_OF_LAUNCH_RESOURCES. I added the errc::nd_range error code check only for extra verbosity; it wasn't important here.

The real issue about "reporting a completely wrong message" in the OP's report (#12363) was due to a mistake on this line https://github.com/intel/llvm/pull/9106/files#diff-7525901710934f7bdb2ad36238c4b67163f112d3bd233db7af0b0078b5b01e80R3263 which was fixed by this UR CUDA change: oneapi-src/unified-runtime#1299
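
For readers following the thread, the combined check discussed here (error code plus message fragment) can be sketched in standard C++ roughly as follows. This is not the actual test code: `std::system_error` stands in for `sycl::exception`, which offers similar `code()` and `what()` accessors, and `isOutOfRegistersError` is a hypothetical helper name. In the real test the expected code would be `sycl::make_error_code(sycl::errc::nd_range)`.

```cpp
#include <cassert>
#include <string_view>
#include <system_error>

// Returns true when the exception carries both the expected error code and
// the expected message fragment. std::system_error is an assumption used
// here as a stand-in for sycl::exception.
bool isOutOfRegistersError(const std::system_error &E,
                           const std::error_code &Expected) {
  using std::string_view_literals::operator""sv;
  constexpr auto Msg =
      "Exceeded the number of registers available on the hardware."sv;
  // Check the code first, then look for the message anywhere in what().
  return E.code() == Expected &&
         std::string_view(E.what()).find(Msg) != std::string_view::npos;
}
```

Checking the code as well as the text makes the test slightly stricter, but as noted above the message check alone was already in place before this PR.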


Reviewers

AlexeySachkov left review comments

steffenlarsen approved these changes

kbenzie approved these changes

sergey-semenov: awaiting requested review (code owner automatically assigned from intel/llvm-reviewers-runtime)

Labels

None yet