-
Notifications
You must be signed in to change notification settings - Fork 751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][CUDA] Improve kernel launch error handling for out-of-registers #12604
[SYCL][CUDA] Improve kernel launch error handling for out-of-registers #12604
Conversation
0bd1d10
to
11f09c1
Compare
11f09c1
to
7606868
Compare
…t-of-resources-registers-error
…t-of-resources-registers-error
4b28554
to
7e258e9
Compare
sycl/test-e2e/OptionalKernelFeatures/throw-exception-for-out-of-registers-on-kernel-launch.cpp
Outdated
Show resolved
Hide resolved
Heyo @intel/llvm-reviewers-runtime. May I get a look at this one? Thank you! |
…t-of-resources-registers-error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
…t-of-resources-registers-error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please pull in the latest sycl branch changes then update the UNIFIED_RUNTIME_REPO
and UNIFIED_RUNTIME_TAG
variables as shown in this diff:
diff --git a/sycl/plugins/unified_runtime/CMakeLists.txt b/sycl/plugins/unified_runtime/CMakeLists.txt
index a081bdb010d8..536358cb36af 100644
--- a/sycl/plugins/unified_runtime/CMakeLists.txt
+++ b/sycl/plugins/unified_runtime/CMakeLists.txt
@@ -100,13 +100,13 @@ if(SYCL_PI_UR_USE_FETCH_CONTENT)
endfunction()
set(UNIFIED_RUNTIME_REPO "https://github.com/oneapi-src/unified-runtime.git")
- # commit 9f783837089c970a22cda08f768aa3dbed38f0d3
- # Merge: c015f892 b9442104
+ # commit 5083f4f96557672b7b6a55ea53347896d40549d7
+ # Merge: a97eed15 4c3f9abe
# Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
- # Date: Fri May 31 10:20:23 2024 +0100
- # Merge pull request #1533 from AllanZyne/sanitizer-buffer
- # [DeviceSanitizer] Support detecting out-of-bounds errors on sycl::buffer
- set(UNIFIED_RUNTIME_TAG 9f783837089c970a22cda08f768aa3dbed38f0d3)
+ # Date: Fri May 31 17:20:01 2024 +0100
+ # Merge pull request #1397 from GeorgeWeb/georgi/check-allocation-error-on-event-from-native-handle
+ # [CUDA][HIP] Catch and report bad_alloc errors for event object creation
+ set(UNIFIED_RUNTIME_TAG 5083f4f96557672b7b6a55ea53347896d40549d7)
fetch_adapter_source(level_zero
${UNIFIED_RUNTIME_REPO}
…t-of-resources-registers-error
deb7c3e
to
a868e0f
Compare
@intel/llvm-gatekeepers please merge |
using std::string_view_literals::operator""sv; | ||
auto Msg = "Exceeded the number of registers available on the hardware."sv; | ||
if (std::string(e.what()).find(Msg) != std::string::npos) { | ||
auto Errc = sycl::make_error_code(sycl::errc::nd_range); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GeorgeWeb, I was looking at the PR and got confused by it. Could you please clarify what is the key change here? PR description says that we used to display confusing error to users, but check for the exception message wasn't changed by the PR.
What was the confusing part then? I see that in #12363 we had a bug report which contains invalid message, but this test had been introduced a year ago before that issue was submitted in #9106. What am I missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In summary the wrong part of the message as per the PRs description was just the plugin error code description - PI_ERROR_INVALID_WORK_GROUP_SIZE
should have been PI_ERROR_OUT_OF_LAUNCH_RESOURCES
. The reason I've added the errc::nd_range
error code check was just for more verbosity but it wasn't of importance here.
The real issue about "reporting a completely wrong message" in the OPs report (#12363) was due to a mistake on this line https://github.com/intel/llvm/pull/9106/files#diff-7525901710934f7bdb2ad36238c4b67163f112d3bd233db7af0b0078b5b01e80R3263 which was fixed by this UR cuda change oneapi-src/unified-runtime#1299
This PR improves the handling of errors by specializing
PI_ERROR_OUT_OF_RESOURCES
.Previously, in the CUDA backend we handled the out of resources launch error (for exceeded registers) as invalid work group size error. Now pairing the new specialized handling with the UR adapter change oneapi-src/unified-runtime#1318 to return the correct error code, we no longer output a misleading error message to users.
Also, added a fallback message for the generic out of resources error codes returned from APIs (e.g. for kernel launch).
Fixes issue: oneapi-src/unified-runtime#1308