[BUG] HostAllocSuite failed initializing pinned memory pool: Maximum pool size exceeded intermittently #12194
Comments
Previously this was reported in #12039.
cc @abellina can you help take a look at this one? many thanks~
I believe we should look at the timing of initialization here and see if we can repro it. @pxLi I don't have time to look at this right now, but we can figure out if someone does.
@firestarman reposted in the issue ticket.
Temporarily ignored the cases in #12205 to unblock premerge and nightly CI for now.
…12205) this is a workaround for #12194 to unblock others before we figure out the root cause of the OOM issue

```
[2025-02-24T02:56:08.571Z] HostAllocSuite:
[2025-02-24T02:56:08.571Z] - simple pinned tryAlloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple non-pinned tryAlloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple mixed tryAlloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple pinned blocking alloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple non-pinned blocking alloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple mixed blocking alloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - pinned blocking alloc with spill !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - non-pinned blocking alloc with spill !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - mixed blocking alloc with spill !!! IGNORED !!!
```

Signed-off-by: Peixin Li <pxLi@nyu.edu>
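For reference, this is the kind of ScalaTest change that produces those IGNORED entries: swapping `test` for `ignore`. The suite and test names below are illustrative, not the actual HostAllocSuite code.

```scala
import org.scalatest.funsuite.AnyFunSuite

// Illustrative sketch of the temporary-disable pattern: replacing `test`
// with `ignore` keeps the case listed in the report (shown as
// "!!! IGNORED !!!") without executing its body.
class ExampleAllocSuite extends AnyFunSuite {
  ignore("simple pinned tryAlloc") {
    // skipped while the root cause of the intermittent OOM is investigated
  }

  test("simple non-pinned blocking alloc") {
    assert(1 + 1 == 2) // unaffected tests still run normally
  }
}
```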
Comparing against a successful run's log, the diff is that in the failed scenarios RMM throws an extra error log.
Need to revert #12205 once this error is fixed.
Yep, we will keep the issue open until the real fix is in.
I ran on different GPU instances more than 100 times with the WAR change #12205 reverted. The only common thing across my rare repros is that before the suite aborts, it throws an extra [Upstream 1073741824B] message compared to the successful runs.
From the stack trace, this seems to indicate that the failure happens during the afterAll() phase of this suite, while trying to re-initialize pinned memory for other cases: https://github.com/NVIDIA/spark-rapids/blob/branch-25.04/sql-plugin/src/test/scala/com/nvidia/spark/rapids/HostAllocSuite.scala#L355-L356
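To make that suspicion concrete, here is a minimal sketch of the teardown pattern in question, assuming the cudf Java `PinnedMemoryPool` API (`shutdown()` and `initialize(long)`); the class name and pool size are illustrative, not the actual suite code:

```scala
import ai.rapids.cudf.PinnedMemoryPool
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// Illustrative sketch only, not the actual HostAllocSuite code.
// The suspected pattern: afterAll() shuts down the suite-local pool and
// re-initializes a pinned pool for subsequent suites. If host memory from
// the previous pool has not been fully released yet, the fresh
// initialize() can intermittently fail with "Maximum pool size exceeded".
class TeardownSketch extends AnyFunSuite with BeforeAndAfterAll {
  override def afterAll(): Unit = {
    PinnedMemoryPool.shutdown()            // tear down the pool used by this suite
    PinnedMemoryPool.initialize(1L << 30)  // re-init 1 GiB (matches the 1073741824B in the message)
  }
}
```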
I created an internal ticket. Thanks!
Describe the bug
This occurs in both 25.02 and 25.04 (no specific commit), and can happen randomly with all Spark shims.
From the metrics, the GPU and host memory usage looks OK before the failing case, so this may not be caused at the OS level.
We may need to increase the mvn heap size (https://github.com/NVIDIA/spark-rapids/blob/branch-25.02/pom.xml#L1453) to avoid the issue, or limit memory usage in this case, thanks. Note this is off-heap memory usage. E.g. rapids_nightly-pre_release-github run: 777 (worker 2 stage).
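Since pinned memory is allocated off-heap, raising the JVM heap in the pom may not be enough on its own. A hypothetical sketch of the other option, limiting memory usage in this case (the object name and pool size are assumptions, again using the cudf Java `PinnedMemoryPool` API):

```scala
import ai.rapids.cudf.PinnedMemoryPool

// Hypothetical mitigation sketch: request a smaller pinned pool for the
// suite so that transient overlap with another pool stays within what the
// host can pin. The 256 MiB figure is an assumption for illustration.
object LimitedPinnedInit {
  def initForSuite(): Unit = {
    PinnedMemoryPool.initialize(256L << 20) // 256 MiB instead of a multi-GiB pool
  }
}
```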
Steps/Code to reproduce bug
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.
Expected behavior
A clear and concise description of what you expected to happen.
Environment details (please complete the following information)
Additional context
Add any other context about the problem here.