
[BUG] HostAllocSuite failed initializing pinned memory pool: Maximum pool size exceeded intermittently #12194

Open
pxLi opened this issue Feb 21, 2025 · 10 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working), test (Only impacts tests)


pxLi commented Feb 21, 2025

Describe the bug
This occurs in both 25.02 and 25.04 (no specific commit) and can happen randomly with any Spark shim.

GPU and host memory usage look OK in the metrics before the failing case, so this is probably not an OS-level problem. Increasing the mvn heap size (https://github.com/NVIDIA/spark-rapids/blob/branch-25.02/pom.xml#L1453) would not help, since this is off-heap memory usage; we may instead need to limit memory usage in this case.
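For context on why the mvn heap size is not the right lever: the pinned pool is reserved natively, outside the JVM heap. A minimal sketch, assuming the cudf Java API seen in the stack traces below (the 1 GiB size and the demo object are illustrative):

```scala
import ai.rapids.cudf.PinnedMemoryPool

object PinnedPoolOffHeapDemo {
  def main(args: Array[String]): Unit = {
    // The pinned pool is page-locked host memory reserved natively through
    // the cudf JNI layer and RMM, not from the JVM heap, so mvn's -Xmx/-Xms
    // settings do not bound it. The 1 GiB size is an illustrative assumption.
    PinnedMemoryPool.initialize(1L << 30)
  }
}
```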

e.g. rapids_nightly-pre_release-github run 777 (worker 2 stage):


```
[2025-02-21T06:23:53.796Z] Run starting. Expected test count is: 46
[2025-02-21T06:23:53.796Z] RangeConfMatcherSuite:
[2025-02-21T06:23:53.796Z] - empty
[2025-02-21T06:23:53.796Z] - bad ranges
[2025-02-21T06:23:53.796Z] - singles
[2025-02-21T06:23:53.796Z] - range only
[2025-02-21T06:23:53.796Z] - singles range mix
[2025-02-21T06:23:53.797Z] ArmSuite:
[2025-02-21T06:23:53.797Z] - withResource for Option[AutoCloseable]
[2025-02-21T06:23:53.797Z] - closeOnExcept single instance
[2025-02-21T06:23:53.797Z] - closeOnExcept sequence
[2025-02-21T06:23:53.797Z] - closeOnExcept arraybuffer
[2025-02-21T06:23:53.797Z] - closeOnExcept suppression single instance
[2025-02-21T06:23:53.797Z] - closeOnExcept suppression sequence
[2025-02-21T06:23:53.797Z] - closeOnExcept suppression arraybuffer
[2025-02-21T06:23:53.797Z] HostAllocSuite:
[2025-02-21T06:23:53.797Z] [2025-02-21 06:23:51.966] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:53.797Z] - simple pinned tryAlloc
[2025-02-21T06:23:53.797Z] [2025-02-21 06:23:52.988] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:53.797Z] [2025-02-21 06:23:52.989] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:53.797Z] - simple non-pinned tryAlloc
[2025-02-21T06:23:54.058Z] [2025-02-21 06:23:53.999] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:54.058Z] [2025-02-21 06:23:54.000] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:54.058Z] - simple mixed tryAlloc
[2025-02-21T06:23:55.432Z] [2025-02-21 06:23:55.421] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:55.692Z] - simple pinned blocking alloc
[2025-02-21T06:23:56.623Z] [2025-02-21 06:23:56.444] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:56.624Z] [2025-02-21 06:23:56.446] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:56.624Z] [2025-02-21 06:23:56.456] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:56.624Z] - simple non-pinned blocking alloc
[2025-02-21T06:23:57.555Z] [2025-02-21 06:23:57.475] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:57.555Z] - simple mixed blocking alloc
[2025-02-21T06:23:58.930Z] [2025-02-21 06:23:58.509] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:58.931Z] - pinned blocking alloc with spill
[2025-02-21T06:23:59.866Z] [2025-02-21 06:23:59.574] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:59.869Z] [2025-02-21 06:23:59.576] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:59.869Z] [2025-02-21 06:23:59.582] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:59.869Z] - non-pinned blocking alloc with spill
[2025-02-21T06:24:00.970Z] [2025-02-21 06:24:00.603] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:24:01.228Z] - mixed blocking alloc with spill
[2025-02-21T06:24:02.162Z] [2025-02-21 06:24:02.016] [RMM] [error] [A][Stream 0x1][Upstream 1073741824B][FAILURE maximum pool size exceeded]
[2025-02-21T06:24:02.162Z] SUITE ABORTED - HostAllocSuite: Error initializing pinned memory pool
[2025-02-21T06:24:02.162Z]   java.lang.RuntimeException: Error initializing pinned memory pool
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.getSingleton(PinnedMemoryPool.java:92)
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.getTotalPoolSizeBytes(PinnedMemoryPool.java:212)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAlloc.<init>(HostAlloc.scala:29)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAlloc$.initialize(HostAlloc.scala:267)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAllocSuite.afterAll(HostAllocSuite.scala:354)
[2025-02-21T06:24:02.162Z]   at org.scalatest.BeforeAndAfterAll.$anonfun$run$1(BeforeAndAfterAll.scala:225)
[2025-02-21T06:24:02.162Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377)
[2025-02-21T06:24:02.162Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373)
[2025-02-21T06:24:02.162Z]   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:962)
[2025-02-21T06:24:02.162Z]   at org.scalatest.Status.withAfterEffect(Status.scala:373)
[2025-02-21T06:24:02.162Z]   ...
[2025-02-21T06:24:02.162Z]   Cause: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-415-cuda11/target/libcudf/cmake-build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:276: Maximum pool size exceeded
[2025-02-21T06:24:02.162Z]   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
[2025-02-21T06:24:02.162Z]   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.getSingleton(PinnedMemoryPool.java:90)
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.getTotalPoolSizeBytes(PinnedMemoryPool.java:212)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAlloc.<init>(HostAlloc.scala:29)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAlloc$.initialize(HostAlloc.scala:267)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAllocSuite.afterAll(HostAllocSuite.scala:354)
[2025-02-21T06:24:02.162Z]   at org.scalatest.BeforeAndAfterAll.$anonfun$run$1(BeforeAndAfterAll.scala:225)
[2025-02-21T06:24:02.162Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377)
[2025-02-21T06:24:02.162Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373)
[2025-02-21T06:24:02.162Z]   ...
[2025-02-21T06:24:02.162Z]   Cause: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-415-cuda11/target/libcudf/cmake-build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:276: Maximum pool size exceeded
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.Rmm.newPinnedPoolMemoryResource(Native Method)
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.<init>(PinnedMemoryPool.java:225)
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.lambda$initialize$1(PinnedMemoryPool.java:142)
[2025-02-21T06:24:02.162Z]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[2025-02-21T06:24:02.162Z]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2025-02-21T06:24:02.162Z]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2025-02-21T06:24:02.162Z]   at java.lang.Thread.run(Thread.java:750)
[2025-02-21T06:24:02.162Z]   ...
[2025-02-21T06:24:02.162Z] TrafficControllerSuite:
[2025-02-21T06:24:02.162Z] - schedule tasks without blocking
[2025-02-21T06:24:02.162Z] - schedule task with blocking
[2025-02-21T06:24:02.420Z] - big task should be scheduled after all running tasks are completed
[2025-02-21T06:24:03.798Z] - all tasks are bigger than the total memory limit
[2025-02-21T06:24:03.799Z] - shutdown while blocking
[2025-02-21T06:24:03.799Z] GpuDeviceManagerSuite:
[2025-02-21T06:24:03.799Z] - Test Spark gpu resource
[2025-02-21T06:24:03.799Z] - Test Spark custom resource missed
[2025-02-21T06:24:03.799Z] - Test Spark multiple GPUs throws
[2025-02-21T06:24:03.799Z] - Test Spark custom resource
[2025-02-21T06:24:03.799Z] com.nvidia.spark.rapids.shims.spark332.SparkShimsSuite:
[2025-02-21T06:24:03.799Z] - spark shims version
[2025-02-21T06:24:03.799Z] - shuffle manager class
[2025-02-21T06:24:03.799Z] - TypeSig
[2025-02-21T06:24:03.799Z] ThrottlingExecutorSuite:
[2025-02-21T06:24:03.799Z] - tasks submitted should update the state
[2025-02-21T06:24:03.799Z] - tasks submission fails if totalHostMemoryBytes exceeds maxHostMemoryBytes
[2025-02-21T06:24:03.799Z] - submit one task heavier than maxHostMemoryBytes
[2025-02-21T06:24:04.059Z] - submit multiple tasks such that totalHostMemoryBytes does not exceed maxHostMemoryBytes
[2025-02-21T06:24:04.060Z] - shutdown while a task is blocked
[2025-02-21T06:24:04.060Z] - test task metrics
[2025-02-21T06:24:04.060Z] AsyncOutputStreamSuite:
[2025-02-21T06:24:04.318Z] - open, write, and close
[2025-02-21T06:24:04.318Z] - write after closed
[2025-02-21T06:24:04.318Z] - flush after closed
[2025-02-21T06:24:04.318Z] - write after error
[2025-02-21T06:24:04.318Z] - flush after error
[2025-02-21T06:24:04.318Z] - close after error
[2025-02-21T06:24:04.318Z] ThreadFactoryBuilderTest:
[2025-02-21T06:24:04.318Z] - test thread factory builder
[2025-02-21T06:24:04.318Z] Run completed in 25 seconds, 450 milliseconds.
[2025-02-21T06:24:04.318Z] Total number of tests run: 46
[2025-02-21T06:24:04.318Z] Suites: completed 9, aborted 1
[2025-02-21T06:24:04.318Z] Tests: succeeded 46, failed 0, canceled 0, ignored 0, pending 0
[2025-02-21T06:24:04.318Z] *** 1 SUITE ABORTED ***
```


pxLi added the ? - Needs Triage, bug, and test labels Feb 21, 2025

pxLi commented Feb 21, 2025

Previously this was reported in #12039.


pxLi commented Feb 21, 2025

cc @abellina, can you help take a look at this one? Many thanks!

pxLi changed the title from "[BUG] HostAllocSuite failed Maximum pool size exceeded intermittently" to "[BUG] HostAllocSuite failed initializing pinned memory pool: Maximum pool size exceeded intermittently" Feb 21, 2025
abellina (Collaborator) commented:

I believe we should look at the timing of initialization here and see if we can repro it. The PinnedMemoryPool.initialize call spawns off a thread, and we don't appear to wait for that at all in the test. I imagine it is either a race between initialization and the first alloc, or a race between initialization and shutdown. Or something really odd in the environment where we can't get pinned memory.

@pxLi I don't have time to look at this right now, but we can figure out if someone does.
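If it is an initialization race, one hedged mitigation sketch is to block on the asynchronous pool creation before anything allocates from it. Judging from the stack trace above, getTotalPoolSizeBytes() already funnels through getSingleton(), which waits on the internal FutureTask, so it can serve as a barrier (object name and pool size are illustrative):

```scala
import ai.rapids.cudf.PinnedMemoryPool

object PinnedPoolInitBarrier {
  def main(args: Array[String]): Unit = {
    // initialize() hands pool creation to a background thread, so nothing
    // guarantees the pool exists when the first test runs.
    PinnedMemoryPool.initialize(1L << 30) // size illustrative

    // getTotalPoolSizeBytes() goes through getSingleton(), which blocks on
    // the creation FutureTask (see the FutureTask.get frames in the stack
    // trace), so calling it forces initialization to complete or fail eagerly.
    val size = PinnedMemoryPool.getTotalPoolSizeBytes()
    assert(size > 0, "pinned memory pool failed to initialize")
  }
}
```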


pxLi commented Feb 24, 2025

Reposting @firestarman's report in the issue ticket:

[image attachment]


pxLi commented Feb 24, 2025

Temporarily ignored the failing cases in #12205 to unblock premerge and nightly CI for now.

pxLi added a commit that referenced this issue Feb 24, 2025
…12205)

This is a workaround for #12194 to unblock others before we figure out the root cause of the OOM issue.

```
[2025-02-24T02:56:08.571Z] HostAllocSuite:
[2025-02-24T02:56:08.571Z] - simple pinned tryAlloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple non-pinned tryAlloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple mixed tryAlloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple pinned blocking alloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple non-pinned blocking alloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple mixed blocking alloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - pinned blocking alloc with spill !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - non-pinned blocking alloc with spill !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - mixed blocking alloc with spill !!! IGNORED !!!
```

---------

Signed-off-by: Peixin Li <pxLi@nyu.edu>
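
For reference, the ScalaTest pattern behind those IGNORED markers is swapping test for ignore; this is an illustrative sketch, not the exact diff from #12205:

```scala
import org.scalatest.funsuite.AnyFunSuite

class HostAllocSuiteSketch extends AnyFunSuite {
  // 'ignore' keeps the body compiling but skips it at run time, which is
  // what produces the "!!! IGNORED !!!" lines in the log above.
  ignore("simple pinned tryAlloc") {
    // original test body left unchanged
  }
}
```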

pxLi commented Feb 24, 2025

For comparison, a successful run logs:

```
HostAllocSuite:
[2025-02-24 11:29:05.828] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple pinned tryAlloc
[2025-02-24 11:29:06.843] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:06.844] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple non-pinned tryAlloc
[2025-02-24 11:29:07.850] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:07.850] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple mixed tryAlloc
[2025-02-24 11:29:09.248] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple pinned blocking alloc
[2025-02-24 11:29:10.268] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:10.269] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:10.279] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple non-pinned blocking alloc
[2025-02-24 11:29:11.290] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple mixed blocking alloc
[2025-02-24 11:29:12.313] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- pinned blocking alloc with spill
[2025-02-24 11:29:13.346] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:13.347] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:13.349] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- non-pinned blocking alloc with spill
[2025-02-24 11:29:14.370] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- mixed blocking alloc with spill
Run completed in 16 seconds, 629 milliseconds.
Total number of tests run: 9
Suites: completed 2, aborted 0
Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

So the difference is in the failing scenarios: right before the abort, RMM throws an extra error with Upstream 1073741824B (1 GiB, likely the pool's whole initial reservation rather than a small test allocation):

```
[RMM] [error] [A][Stream 0x1][Upstream 1073741824B][FAILURE maximum pool size exceeded]
```

firestarman (Collaborator) commented:

Need to revert #12205 once this error is fixed.


pxLi commented Feb 24, 2025

> Need to revert #12205 once this error is fixed.

Yep, we will keep the issue open until the real fix is in.


pxLi commented Feb 26, 2025

> I believe we should look at the timing of initialization here and see if we can repro it.

I ran this more than 100 times on different GPU instances with the WAR change #12205 reverted and got random repros at a rate below 5%; currently there is no way to reproduce it stably. On A30 instances I got a ~30% repro rate.

The only common thing in my rare repros is that, right before the suite aborts, it throws the extra [Upstream 1073741824B] message that a successful run does not:

```
[2025-02-26T04:17:52.906Z] - mixed blocking alloc with spill
[2025-02-26T04:17:53.842Z] [  1308][04:17:53:609038][error ] [A][Stream 0x1][Upstream 1073741824B][FAILURE maximum pool size exceeded]
[2025-02-26T04:17:53.842Z] SUITE ABORTED - HostAllocSuite: Error initializing pinned memory pool
```

This seems to indicate that the failure happens during the afterAll() phase of this suite, while it tries to reinitialize the pinned memory pool for the other cases (https://github.com/NVIDIA/spark-rapids/blob/branch-25.04/sql-plugin/src/test/scala/com/nvidia/spark/rapids/HostAllocSuite.scala#L355-L356), as shown by the stack trace below (see the sketch after it):

```
[2025-02-26T05:25:20.015Z]   java.lang.RuntimeException: Error initializing pinned memory pool
[2025-02-26T05:25:20.015Z]   at ai.rapids.cudf.PinnedMemoryPool.getSingleton(PinnedMemoryPool.java:92)
[2025-02-26T05:25:20.015Z]   at ai.rapids.cudf.PinnedMemoryPool.getTotalPoolSizeBytes(PinnedMemoryPool.java:212)
[2025-02-26T05:25:20.015Z]   at com.nvidia.spark.rapids.HostAlloc.<init>(HostAlloc.scala:29)
[2025-02-26T05:25:20.015Z]   at com.nvidia.spark.rapids.HostAlloc$.initialize(HostAlloc.scala:267)
[2025-02-26T05:25:20.015Z]   at com.nvidia.spark.rapids.HostAllocSuite.afterAll(HostAllocSuite.scala:354)
```
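
Assuming that reading is right, the fragile step would be the shutdown-then-reinitialize ordering in afterAll(). A hedged sketch of that sequence, using the cudf Java PinnedMemoryPool API as I understand it (the barrier at the end is an illustrative addition, not the suite's actual code):

```scala
import ai.rapids.cudf.PinnedMemoryPool

object AfterAllReinitSketch {
  def main(args: Array[String]): Unit = {
    // Suspected-fragile teardown ordering: release the suite's pool, then
    // re-create a pool for the suites that run afterwards.
    PinnedMemoryPool.shutdown()

    // If the old pinned reservation has not been fully released by the time
    // this runs, the fresh 1 GiB request can fail with "maximum pool size
    // exceeded", matching the [Upstream 1073741824B] error before the abort.
    PinnedMemoryPool.initialize(1L << 30) // 1073741824B = 1 GiB

    // Illustrative barrier (not the suite's actual code): force the async
    // creation to finish here so any failure surfaces in afterAll() itself.
    PinnedMemoryPool.getTotalPoolSizeBytes()
  }
}
```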


pxLi commented Feb 26, 2025

I created an internal job, https://<JENKIN_URL>/job/repro-12194, to help reproduce this on internal A30 instances (it may take a few runs to get a repro; the rate is ~50%).

Thanks!
