
[BUG] HostAllocSuite failed initializing pinned memory pool: Maximum pool size exceeded intermittently #12194

Open
pxLi opened this issue Feb 21, 2025 · 10 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working), test (Only impacts tests)


pxLi commented Feb 21, 2025

Describe the bug
This occurs in both 25.02 and 25.04 (no specific commit) and can happen randomly with any Spark shim.

GPU and host memory usage look OK in the metrics before the failing case, so this is probably not an OS-level problem. Increasing the mvn heap size (https://github.com/NVIDIA/spark-rapids/blob/branch-25.02/pom.xml#L1453) would not help, since this is off-heap memory usage; we may instead need to limit memory usage in this case.
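For context on why the mvn heap size is not the right lever: the pinned pool is reserved natively, outside the JVM heap. A minimal sketch, assuming the cudf Java API seen in the stack traces below (the 1 GiB size and the demo object are illustrative):

```scala
import ai.rapids.cudf.PinnedMemoryPool

object PinnedPoolOffHeapDemo {
  def main(args: Array[String]): Unit = {
    // The pinned pool is page-locked host memory reserved natively through
    // the cudf JNI layer and RMM, not from the JVM heap, so mvn's -Xmx/-Xms
    // settings do not bound it. The 1 GiB size is an illustrative assumption.
    PinnedMemoryPool.initialize(1L << 30)
  }
}
```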

e.g. rapids_nightly-pre_release-github run 777 (worker 2 stage):


```
[2025-02-21T06:23:53.796Z] Run starting. Expected test count is: 46
[2025-02-21T06:23:53.796Z] RangeConfMatcherSuite:
[2025-02-21T06:23:53.796Z] - empty
[2025-02-21T06:23:53.796Z] - bad ranges
[2025-02-21T06:23:53.796Z] - singles
[2025-02-21T06:23:53.796Z] - range only
[2025-02-21T06:23:53.796Z] - singles range mix
[2025-02-21T06:23:53.797Z] ArmSuite:
[2025-02-21T06:23:53.797Z] - withResource for Option[AutoCloseable]
[2025-02-21T06:23:53.797Z] - closeOnExcept single instance
[2025-02-21T06:23:53.797Z] - closeOnExcept sequence
[2025-02-21T06:23:53.797Z] - closeOnExcept arraybuffer
[2025-02-21T06:23:53.797Z] - closeOnExcept suppression single instance
[2025-02-21T06:23:53.797Z] - closeOnExcept suppression sequence
[2025-02-21T06:23:53.797Z] - closeOnExcept suppression arraybuffer
[2025-02-21T06:23:53.797Z] HostAllocSuite:
[2025-02-21T06:23:53.797Z] [2025-02-21 06:23:51.966] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:53.797Z] - simple pinned tryAlloc
[2025-02-21T06:23:53.797Z] [2025-02-21 06:23:52.988] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:53.797Z] [2025-02-21 06:23:52.989] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:53.797Z] - simple non-pinned tryAlloc
[2025-02-21T06:23:54.058Z] [2025-02-21 06:23:53.999] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:54.058Z] [2025-02-21 06:23:54.000] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:54.058Z] - simple mixed tryAlloc
[2025-02-21T06:23:55.432Z] [2025-02-21 06:23:55.421] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:55.692Z] - simple pinned blocking alloc
[2025-02-21T06:23:56.623Z] [2025-02-21 06:23:56.444] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:56.624Z] [2025-02-21 06:23:56.446] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:56.624Z] [2025-02-21 06:23:56.456] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:56.624Z] - simple non-pinned blocking alloc
[2025-02-21T06:23:57.555Z] [2025-02-21 06:23:57.475] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:57.555Z] - simple mixed blocking alloc
[2025-02-21T06:23:58.930Z] [2025-02-21 06:23:58.509] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:58.931Z] - pinned blocking alloc with spill
[2025-02-21T06:23:59.866Z] [2025-02-21 06:23:59.574] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:59.869Z] [2025-02-21 06:23:59.576] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:59.869Z] [2025-02-21 06:23:59.582] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:23:59.869Z] - non-pinned blocking alloc with spill
[2025-02-21T06:24:00.970Z] [2025-02-21 06:24:00.603] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-21T06:24:01.228Z] - mixed blocking alloc with spill
[2025-02-21T06:24:02.162Z] [2025-02-21 06:24:02.016] [RMM] [error] [A][Stream 0x1][Upstream 1073741824B][FAILURE maximum pool size exceeded]
[2025-02-21T06:24:02.162Z] SUITE ABORTED - HostAllocSuite: Error initializing pinned memory pool
[2025-02-21T06:24:02.162Z]   java.lang.RuntimeException: Error initializing pinned memory pool
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.getSingleton(PinnedMemoryPool.java:92)
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.getTotalPoolSizeBytes(PinnedMemoryPool.java:212)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAlloc.<init>(HostAlloc.scala:29)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAlloc$.initialize(HostAlloc.scala:267)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAllocSuite.afterAll(HostAllocSuite.scala:354)
[2025-02-21T06:24:02.162Z]   at org.scalatest.BeforeAndAfterAll.$anonfun$run$1(BeforeAndAfterAll.scala:225)
[2025-02-21T06:24:02.162Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377)
[2025-02-21T06:24:02.162Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373)
[2025-02-21T06:24:02.162Z]   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:962)
[2025-02-21T06:24:02.162Z]   at org.scalatest.Status.withAfterEffect(Status.scala:373)
[2025-02-21T06:24:02.162Z]   ...
[2025-02-21T06:24:02.162Z]   Cause: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-415-cuda11/target/libcudf/cmake-build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:276: Maximum pool size exceeded
[2025-02-21T06:24:02.162Z]   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
[2025-02-21T06:24:02.162Z]   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.getSingleton(PinnedMemoryPool.java:90)
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.getTotalPoolSizeBytes(PinnedMemoryPool.java:212)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAlloc.<init>(HostAlloc.scala:29)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAlloc$.initialize(HostAlloc.scala:267)
[2025-02-21T06:24:02.162Z]   at com.nvidia.spark.rapids.HostAllocSuite.afterAll(HostAllocSuite.scala:354)
[2025-02-21T06:24:02.162Z]   at org.scalatest.BeforeAndAfterAll.$anonfun$run$1(BeforeAndAfterAll.scala:225)
[2025-02-21T06:24:02.162Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377)
[2025-02-21T06:24:02.162Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373)
[2025-02-21T06:24:02.162Z]   ...
[2025-02-21T06:24:02.162Z]   Cause: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-415-cuda11/target/libcudf/cmake-build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:276: Maximum pool size exceeded
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.Rmm.newPinnedPoolMemoryResource(Native Method)
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.<init>(PinnedMemoryPool.java:225)
[2025-02-21T06:24:02.162Z]   at ai.rapids.cudf.PinnedMemoryPool.lambda$initialize$1(PinnedMemoryPool.java:142)
[2025-02-21T06:24:02.162Z]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[2025-02-21T06:24:02.162Z]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2025-02-21T06:24:02.162Z]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2025-02-21T06:24:02.162Z]   at java.lang.Thread.run(Thread.java:750)
[2025-02-21T06:24:02.162Z]   ...
[2025-02-21T06:24:02.162Z] TrafficControllerSuite:
[2025-02-21T06:24:02.162Z] - schedule tasks without blocking
[2025-02-21T06:24:02.162Z] - schedule task with blocking
[2025-02-21T06:24:02.420Z] - big task should be scheduled after all running tasks are completed
[2025-02-21T06:24:03.798Z] - all tasks are bigger than the total memory limit
[2025-02-21T06:24:03.799Z] - shutdown while blocking
[2025-02-21T06:24:03.799Z] GpuDeviceManagerSuite:
[2025-02-21T06:24:03.799Z] - Test Spark gpu resource
[2025-02-21T06:24:03.799Z] - Test Spark custom resource missed
[2025-02-21T06:24:03.799Z] - Test Spark multiple GPUs throws
[2025-02-21T06:24:03.799Z] - Test Spark custom resource
[2025-02-21T06:24:03.799Z] com.nvidia.spark.rapids.shims.spark332.SparkShimsSuite:
[2025-02-21T06:24:03.799Z] - spark shims version
[2025-02-21T06:24:03.799Z] - shuffle manager class
[2025-02-21T06:24:03.799Z] - TypeSig
[2025-02-21T06:24:03.799Z] ThrottlingExecutorSuite:
[2025-02-21T06:24:03.799Z] - tasks submitted should update the state
[2025-02-21T06:24:03.799Z] - tasks submission fails if totalHostMemoryBytes exceeds maxHostMemoryBytes
[2025-02-21T06:24:03.799Z] - submit one task heavier than maxHostMemoryBytes
[2025-02-21T06:24:04.059Z] - submit multiple tasks such that totalHostMemoryBytes does not exceed maxHostMemoryBytes
[2025-02-21T06:24:04.060Z] - shutdown while a task is blocked
[2025-02-21T06:24:04.060Z] - test task metrics
[2025-02-21T06:24:04.060Z] AsyncOutputStreamSuite:
[2025-02-21T06:24:04.318Z] - open, write, and close
[2025-02-21T06:24:04.318Z] - write after closed
[2025-02-21T06:24:04.318Z] - flush after closed
[2025-02-21T06:24:04.318Z] - write after error
[2025-02-21T06:24:04.318Z] - flush after error
[2025-02-21T06:24:04.318Z] - close after error
[2025-02-21T06:24:04.318Z] ThreadFactoryBuilderTest:
[2025-02-21T06:24:04.318Z] - test thread factory builder
[2025-02-21T06:24:04.318Z] Run completed in 25 seconds, 450 milliseconds.
[2025-02-21T06:24:04.318Z] Total number of tests run: 46
[2025-02-21T06:24:04.318Z] Suites: completed 9, aborted 1
[2025-02-21T06:24:04.318Z] Tests: succeeded 46, failed 0, canceled 0, ignored 0, pending 0
[2025-02-21T06:24:04.318Z] *** 1 SUITE ABORTED ***
```


pxLi added the ? - Needs Triage, bug, and test labels Feb 21, 2025

pxLi commented Feb 21, 2025

Previously this was reported in #12039.


pxLi commented Feb 21, 2025

cc @abellina, can you help take a look at this one? Many thanks!

pxLi changed the title from "[BUG] HostAllocSuite failed Maximum pool size exceeded intermittently" to "[BUG] HostAllocSuite failed initializing pinned memory pool: Maximum pool size exceeded intermittently" Feb 21, 2025
abellina (Collaborator) commented:

I believe we should look at the timing of initialization here and see if we can repro it. The PinnedMemoryPool.initialize call spawns off a thread, and we don't appear to wait for that at all in the test. I imagine it is either a race between initialization and the first alloc, or a race between initialization and shutdown. Or something really odd in the environment where we can't get pinned memory.

@pxLi I don't have time to look at this right now, but we can figure out if someone does.
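If it is an initialization race, one hedged mitigation sketch is to block on the asynchronous pool creation before anything allocates from it. Judging from the stack trace above, getTotalPoolSizeBytes() already funnels through getSingleton(), which waits on the internal FutureTask, so it can serve as a barrier (object name and pool size are illustrative):

```scala
import ai.rapids.cudf.PinnedMemoryPool

object PinnedPoolInitBarrier {
  def main(args: Array[String]): Unit = {
    // initialize() hands pool creation to a background thread, so nothing
    // guarantees the pool exists when the first test runs.
    PinnedMemoryPool.initialize(1L << 30) // size illustrative

    // getTotalPoolSizeBytes() goes through getSingleton(), which blocks on
    // the creation FutureTask (see the FutureTask.get frames in the stack
    // trace), so calling it forces initialization to complete or fail eagerly.
    val size = PinnedMemoryPool.getTotalPoolSizeBytes()
    assert(size > 0, "pinned memory pool failed to initialize")
  }
}
```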


pxLi commented Feb 24, 2025

Reposting @firestarman's report in the issue ticket:

[image attachment]


pxLi commented Feb 24, 2025

Temporarily ignored the failing cases in #12205 to unblock premerge and nightly CI for now.

pxLi added a commit that referenced this issue Feb 24, 2025
…12205)

This is a workaround for #12194 to unblock others before we figure out the root cause of the OOM issue.

```
[2025-02-24T02:56:08.571Z] HostAllocSuite:
[2025-02-24T02:56:08.571Z] - simple pinned tryAlloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple non-pinned tryAlloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple mixed tryAlloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple pinned blocking alloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple non-pinned blocking alloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - simple mixed blocking alloc !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - pinned blocking alloc with spill !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - non-pinned blocking alloc with spill !!! IGNORED !!!
[2025-02-24T02:56:08.571Z] - mixed blocking alloc with spill !!! IGNORED !!!
```

---------

Signed-off-by: Peixin Li <pxLi@nyu.edu>
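
For reference, the ScalaTest pattern behind those IGNORED markers is swapping test for ignore; this is an illustrative sketch, not the exact diff from #12205:

```scala
import org.scalatest.funsuite.AnyFunSuite

class HostAllocSuiteSketch extends AnyFunSuite {
  // 'ignore' keeps the body compiling but skips it at run time, which is
  // what produces the "!!! IGNORED !!!" lines in the log above.
  ignore("simple pinned tryAlloc") {
    // original test body left unchanged
  }
}
```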

pxLi commented Feb 24, 2025

For comparison, a successful run logs:

```
HostAllocSuite:
[2025-02-24 11:29:05.828] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple pinned tryAlloc
[2025-02-24 11:29:06.843] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:06.844] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple non-pinned tryAlloc
[2025-02-24 11:29:07.850] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:07.850] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple mixed tryAlloc
[2025-02-24 11:29:09.248] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple pinned blocking alloc
[2025-02-24 11:29:10.268] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:10.269] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:10.279] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple non-pinned blocking alloc
[2025-02-24 11:29:11.290] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- simple mixed blocking alloc
[2025-02-24 11:29:12.313] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- pinned blocking alloc with spill
[2025-02-24 11:29:13.346] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:13.347] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-02-24 11:29:13.349] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- non-pinned blocking alloc with spill
[2025-02-24 11:29:14.370] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
- mixed blocking alloc with spill
Run completed in 16 seconds, 629 milliseconds.
Total number of tests run: 9
Suites: completed 2, aborted 0
Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

So the difference is in the failing scenarios: right before the abort, RMM throws an extra error with Upstream 1073741824B (1 GiB, likely the pool's whole initial reservation rather than a small test allocation):

```
[RMM] [error] [A][Stream 0x1][Upstream 1073741824B][FAILURE maximum pool size exceeded]
```

firestarman (Collaborator) commented:

Need to revert #12205 once this error is fixed.


pxLi commented Feb 24, 2025

> Need to revert #12205 once this error is fixed.

Yep, we will keep the issue open until the real fix is in.


pxLi commented Feb 26, 2025

> I believe we should look at the timing of initialization here and see if we can repro it.

I ran this more than 100 times on different GPU instances with the WAR change #12205 reverted and got random repros at a rate below 5%; currently there is no way to reproduce it stably. On A30 instances I got a ~30% repro rate.

The only common thing in my rare repros is that, right before the suite aborts, it throws the extra [Upstream 1073741824B] message that a successful run does not:

```
[2025-02-26T04:17:52.906Z] - mixed blocking alloc with spill
[2025-02-26T04:17:53.842Z] [  1308][04:17:53:609038][error ] [A][Stream 0x1][Upstream 1073741824B][FAILURE maximum pool size exceeded]
[2025-02-26T04:17:53.842Z] SUITE ABORTED - HostAllocSuite: Error initializing pinned memory pool
```

This seems to indicate that the failure happens during the afterAll() phase of this suite, while it tries to reinitialize the pinned memory pool for the other cases (https://github.com/NVIDIA/spark-rapids/blob/branch-25.04/sql-plugin/src/test/scala/com/nvidia/spark/rapids/HostAllocSuite.scala#L355-L356), as shown by the stack trace below (see the sketch after it):

```
[2025-02-26T05:25:20.015Z]   java.lang.RuntimeException: Error initializing pinned memory pool
[2025-02-26T05:25:20.015Z]   at ai.rapids.cudf.PinnedMemoryPool.getSingleton(PinnedMemoryPool.java:92)
[2025-02-26T05:25:20.015Z]   at ai.rapids.cudf.PinnedMemoryPool.getTotalPoolSizeBytes(PinnedMemoryPool.java:212)
[2025-02-26T05:25:20.015Z]   at com.nvidia.spark.rapids.HostAlloc.<init>(HostAlloc.scala:29)
[2025-02-26T05:25:20.015Z]   at com.nvidia.spark.rapids.HostAlloc$.initialize(HostAlloc.scala:267)
[2025-02-26T05:25:20.015Z]   at com.nvidia.spark.rapids.HostAllocSuite.afterAll(HostAllocSuite.scala:354)
```
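
Assuming that reading is right, the fragile step would be the shutdown-then-reinitialize ordering in afterAll(). A hedged sketch of that sequence, using the cudf Java PinnedMemoryPool API as I understand it (the barrier at the end is an illustrative addition, not the suite's actual code):

```scala
import ai.rapids.cudf.PinnedMemoryPool

object AfterAllReinitSketch {
  def main(args: Array[String]): Unit = {
    // Suspected-fragile teardown ordering: release the suite's pool, then
    // re-create a pool for the suites that run afterwards.
    PinnedMemoryPool.shutdown()

    // If the old pinned reservation has not been fully released by the time
    // this runs, the fresh 1 GiB request can fail with "maximum pool size
    // exceeded", matching the [Upstream 1073741824B] error before the abort.
    PinnedMemoryPool.initialize(1L << 30) // 1073741824B = 1 GiB

    // Illustrative barrier (not the suite's actual code): force the async
    // creation to finish here so any failure surfaces in afterAll() itself.
    PinnedMemoryPool.getTotalPoolSizeBytes()
  }
}
```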


pxLi commented Feb 26, 2025

I created an internal job, https://<JENKIN_URL>/job/repro-12194, to help reproduce this on internal A30 instances (it may take a few runs to get a repro; the rate is ~50%).

Thanks!
