
[BUG] HostAllocSuite failed "Maximum pool size exceeded" in nightly UT tests #12039

Closed
pxLi opened this issue Jan 28, 2025 · 1 comment

Labels: bug (Something isn't working), invalid (This doesn't seem right)

Comments

pxLi (Member) commented Jan 28, 2025

Describe the bug
First seen in rapids_nightly-pre_release-github, run 754. Please keep monitoring subsequent runs.

Failed on Spark shims 321, 323, 331, and 334; passed on all other shims.

```
[2025-01-28T06:05:35.112Z] HostAllocSuite:
[2025-01-28T06:05:35.367Z] [2025-01-28 06:05:35.223] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:35.368Z] - simple pinned tryAlloc
[2025-01-28T06:05:36.289Z] [2025-01-28 06:05:36.241] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:36.290Z] [2025-01-28 06:05:36.242] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:36.290Z] - simple non-pinned tryAlloc
[2025-01-28T06:05:37.647Z] [2025-01-28 06:05:37.252] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:37.648Z] [2025-01-28 06:05:37.253] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:37.648Z] - simple mixed tryAlloc
[2025-01-28T06:05:39.007Z] [2025-01-28 06:05:38.613] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:39.008Z] - simple pinned blocking alloc
[2025-01-28T06:05:39.931Z] [2025-01-28 06:05:39.636] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:39.932Z] [2025-01-28 06:05:39.637] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:39.932Z] [2025-01-28 06:05:39.648] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:39.932Z] - simple non-pinned blocking alloc
[2025-01-28T06:05:40.853Z] [2025-01-28 06:05:40.662] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:40.854Z] - simple mixed blocking alloc
[2025-01-28T06:05:41.781Z] [2025-01-28 06:05:41.692] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:41.782Z] - pinned blocking alloc with spill
[2025-01-28T06:05:43.139Z] [2025-01-28 06:05:42.744] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:43.140Z] [2025-01-28 06:05:42.746] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:43.140Z] [2025-01-28 06:05:42.753] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:43.140Z] - non-pinned blocking alloc with spill
[2025-01-28T06:05:44.061Z] [2025-01-28 06:05:43.769] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:44.062Z] - mixed blocking alloc with spill
[2025-01-28T06:05:44.985Z] [2025-01-28 06:05:44.792] [RMM] [error] [A][Stream 0x1][Upstream 1073741824B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:44.985Z] SUITE ABORTED - HostAllocSuite: Error initializing pinned memory pool
[2025-01-28T06:05:44.985Z]   java.lang.RuntimeException: Error initializing pinned memory pool
[2025-01-28T06:05:44.985Z]   at ai.rapids.cudf.PinnedMemoryPool.getSingleton(PinnedMemoryPool.java:92)
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.PinnedMemoryPool.getTotalPoolSizeBytes(PinnedMemoryPool.java:212)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAlloc.<init>(HostAlloc.scala:29)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAlloc$.initialize(HostAlloc.scala:267)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAllocSuite.afterAll(HostAllocSuite.scala:354)
[2025-01-28T06:05:44.986Z]   at org.scalatest.BeforeAndAfterAll.$anonfun$run$1(BeforeAndAfterAll.scala:225)
[2025-01-28T06:05:44.986Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377)
[2025-01-28T06:05:44.986Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373)
[2025-01-28T06:05:44.986Z]   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:962)
[2025-01-28T06:05:44.986Z]   at org.scalatest.Status.withAfterEffect(Status.scala:373)
[2025-01-28T06:05:44.986Z]   ...
[2025-01-28T06:05:44.986Z]   Cause: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-393-cuda11/target/libcudf/cmake-build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:276: Maximum pool size exceeded
[2025-01-28T06:05:44.986Z]   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
[2025-01-28T06:05:44.986Z]   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.PinnedMemoryPool.getSingleton(PinnedMemoryPool.java:90)
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.PinnedMemoryPool.getTotalPoolSizeBytes(PinnedMemoryPool.java:212)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAlloc.<init>(HostAlloc.scala:29)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAlloc$.initialize(HostAlloc.scala:267)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAllocSuite.afterAll(HostAllocSuite.scala:354)
[2025-01-28T06:05:44.986Z]   at org.scalatest.BeforeAndAfterAll.$anonfun$run$1(BeforeAndAfterAll.scala:225)
[2025-01-28T06:05:44.986Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377)
[2025-01-28T06:05:44.986Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373)
[2025-01-28T06:05:44.986Z]   ...
[2025-01-28T06:05:44.986Z]   Cause: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-393-cuda11/target/libcudf/cmake-build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:276: Maximum pool size exceeded
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.Rmm.newPinnedPoolMemoryResource(Native Method)
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.PinnedMemoryPool.<init>(PinnedMemoryPool.java:225)
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.PinnedMemoryPool.lambda$initialize$1(PinnedMemoryPool.java:142)
[2025-01-28T06:05:44.986Z]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[2025-01-28T06:05:44.986Z]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2025-01-28T06:05:44.986Z]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2025-01-28T06:05:44.986Z]   at java.lang.Thread.run(Thread.java:750)
[2025-01-28T06:05:44.986Z]   ...
```
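For readers unfamiliar with the failure mode in the log: RMM's pool memory resource is configured with a maximum size, and any allocation that would grow the pool past that maximum fails with "Maximum pool size exceeded" rather than falling back to the system allocator. The minimal Python sketch below (not the RMM implementation; `FixedPool` and `MaximumPoolSizeExceeded` are hypothetical names for illustration) mimics that bookkeeping, including the way small upstream requests like the 256 B ones in the log fail once the pool is exhausted.

```python
class MaximumPoolSizeExceeded(MemoryError):
    """Raised when an allocation would push the pool past its maximum size."""


class FixedPool:
    """Toy bounded pool: tracks bytes outstanding against a hard maximum."""

    def __init__(self, maximum_size: int):
        self.maximum_size = maximum_size
        self.allocated = 0

    def allocate(self, nbytes: int) -> None:
        # Like RMM's pool resource, refuse (rather than fall back) when the
        # request would exceed the configured maximum pool size.
        if self.allocated + nbytes > self.maximum_size:
            raise MaximumPoolSizeExceeded(
                f"maximum pool size exceeded: "
                f"{self.allocated + nbytes} > {self.maximum_size}"
            )
        self.allocated += nbytes

    def deallocate(self, nbytes: int) -> None:
        self.allocated -= nbytes


if __name__ == "__main__":
    pool = FixedPool(maximum_size=4096)
    pool.allocate(4096)   # fills the pool to its maximum
    try:
        pool.allocate(256)  # exceeds the maximum, like the 256B requests above
    except MaximumPoolSizeExceeded as e:
        print("alloc failed:", e)
```

The relevance to this issue: the suite aborted because the pinned pool itself could not be initialized, i.e. the 1 GiB (1073741824 B) initialization request at the end of the log was the allocation that hit the limit, suggesting host pinned memory was already consumed on the CI node.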

Steps/Code to reproduce bug
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.

Expected behavior
A clear and concise description of what you expected to happen.

Environment details (please complete the following information)

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue

Additional context
Add any other context about the problem here.

@pxLi pxLi added ? - Needs Triage Need team to review and classify bug Something isn't working labels Jan 28, 2025
@pxLi pxLi closed this as completed Jan 29, 2025
@sameerz sameerz added invalid This doesn't seem right and removed ? - Needs Triage Need team to review and classify labels Feb 17, 2025
pxLi (Member, Author) commented Feb 21, 2025

This was closed because it did not reproduce for a few days, but we saw the occurrence again recently, so filed #12194.
