
[BUG] HostAllocSuite failed "Maximum pool size exceeded" in nightly UT tests #12039

Closed
pxLi opened this issue Jan 28, 2025 · 1 comment

Labels: bug (Something isn't working), invalid (This doesn't seem right)

Comments

pxLi (Member) commented Jan 28, 2025

Describe the bug
First seen in rapids_nightly-pre_release-github, run 754. Please keep monitoring subsequent runs.

Failed on Spark shims 321, 323, 331, and 334; passed on all other shims.

```
[2025-01-28T06:05:35.112Z] HostAllocSuite:
[2025-01-28T06:05:35.367Z] [2025-01-28 06:05:35.223] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:35.368Z] - simple pinned tryAlloc
[2025-01-28T06:05:36.289Z] [2025-01-28 06:05:36.241] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:36.290Z] [2025-01-28 06:05:36.242] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:36.290Z] - simple non-pinned tryAlloc
[2025-01-28T06:05:37.647Z] [2025-01-28 06:05:37.252] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:37.648Z] [2025-01-28 06:05:37.253] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:37.648Z] - simple mixed tryAlloc
[2025-01-28T06:05:39.007Z] [2025-01-28 06:05:38.613] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:39.008Z] - simple pinned blocking alloc
[2025-01-28T06:05:39.931Z] [2025-01-28 06:05:39.636] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:39.932Z] [2025-01-28 06:05:39.637] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:39.932Z] [2025-01-28 06:05:39.648] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:39.932Z] - simple non-pinned blocking alloc
[2025-01-28T06:05:40.853Z] [2025-01-28 06:05:40.662] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:40.854Z] - simple mixed blocking alloc
[2025-01-28T06:05:41.781Z] [2025-01-28 06:05:41.692] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:41.782Z] - pinned blocking alloc with spill
[2025-01-28T06:05:43.139Z] [2025-01-28 06:05:42.744] [RMM] [error] [A][Stream 0][Upstream 4096B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:43.140Z] [2025-01-28 06:05:42.746] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:43.140Z] [2025-01-28 06:05:42.753] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:43.140Z] - non-pinned blocking alloc with spill
[2025-01-28T06:05:44.061Z] [2025-01-28 06:05:43.769] [RMM] [error] [A][Stream 0][Upstream 256B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:44.062Z] - mixed blocking alloc with spill
[2025-01-28T06:05:44.985Z] [2025-01-28 06:05:44.792] [RMM] [error] [A][Stream 0x1][Upstream 1073741824B][FAILURE maximum pool size exceeded]
[2025-01-28T06:05:44.985Z] SUITE ABORTED - HostAllocSuite: Error initializing pinned memory pool
[2025-01-28T06:05:44.985Z]   java.lang.RuntimeException: Error initializing pinned memory pool
[2025-01-28T06:05:44.985Z]   at ai.rapids.cudf.PinnedMemoryPool.getSingleton(PinnedMemoryPool.java:92)
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.PinnedMemoryPool.getTotalPoolSizeBytes(PinnedMemoryPool.java:212)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAlloc.<init>(HostAlloc.scala:29)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAlloc$.initialize(HostAlloc.scala:267)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAllocSuite.afterAll(HostAllocSuite.scala:354)
[2025-01-28T06:05:44.986Z]   at org.scalatest.BeforeAndAfterAll.$anonfun$run$1(BeforeAndAfterAll.scala:225)
[2025-01-28T06:05:44.986Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377)
[2025-01-28T06:05:44.986Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373)
[2025-01-28T06:05:44.986Z]   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:962)
[2025-01-28T06:05:44.986Z]   at org.scalatest.Status.withAfterEffect(Status.scala:373)
[2025-01-28T06:05:44.986Z]   ...
[2025-01-28T06:05:44.986Z]   Cause: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-393-cuda11/target/libcudf/cmake-build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:276: Maximum pool size exceeded
[2025-01-28T06:05:44.986Z]   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
[2025-01-28T06:05:44.986Z]   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.PinnedMemoryPool.getSingleton(PinnedMemoryPool.java:90)
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.PinnedMemoryPool.getTotalPoolSizeBytes(PinnedMemoryPool.java:212)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAlloc.<init>(HostAlloc.scala:29)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAlloc$.initialize(HostAlloc.scala:267)
[2025-01-28T06:05:44.986Z]   at com.nvidia.spark.rapids.HostAllocSuite.afterAll(HostAllocSuite.scala:354)
[2025-01-28T06:05:44.986Z]   at org.scalatest.BeforeAndAfterAll.$anonfun$run$1(BeforeAndAfterAll.scala:225)
[2025-01-28T06:05:44.986Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377)
[2025-01-28T06:05:44.986Z]   at org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373)
[2025-01-28T06:05:44.986Z]   ...
[2025-01-28T06:05:44.986Z]   Cause: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-393-cuda11/target/libcudf/cmake-build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:276: Maximum pool size exceeded
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.Rmm.newPinnedPoolMemoryResource(Native Method)
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.PinnedMemoryPool.<init>(PinnedMemoryPool.java:225)
[2025-01-28T06:05:44.986Z]   at ai.rapids.cudf.PinnedMemoryPool.lambda$initialize$1(PinnedMemoryPool.java:142)
[2025-01-28T06:05:44.986Z]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[2025-01-28T06:05:44.986Z]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2025-01-28T06:05:44.986Z]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2025-01-28T06:05:44.986Z]   at java.lang.Thread.run(Thread.java:750)
[2025-01-28T06:05:44.986Z]   ...
```
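For readers unfamiliar with the failure mode in the log: RMM's pool memory resource is configured with a maximum size, and any allocation that would grow the pool past that maximum fails with "Maximum pool size exceeded" rather than falling back to the system allocator. The minimal Python sketch below (not the RMM implementation; `FixedPool` and `MaximumPoolSizeExceeded` are hypothetical names for illustration) mimics that bookkeeping, including the way small upstream requests like the 256 B ones in the log fail once the pool is exhausted.

```python
class MaximumPoolSizeExceeded(MemoryError):
    """Raised when an allocation would push the pool past its maximum size."""


class FixedPool:
    """Toy bounded pool: tracks bytes outstanding against a hard maximum."""

    def __init__(self, maximum_size: int):
        self.maximum_size = maximum_size
        self.allocated = 0

    def allocate(self, nbytes: int) -> None:
        # Like RMM's pool resource, refuse (rather than fall back) when the
        # request would exceed the configured maximum pool size.
        if self.allocated + nbytes > self.maximum_size:
            raise MaximumPoolSizeExceeded(
                f"maximum pool size exceeded: "
                f"{self.allocated + nbytes} > {self.maximum_size}"
            )
        self.allocated += nbytes

    def deallocate(self, nbytes: int) -> None:
        self.allocated -= nbytes


if __name__ == "__main__":
    pool = FixedPool(maximum_size=4096)
    pool.allocate(4096)   # fills the pool to its maximum
    try:
        pool.allocate(256)  # exceeds the maximum, like the 256B requests above
    except MaximumPoolSizeExceeded as e:
        print("alloc failed:", e)
```

The relevance to this issue: the suite aborted because the pinned pool itself could not be initialized, i.e. the 1 GiB (1073741824 B) initialization request at the end of the log was the allocation that hit the limit, suggesting host pinned memory was already consumed on the CI node.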

Steps/Code to reproduce bug
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.

Expected behavior
A clear and concise description of what you expected to happen.

Environment details (please complete the following information)

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue

Additional context
Add any other context about the problem here.

@pxLi pxLi added ? - Needs Triage Need team to review and classify bug Something isn't working labels Jan 28, 2025
@pxLi pxLi closed this as completed Jan 29, 2025
@sameerz sameerz added invalid This doesn't seem right and removed ? - Needs Triage Need team to review and classify labels Feb 17, 2025
pxLi (Member, Author) commented Feb 21, 2025

This was closed because it did not reproduce for a few days, but we saw the occurrence again recently, so filed #12194.
