
[BUG] spark.rapids.memory.gpu.pool=NONE is not respected with spark.rapids.sql.python.gpu.enabled=true #12228

Open
rishic3 opened this issue Feb 25, 2025 · 1 comment
Labels
? - Needs Triage, bug

Comments


rishic3 commented Feb 25, 2025

Describe the bug
To disable RMM pooling, the docs say to set spark.rapids.memory.gpu.pool=NONE rather than the deprecated config spark.rapids.memory.gpu.pooling.enabled=false. However, the new config is not respected when Python GPU scheduling is enabled, i.e. when spark.rapids.sql.python.gpu.enabled=true.

Steps/Code to reproduce bug

Minimal repro with a Python UDF:

import time

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

@pandas_udf("int")
def simple_udf(x):
    time.sleep(1)
    return -x

# A primitive-type schema yields a single column named "value".
df = spark.createDataFrame([1, 2], schema=IntegerType())
df.withColumn("negvalue", simple_udf("value")).collect()
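For completeness, the same set of configs can also be applied when building the session programmatically; below is a minimal sketch of the failing configuration (scenario 2 below), with a placeholder RAPIDS jar path.

import os
from pyspark.sql import SparkSession

# Placeholder jar path; point this at the actual rapids-4-spark jar.
rapids_jar = os.environ.get("RAPIDS_JAR", "/opt/rapids/rapids-4-spark.jar")

spark = (
    SparkSession.builder
    .config("spark.jars", rapids_jar)
    .config("spark.executorEnv.PYTHONPATH", rapids_jar)
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.explain", "ALL")
    .config("spark.rapids.memory.gpu.allocFraction", "0.5")
    .config("spark.rapids.memory.pinnedPool.size", "0")
    .config("spark.python.daemon.module", "rapids.daemon")
    .config("spark.rapids.sql.python.gpu.enabled", "true")  # scenario 2: Python GPU scheduling on
    .config("spark.rapids.memory.gpu.pool", "NONE")         # expected to disable RMM pooling
    .getOrCreate()
)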
1. Running with spark.python.daemon.module=rapids.daemon and the new config spark.rapids.memory.gpu.pool=NONE:
spark.jars=$RAPIDS_JAR
spark.executorEnv.PYTHONPATH=$RAPIDS_JAR
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.explain=ALL
spark.rapids.memory.gpu.allocFraction=0.5
spark.rapids.memory.pinnedPool.size=0

spark.python.daemon.module=rapids.daemon
spark.rapids.memory.gpu.pool=NONE

Pooling does not occur on the Python worker, as expected.

2. However, if we enable GPU scheduling for the Python worker via spark.rapids.sql.python.gpu.enabled=true:
spark.jars=$RAPIDS_JAR
spark.executorEnv.PYTHONPATH=$RAPIDS_JAR
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.explain=ALL
spark.rapids.memory.gpu.allocFraction=0.5
spark.rapids.memory.pinnedPool.size=0

spark.python.daemon.module=rapids.daemon
spark.rapids.sql.python.gpu.enabled=true  # add python gpu scheduling
spark.rapids.memory.gpu.pool=NONE

Then pooling does occur on the Python worker:
> DEBUG: Pooled memory, pool size: 2965.8984375 MiB, max size: 8796093022208.0 MiB
even though it should honor the JVM pooling config spark.rapids.memory.gpu.pool=NONE (a way to inspect the worker's RMM state directly is sketched after this list).

3. Then, if we revert to the old pooling config spark.rapids.memory.gpu.pooling.enabled=false:
spark.jars=$RAPIDS_JAR
spark.executorEnv.PYTHONPATH=$RAPIDS_JAR
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.explain=ALL
spark.rapids.memory.gpu.allocFraction=0.5
spark.rapids.memory.pinnedPool.size=0

spark.python.daemon.module=rapids.daemon
spark.rapids.sql.python.gpu.enabled=true
spark.rapids.memory.gpu.pooling.enabled=false  # switch to old config

Pooling does not occur on the Python worker. So with spark.rapids.sql.python.gpu.enabled=true, the plugin should respect the new config as well.
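To check the pool state from inside the Python worker rather than relying on the daemon's DEBUG log, a diagnostic UDF along the following lines can help. This is only a sketch: it assumes the rapids.daemon worker configures GPU memory through RMM so that rmm.mr.get_current_device_resource() reflects the active pool, and in practice the pool may be wrapped in adaptor resources, so the isinstance check is approximate.

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def report_rmm_state(x: pd.Series) -> pd.Series:
    # Runs on the Python worker; reports which RMM memory resource is active there.
    import rmm
    mr = rmm.mr.get_current_device_resource()
    pooled = isinstance(mr, rmm.mr.PoolMemoryResource)
    return pd.Series([f"pooled={pooled} ({type(mr).__name__})"] * len(x))

# With spark.rapids.memory.gpu.pool=NONE this should report pooled=False.
df.withColumn("rmm_state", report_rmm_state("value")).show(truncate=False)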

rishic3 added the ? - Needs Triage and bug labels on Feb 25, 2025

lijinf2 commented Feb 25, 2025

I found that when the new and old configs co-exist, pooling also occurs:

spark.rapids.memory.gpu.pool=NONE
spark.rapids.memory.gpu.pooling.enabled=false

It would be good to cover this case in the test as well.
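For whichever test gets added, the expected precedence (as I read the docs cited above: the new config wins, the deprecated one is only a fallback) could be captured roughly as below. The helper is hypothetical, not the plugin's actual implementation; it just pins down the behavior the three cases in this thread should produce.

def pool_enabled(conf: dict) -> bool:
    # Hypothetical precedence check: the new config, when set, takes priority.
    new = conf.get("spark.rapids.memory.gpu.pool")
    if new is not None:
        return new.upper() != "NONE"
    # Fall back to the deprecated config only when the new one is unset.
    return conf.get("spark.rapids.memory.gpu.pooling.enabled", "true").lower() == "true"

# All three combinations from this thread should leave pooling disabled on the worker.
assert not pool_enabled({"spark.rapids.memory.gpu.pool": "NONE"})
assert not pool_enabled({"spark.rapids.memory.gpu.pooling.enabled": "false"})
assert not pool_enabled({"spark.rapids.memory.gpu.pool": "NONE",
                         "spark.rapids.memory.gpu.pooling.enabled": "false"})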

eordentlich pushed a commit to NVIDIA/spark-rapids-ml that referenced this issue Feb 26, 2025
Reverts #842.
rapids.daemon Python workers do not respect the new pooling config when
`spark.rapids.sql.python.gpu.enabled=true` (see [this
issue](NVIDIA/spark-rapids#12228)); the old config must be used to disable pooling.

---------

Signed-off-by: Rishi Chandra <rishic@nvidia.com>