
[BUG] spark.rapids.memory.gpu.pool=NONE is not respected with spark.rapids.sql.python.gpu.enabled=true #12228

Open
rishic3 opened this issue Feb 25, 2025 · 1 comment
Labels
? - Needs Triage, bug

Comments


rishic3 commented Feb 25, 2025

Describe the bug
To disable RMM pooling, the docs say to set spark.rapids.memory.gpu.pool=NONE rather than the deprecated config spark.rapids.memory.gpu.pooling.enabled=false. However, the new config is not respected when Python GPU scheduling is enabled, i.e. when spark.rapids.sql.python.gpu.enabled=true.

Steps/Code to reproduce bug

Minimal repro with a Python UDF:

import time

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

@pandas_udf("int")
def simple_udf(x):
    time.sleep(1)
    return -x

# A primitive-type schema yields a single column named "value".
df = spark.createDataFrame([1, 2], schema=IntegerType())
df.withColumn("negvalue", simple_udf("value")).collect()
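For completeness, the same set of configs can also be applied when building the session programmatically; below is a minimal sketch of the failing configuration (scenario 2 below), with a placeholder RAPIDS jar path.

import os
from pyspark.sql import SparkSession

# Placeholder jar path; point this at the actual rapids-4-spark jar.
rapids_jar = os.environ.get("RAPIDS_JAR", "/opt/rapids/rapids-4-spark.jar")

spark = (
    SparkSession.builder
    .config("spark.jars", rapids_jar)
    .config("spark.executorEnv.PYTHONPATH", rapids_jar)
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.explain", "ALL")
    .config("spark.rapids.memory.gpu.allocFraction", "0.5")
    .config("spark.rapids.memory.pinnedPool.size", "0")
    .config("spark.python.daemon.module", "rapids.daemon")
    .config("spark.rapids.sql.python.gpu.enabled", "true")  # scenario 2: Python GPU scheduling on
    .config("spark.rapids.memory.gpu.pool", "NONE")         # expected to disable RMM pooling
    .getOrCreate()
)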
1. Running with spark.python.daemon.module=rapids.daemon and the new config spark.rapids.memory.gpu.pool=NONE:
spark.jars=$RAPIDS_JAR
spark.executorEnv.PYTHONPATH=$RAPIDS_JAR
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.explain=ALL
spark.rapids.memory.gpu.allocFraction=0.5
spark.rapids.memory.pinnedPool.size=0

spark.python.daemon.module=rapids.daemon
spark.rapids.memory.gpu.pool=NONE

Pooling does not occur on the Python worker, as expected.

2. However, if we enable GPU scheduling for the Python worker via spark.rapids.sql.python.gpu.enabled=true:
spark.jars=$RAPIDS_JAR
spark.executorEnv.PYTHONPATH=$RAPIDS_JAR
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.explain=ALL
spark.rapids.memory.gpu.allocFraction=0.5
spark.rapids.memory.pinnedPool.size=0

spark.python.daemon.module=rapids.daemon
spark.rapids.sql.python.gpu.enabled=true  # add python gpu scheduling
spark.rapids.memory.gpu.pool=NONE

Then pooling does occur on the Python worker:
> DEBUG: Pooled memory, pool size: 2965.8984375 MiB, max size: 8796093022208.0 MiB
even though it should honor the JVM pooling config spark.rapids.memory.gpu.pool=NONE (a way to inspect the worker's RMM state directly is sketched after this list).

3. Then, if we revert to the old pooling config spark.rapids.memory.gpu.pooling.enabled=false:
spark.jars=$RAPIDS_JAR
spark.executorEnv.PYTHONPATH=$RAPIDS_JAR
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.explain=ALL
spark.rapids.memory.gpu.allocFraction=0.5
spark.rapids.memory.pinnedPool.size=0

spark.python.daemon.module=rapids.daemon
spark.rapids.sql.python.gpu.enabled=true
spark.rapids.memory.gpu.pooling.enabled=false  # switch to old config

Pooling does not occur on the Python worker. So with spark.rapids.sql.python.gpu.enabled=true, the plugin should respect the new config as well.
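To check the pool state from inside the Python worker rather than relying on the daemon's DEBUG log, a diagnostic UDF along the following lines can help. This is only a sketch: it assumes the rapids.daemon worker configures GPU memory through RMM so that rmm.mr.get_current_device_resource() reflects the active pool, and in practice the pool may be wrapped in adaptor resources, so the isinstance check is approximate.

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def report_rmm_state(x: pd.Series) -> pd.Series:
    # Runs on the Python worker; reports which RMM memory resource is active there.
    import rmm
    mr = rmm.mr.get_current_device_resource()
    pooled = isinstance(mr, rmm.mr.PoolMemoryResource)
    return pd.Series([f"pooled={pooled} ({type(mr).__name__})"] * len(x))

# With spark.rapids.memory.gpu.pool=NONE this should report pooled=False.
df.withColumn("rmm_state", report_rmm_state("value")).show(truncate=False)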

rishic3 added the ? - Needs Triage and bug labels on Feb 25, 2025

lijinf2 commented Feb 25, 2025

I found that when the new and old configs co-exist, pooling also occurs:

spark.rapids.memory.gpu.pool=NONE
spark.rapids.memory.gpu.pooling.enabled=false

It would be good to cover this case in the test as well.
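For whichever test gets added, the expected precedence (as I read the docs cited above: the new config wins, the deprecated one is only a fallback) could be captured roughly as below. The helper is hypothetical, not the plugin's actual implementation; it just pins down the behavior the three cases in this thread should produce.

def pool_enabled(conf: dict) -> bool:
    # Hypothetical precedence check: the new config, when set, takes priority.
    new = conf.get("spark.rapids.memory.gpu.pool")
    if new is not None:
        return new.upper() != "NONE"
    # Fall back to the deprecated config only when the new one is unset.
    return conf.get("spark.rapids.memory.gpu.pooling.enabled", "true").lower() == "true"

# All three combinations from this thread should leave pooling disabled on the worker.
assert not pool_enabled({"spark.rapids.memory.gpu.pool": "NONE"})
assert not pool_enabled({"spark.rapids.memory.gpu.pooling.enabled": "false"})
assert not pool_enabled({"spark.rapids.memory.gpu.pool": "NONE",
                         "spark.rapids.memory.gpu.pooling.enabled": "false"})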

eordentlich pushed a commit to NVIDIA/spark-rapids-ml that referenced this issue Feb 26, 2025
Reverts #842.
rapids.daemon Python workers do not respect the new pooling config when
`spark.rapids.sql.python.gpu.enabled=true` (see [this
issue](NVIDIA/spark-rapids#12228)); the old config must be used to disable pooling.

---------

Signed-off-by: Rishi Chandra <rishic@nvidia.com>