Set parallelism for the parallelize job in recursiveListDirs #3708

Merged on Sep 23, 2024 (1 commit)

@zsxwing zsxwing commented Sep 23, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

DeltaFileOperations.recursiveListDirs calls parallelize without specifying the parallelism, so the job always uses the default parallelism, which is typically the number of available cores on the cluster. When a cluster has many cores but subDirs has only a few entries, the job launches many empty tasks.

This PR makes a small change to pass subDirs.length.min(spark.sparkContext.defaultParallelism) as the parallelism, so that when subDirs has fewer entries than the number of available cores, no empty tasks are launched.
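
To illustrate the effect, here is a minimal sketch in plain Scala with no Spark dependency. It is not the code from this PR: `ParallelismSketch`, `chooseParallelism`, and `sliceSizes` are hypothetical names, and the even, contiguous split in `sliceSizes` is only an assumption about how a local collection gets divided into tasks. The example values (3 sub-directories, default parallelism of 64) are made up.

```scala
object ParallelismSketch {
  // The expression from the PR description, with plain Ints standing in for
  // subDirs.length and spark.sparkContext.defaultParallelism.
  def chooseParallelism(numSubDirs: Int, defaultParallelism: Int): Int =
    numSubDirs.min(defaultParallelism)

  // Assumption: elements are split into numSlices contiguous ranges of
  // near-equal size. Returns the number of elements in each slice.
  def sliceSizes(numElements: Int, numSlices: Int): Seq[Int] =
    (0 until numSlices).map { i =>
      val start = ((i * numElements.toLong) / numSlices).toInt
      val end = (((i + 1) * numElements.toLong) / numSlices).toInt
      end - start
    }

  def main(args: Array[String]): Unit = {
    val numSubDirs = 3          // hypothetical: few directories to list
    val defaultParallelism = 64 // hypothetical: many cores on the cluster

    // Before: one slice per core, regardless of how little work there is.
    val before = sliceSizes(numSubDirs, defaultParallelism)
    // After: never more slices than there are directories.
    val after = sliceSizes(numSubDirs, chooseParallelism(numSubDirs, defaultParallelism))

    println(s"empty tasks before: ${before.count(_ == 0)}")
    println(s"empty tasks after: ${after.count(_ == 0)}")
  }
}
```

With these example values the sketch reports 61 empty tasks before the change and 0 after. When subDirs has more entries than the default parallelism, `chooseParallelism` returns the default parallelism, so the behavior for large listings is unchanged.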

How was this patch tested?

Existing tests.

Does this PR introduce any user-facing changes?

No

@scottsand-db scottsand-db merged commit 538e736 into delta-io:master Sep 23, 2024
14 of 17 checks passed
@zsxwing zsxwing deleted the parallelism-recursiveListDirs branch September 23, 2024 17:12
2 participants