[SPARK-53633][SQL] Reuse InputStream in vectorized Parquet reader #52384
Conversation
cc @sunchao @cloud-fan @LuciferYang @viirya as explained in #50765 (comment), I split the executor-side changes into a dedicated PR, please take a look when you have time, thank you in advance.
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.util.Utils

object ParquetFooterReader {
is there a strong reason to rewrite it from Java to Scala?
over 2/3 of the code in the original file is removed, and the newly added method openFileAndReadFooter uses Scala Tuple and Option, which is ugly if written in Java
we should create a Java record to wrap it instead of using a tuple...
@cloud-fan I created a case class OpenedParquetFooter to replace the tuple, please let me know if you have a better idea.
Then why not keep it as Java? Java is more AI-friendly and I'm a bit hesitant to turn existing Java code into Scala.
Okay, let me convert it back.
@cloud-fan done, I converted it to Java now.
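For illustration, a minimal sketch of what such a Java wrapper could look like; the field names follow the discussion above, but the PR's actual OpenedParquetFooter class may be shaped differently:

```java
import java.util.Optional;

import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.SeekableInputStream;

// Hedged sketch: a record pairing the parsed footer with the (possibly
// still-open) stream it was read from. Field names are assumptions.
record OpenedParquetFooter(
    ParquetMetadata footer,
    Optional<SeekableInputStream> inputStreamOpt) {}
```

Compared with a Tuple2, this gives named accessors (footer(), inputStreamOpt()) and stays idiomatic in Java.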
val split = new FileSplit(file.toPath, file.start, file.length, Array.empty[String])
val sharedConf = broadcastedHadoopConf.value.value

val fileFooter = if (enableVectorizedReader) {
This buildReaderWithPartitionValues method is super long now, can we create some smaller methods to split it, so that this PR is easier to review?
yea we can use this trick to help review, but my point is that this method is too long and we need to split it sooner or later. Since we are changing it now, maybe we should also split it now?
3 code blocks were extracted from this method as independent methods.
LGTM mostly - just a few nits!
/**
 * Reads footer for the input Parquet file 'split'. If 'skipRowGroup' is true,
 * this will skip reading the Parquet row group metadata.
 * Build a filter for reading footer of the input Parquet file 'split'.
nit: I think the doc is outdated - there is no 'split'
the meaning has not changed, the 'split' here represents the PartitionedFile
    PartitionedFile file,
    boolean keepInputStreamOpen) throws IOException {
  var readOptions = HadoopReadOptions.builder(hadoopConf, file.toPath())
      .withMetadataFilter(buildFilter(hadoopConf, file, !keepInputStreamOpen))
maybe worth adding some comments here to explain why we choose to skip row groups when keepInputStreamOpen is false
added a comment
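For context, a hedged sketch of how such a footer filter might be chosen; the helper signature is simplified here and the PR's actual buildFilter may differ:

```java
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter.MetadataFilter;

final class FooterFilterSketch {
  // When the stream is kept open for the vectorized reader, row-group
  // metadata is needed to plan the read, so only this split's byte range is
  // requested; otherwise row groups can be skipped and only the file-level
  // footer (schema etc.) is parsed.
  static MetadataFilter buildFilter(long start, long length, boolean skipRowGroups) {
    return skipRowGroups
        ? ParquetMetadataConverter.SKIP_ROW_GROUPS
        : ParquetMetadataConverter.range(start, start + length);
  }
}
```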
// Before transferring the ownership of inputStream to the vectorizedReader,
// we must take responsibility to close the inputStream if something goes wrong
// to avoid resource leak.
val shouldCloseInputStream = new AtomicBoolean(openedFooter.inputStreamOpt.isPresent)
curious why this needs to be an AtomicBoolean? also, is a boolean flag needed? can we just do openedFooter.inputStreamOpt.ifPresent(Utils.closeQuietly)
we pass shouldCloseInputStream to def buildVectorizedIterator and the flag will be updated by that method, so we must use a reference instead of a primitive type. The suggestion works but would introduce many unnecessary close() calls in normal cases, so I added a flag to avoid that as much as possible.
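A hedged sketch of this ownership-handoff pattern; all names are illustrative, and buildVectorizedIterator below is a stub standing in for the real method:

```java
import java.io.Closeable;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

final class StreamOwnershipSketch {
  // Stub for the real reader construction: if it successfully takes
  // ownership of the stream, it clears the shared flag.
  static void buildVectorizedIterator(
      Optional<? extends Closeable> in, AtomicBoolean shouldClose) {
    // ... hand the stream to the reader, which now owns it ...
    shouldClose.set(false);
  }

  static void open(Optional<? extends Closeable> inputStreamOpt) {
    AtomicBoolean shouldClose = new AtomicBoolean(inputStreamOpt.isPresent());
    try {
      buildVectorizedIterator(inputStreamOpt, shouldClose);
    } catch (Throwable t) {
      // Close only if ownership was never transferred: this avoids both a
      // leak on failure and a double-close on the normal path.
      if (shouldClose.get()) {
        inputStreamOpt.ifPresent(StreamOwnershipSketch::closeQuietly);
      }
      throw t;
    }
  }

  static void closeQuietly(Closeable c) {
    try { c.close(); } catch (Exception ignored) { }
  }
}
```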
  }
}

// scalastyle:off argcount
I assume the following changes are just refactoring?
the effective code is at L338-344, the others are just method extraction
+1, LGTM. Thank you, @pan3793, @cloud-fan, @sunchao.
Merged to master for Apache Spark 4.1.0-preview2.
What changes were proposed in this pull request?
Reuse InputStream in vectorized Parquet reader between reading the footer and row groups, on the executor side.
This PR is part of SPARK-52011, you can check more details at #50765
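Conceptually, the change turns two file opens per task into one. All names in the sketch below are illustrative stand-ins, not Spark's actual API:

```java
// Hedged sketch of the idea. Previously the footer read and the row-group
// read each opened the file, costing one NameNode RPC apiece; now the file
// is opened once and the same stream serves both phases.
final class ReuseStreamSketch {
  interface FileStream extends AutoCloseable {}
  record Footer() {}

  static FileStream open(String path) { return () -> {}; } // stand-in open()
  static Footer readFooter(FileStream s) { return new Footer(); }
  static void readRowGroups(FileStream s, Footer f) {}

  static void scan(String path) throws Exception {
    try (FileStream s = open(path)) { // single open instead of two
      Footer footer = readFooter(s);  // footer read leaves the stream open
      readRowGroups(s, footer);       // row groups reuse the same stream
    }
  }
}
```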
Why are the changes needed?
Reduce unnecessary RPCs of NameNode to improve performance and stability for large Hadoop clusters.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
See #50765
Was this patch authored or co-authored using generative AI tooling?
No.