@mbutrovich
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Support a native scan (tested with COMET_PARQUET_SCAN_IMPL=native_datafusion) as a child of the native write. Previously, the native scan child was never converted.

What changes are included in this PR?

One change to reset the firstNativeOp flag and a lot of documentation to explain why.

How are these changes tested?

The existing test, run with COMET_PARQUET_SCAN_IMPL=native_datafusion.

@codecov-commenter

codecov-commenter commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.17%. Comparing base (f09f8af) to head (dcc5342).
⚠️ Report is 732 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2839      +/-   ##
============================================
+ Coverage     56.12%   59.17%   +3.05%     
- Complexity      976     1490     +514     
============================================
  Files           119      167      +48     
  Lines         11743    15274    +3531     
  Branches       2251     2524     +273     
============================================
+ Hits           6591     9039    +2448     
- Misses         4012     4945     +933     
- Partials       1140     1290     +150     

Contributor

@comphead comphead left a comment

wondering if there could be a test for this? 🤔

@mbutrovich mbutrovich requested a review from comphead December 3, 2025 20:30
@mbutrovich
Contributor Author

> wondering if there could be a test for this? 🤔

Added.

// Perform native write
df.write.parquet(outputPath)

// Wait for listener to be called with timeout
Member

@wForget wForget Dec 4, 2025

nit: use sparkContext.listenerBus.waitUntilEmpty() or org.scalatest.concurrent.Eventually#eventually

Member

Thanks, I'll update the tests soon to use this approach
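
For illustration, the `eventually` option suggested above could look roughly like the sketch below. It assumes ScalaTest is on the classpath; `listenerCalled` and the background thread are hypothetical stand-ins for the flag that the test's listener would set, and the timeout and interval values are arbitrary. It does not show the `waitUntilEmpty()` alternative.

```scala
import java.util.concurrent.atomic.AtomicBoolean

import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Millis, Seconds, Span}

object EventuallySketch {
  // Hypothetical flag; in the real test a listener callback would set it.
  val listenerCalled = new AtomicBoolean(false)

  def main(args: Array[String]): Unit = {
    // Stand-in for the asynchronous listener: flips the flag after a short delay.
    val worker = new Thread(new Runnable {
      def run(): Unit = {
        Thread.sleep(200)
        listenerCalled.set(true)
      }
    })
    worker.start()

    // Retry the assertion until it passes or the timeout expires,
    // instead of sleeping for a fixed amount of time.
    eventually(timeout(Span(10, Seconds)), interval(Span(100, Millis))) {
      assert(listenerCalled.get(), "listener has not been called yet")
    }

    worker.join()
  }
}
```

Compared with a fixed sleep, this returns as soon as the condition holds and fails with the last assertion error if it never does.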

Member

@wForget wForget left a comment

Thanks @mbutrovich, LGTM

val outputDir = new File(outputPath)
val partFiles = outputDir.listFiles().filter(_.getName.startsWith("part-"))
// With 1000 rows and default parallelism, we should get multiple partitions
assert(partFiles.length > 1, "Expected multiple part files to be created")
Contributor

Should we check the exact number of partitions? For example, if you write a DataFrame hash-partitioned into 50 partitions, we should get 50 files.

Contributor Author

I just moved that logic. Since this is a pretty early proof-of-concept feature from @andygrove I'm not too inclined to change test behavior in this PR.
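
For reference, an exact-count check along the lines suggested above might look like the following sketch. It is not part of this PR. `countPartFiles`, the temporary directory, and the fake file names are hypothetical, and the check assumes that the writer emits exactly one `part-` file per partition and that no partition is empty. The example uses only `java.io` and `java.nio`, so it runs without Spark.

```scala
import java.io.File
import java.nio.file.Files

object PartFileCount {
  // Counts files whose names start with "part-", the same filter the test uses.
  // listFiles() returns null for a missing directory, so guard against that.
  def countPartFiles(outputDir: File): Int =
    Option(outputDir.listFiles())
      .getOrElse(Array.empty[File])
      .count(_.getName.startsWith("part-"))

  def main(args: Array[String]): Unit = {
    // Simulate an output directory with 50 part files plus a marker file.
    val dir = Files.createTempDirectory("part-file-count").toFile
    val expectedPartitions = 50
    (0 until expectedPartitions).foreach { i =>
      new File(dir, f"part-$i%05d.parquet").createNewFile()
    }
    new File(dir, "_SUCCESS").createNewFile()

    // Exact check instead of `partFiles.length > 1`.
    val actual = countPartFiles(dir)
    assert(
      actual == expectedPartitions,
      s"Expected $expectedPartitions part files but found $actual")

    // Clean up the temporary files.
    dir.listFiles().foreach(_.delete())
    dir.delete()
  }
}
```

In a real test the expected value would come from the DataFrame's partitioning (for instance, the argument passed to `repartition`), which is why the check only holds when every partition contains at least one row.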

Member

@andygrove andygrove left a comment

@andygrove andygrove merged commit fe49e40 into apache:main Dec 4, 2025
113 checks passed
5 participants