feat: add stat to Remove action #633

Open
wants to merge 4 commits into
base: main

@sebastiantia (Collaborator) commented Jan 9, 2025

What changes are proposed in this pull request?

This PR adds the stat field to the Remove action.
It also includes a minor addition to handle partition values in Remove actions.

How was this change tested?

  • Introduced test to parse the stat field from Remove actions
  • Introduced test to parse the partition_values field from Remove actions
  • Extended data_skipping_filter test to verify Remove actions are appropriately being filtered by predicates (DataSkipping currently only filters Add actions)
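
For readers unfamiliar with the shape of the change, here is a minimal sketch of a Remove action that carries optional stats and partition values. The struct is a simplified, hypothetical stand-in and not the kernel's actual definition; the `stats` name follows the linked issue title, and the sample path, partition column, and stats JSON are made up for illustration.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for a Remove action (not the kernel's real struct).
#[derive(Debug, Default, Clone, PartialEq)]
struct Remove {
    path: String,
    data_change: bool,
    /// May be absent on a remove, hence `Option`.
    partition_values: Option<HashMap<String, String>>,
    /// JSON-encoded file statistics; may be absent, hence `Option`.
    stats: Option<String>,
}

/// Builds a remove with both optional fields populated (all values are made up).
fn example_remove() -> Remove {
    Remove {
        path: "fake_path_1".to_string(),
        data_change: true,
        partition_values: Some(HashMap::from([(
            "date".to_string(),
            "2025-01-09".to_string(),
        )])),
        stats: Some(r#"{"numRecords":3,"maxValues":{"id":5}}"#.to_string()),
    }
}

fn main() {
    let with_stats = example_remove();
    // Fields left out fall back to `None` via `Default`.
    let without_stats = Remove {
        path: "fake_path_2".to_string(),
        ..Default::default()
    };
    println!("{:?}", with_stats.stats);
    println!("{:?}", without_stats.stats);
}
```

Keeping both fields as `Option` means a remove that omits them still parses, and callers have to handle the missing case explicitly.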

resolves #567

@sebastiantia changed the title from "feat: add stat to Removeaction" to "feat: add stat to Remove action" on Jan 9, 2025
kernel/src/actions/visitors.rs (outdated, resolved)
if let Some(path) = getters[0].get_opt(i, "remove.path")? {
self.removes.push(Self::visit_remove(i, path, getters)?);
break;
Collaborator

woah I didn't realize the break was there from the beginning

Collaborator

I mean, this has been dead code for a while, but what in the world? 🤦

Collaborator

good find lol!
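
To illustrate why the stray `break` matters, here is a small sketch, assuming the snippet above sits inside a per-row loop. The two functions are hypothetical stand-ins for the visitor: each row is modeled as an optional path rather than going through the kernel's getters.

```rust
/// Collects every non-null path, visiting all rows.
fn collect_all(rows: &[Option<&str>]) -> Vec<String> {
    let mut removes = Vec::new();
    for row in rows {
        if let Some(path) = row {
            removes.push(path.to_string());
        }
    }
    removes
}

/// Same loop with a stray `break`: stops after the first non-null path.
fn collect_with_break(rows: &[Option<&str>]) -> Vec<String> {
    let mut removes = Vec::new();
    for row in rows {
        if let Some(path) = row {
            removes.push(path.to_string());
            break; // every later row is silently ignored
        }
    }
    removes
}

fn main() {
    let rows = [None, Some("a.parquet"), Some("b.parquet")];
    println!("{:?}", collect_all(&rows)); // ["a.parquet", "b.parquet"]
    println!("{:?}", collect_with_break(&rows)); // ["a.parquet"]
}
```

With the `break` in place, a batch containing several removes would yield only the first one.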

kernel/src/actions/visitors.rs (outdated, resolved)
kernel/src/actions/visitors.rs (outdated, resolved)
kernel/src/actions/visitors.rs (outdated, resolved)
@@ -515,6 +516,13 @@ async fn data_skipping_filter() {
data_change: true,
..Default::default()
}),
// Remove action with max value id = 5
Action::Remove(Remove {
Collaborator

I'd also like to see a case where the remove action is filtered out.

Collaborator

I'm not even 100% sure that data skipping works on removes. If it doesn't work and there's significant work involved in making it work, we can take that up in a followup PR.

Collaborator

Note that the action above with fake_path_2 would've been filtered anyway because it's paired with an add action, so the stats weren't what caused it to be filtered out.

Collaborator Author

Right, looks like you're correct. After making the path unique for the remove action, the filter does not filter the action out as expected. Investigating.

Collaborator

Data skipping does not work reliably on removes. Even if it did work, there's technically a risk that we might filter out a remove that has stats, but fail to filter out an older add for the same file that lacks stats. That could happen, for example, if the file was imported from a Parquet or Iceberg table and stats were only added later.
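
The risk described here can be sketched with a toy log replay. Everything below is a hypothetical simplification, not the kernel's log replay: `max_id` stands in for parsed stats, the predicate is fixed to `id > threshold`, and actions are replayed newest-first with the first action seen for a path winning.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq)]
enum Kind {
    Add,
    Remove,
}

/// Hypothetical, simplified log action; `max_id` stands in for parsed stats.
#[derive(Clone, Copy)]
struct Action {
    kind: Kind,
    path: &'static str,
    max_id: Option<i64>,
}

/// Data skipping for the predicate `id > threshold`: an action can be skipped
/// only when stats exist and prove that no row can match.
fn can_skip(action: &Action, threshold: i64) -> bool {
    matches!(action.max_id, Some(max) if max <= threshold)
}

/// Replays actions newest-first; the first action seen for a path wins.
/// When `skip_removes` is true, removes are also subject to data skipping.
fn live_files(newest_first: &[Action], threshold: i64, skip_removes: bool) -> Vec<&'static str> {
    let mut seen = HashSet::new();
    let mut live = Vec::new();
    for action in newest_first {
        let skippable = can_skip(action, threshold);
        if skippable && (action.kind == Kind::Add || skip_removes) {
            continue; // filtered out before reconciliation ever sees it
        }
        if seen.insert(action.path) && action.kind == Kind::Add {
            live.push(action.path);
        }
    }
    live
}

fn main() {
    // Newest first: a remove with stats, then the older add without stats.
    let log = [
        Action { kind: Kind::Remove, path: "f1", max_id: Some(5) },
        Action { kind: Kind::Add, path: "f1", max_id: None },
    ];
    println!("{:?}", live_files(&log, 5, false)); // []
    println!("{:?}", live_files(&log, 5, true)); // ["f1"]
}
```

In the second call the remove is skipped on its stats, so nothing cancels the older add, which has no stats to skip on and therefore survives replay even though the file was removed.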

Collaborator Author

👍 Thanks for your input, will note this and move on for now

Collaborator

@scovich For the purposes of CDF: suppose data skipping failed to filter the add. The rows of the add's data file should eventually be filtered by the predicate, so this should be fine, right?

Collaborator

Looks like data skipping looks exclusively at add actions, so this is definitely a separate PR. I do still wonder whether data skipping could be leveraged for removes.

codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 88.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 83.68%. Comparing base (ba37b62) to head (c1bf5e1).

Files with missing lines          Patch %   Lines
kernel/src/actions/visitors.rs    87.83%    4 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #633      +/-   ##
==========================================
+ Coverage   83.43%   83.68%   +0.25%     
==========================================
  Files          75       75              
  Lines       16922    16986      +64     
  Branches    16922    16986      +64     
==========================================
+ Hits        14119    14215      +96     
+ Misses       2146     2103      -43     
- Partials      657      668      +11     

Collaborator

@OussamaSaoudi-db left a comment

Just a small nit on that note. Besides that, LGTM!

kernel/src/table_changes/log_replay/tests.rs (outdated, resolved)
Collaborator

@zachschuermann left a comment

looking good! just a few comments!

kernel/src/actions/mod.rs (outdated, resolved)
kernel/src/actions/mod.rs (outdated, resolved)
kernel/src/actions/visitors.rs (outdated, resolved)
kernel/src/actions/visitors.rs (resolved)
Successfully merging this pull request may close these issues.

Remove action should have an optional stats field
4 participants