feat: add `stat` to `Remove` action #633
base: main
if let Some(path) = getters[0].get_opt(i, "remove.path")? {
    self.removes.push(Self::visit_remove(i, path, getters)?);
    break;
woah I didn't realize the break was there from the beginning
I mean, this has been dead code for a while, but what in the world? 🤦
good find lol!
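For readers following the thread, here is a minimal sketch of why a stray `break` in this position matters, assuming the snippet sits inside a loop over the rows of a batch. The functions and the `Option<&str>` row representation are hypothetical stand-ins, not the kernel's visitor API.

```rust
/// Mirrors the shape of the reviewed loop: the `break` exits after the
/// first row that carries a remove path, so later removes are never seen.
/// Each slice element is a hypothetical stand-in for one row's `remove.path`.
fn collect_with_break(rows: &[Option<&str>]) -> Vec<String> {
    let mut removes = Vec::new();
    for path in rows {
        if let Some(path) = path {
            removes.push(path.to_string());
            break; // stops at the first remove row
        }
    }
    removes
}

/// Same loop without the `break`: every row with a remove path is collected.
fn collect_all(rows: &[Option<&str>]) -> Vec<String> {
    let mut removes = Vec::new();
    for path in rows {
        if let Some(path) = path {
            removes.push(path.to_string());
        }
    }
    removes
}
```

Given rows `[None, Some("a"), None, Some("b")]`, the first function returns only `"a"` while the second returns both paths.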
@@ -515,6 +516,13 @@ async fn data_skipping_filter() {
            data_change: true,
            ..Default::default()
        }),
        // Remove action with max value id = 5
        Action::Remove(Remove {
I'd also like to see a case where the remove action is filtered out.
I'm not even 100% sure that data skipping works on removes. If it doesn't work and there's significant work involved in making it work, we can take that up in a followup PR.
Note that the action above with `fake_path_2` would've been filtered anyway because it's paired with an add action. So the stats weren't what caused it to be filtered out.
Right, looks like you're correct. After making the path unique for the `remove` action, the filter does not correctly filter the action. Investigating.
Data skipping does not work reliably on removes. Even if it did work, there's technically a risk that we might filter out a remove that has stats, but fail to filter out an older add for the same file that lacks stats. For example, if the file was imported from a parquet or iceberg table, and we only added stats later.
👍 Thanks for your input, will note this and move on for now
@scovich For the purposes of CDF: suppose data skipping failed to filter the add. The rows of the add data file should eventually be filtered by the predicate, so this should be fine, right?
Looks like data skipping looks exclusively at add actions, so this is definitely a separate PR. I do still wonder if data skipping could be leveraged for removes.
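To make the risk raised in this thread concrete, here is a toy model of log replay, assuming a predicate of the form `id > threshold` and a single `max_id` statistic per file. The `Action` enum, `can_skip`, and `replay` are hypothetical names for illustration and do not reflect the kernel's actual data skipping code.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified actions: only a path and an optional max `id` stat.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Add { path: String, max_id: Option<i64> },
    Remove { path: String, max_id: Option<i64> },
}

/// For the predicate `id > threshold`, a file may be skipped only when its
/// stats prove that no row can match. Missing stats mean "cannot skip".
fn can_skip(max_id: Option<i64>, threshold: i64) -> bool {
    matches!(max_id, Some(max) if max <= threshold)
}

/// Replays actions newest-first and returns the paths of surviving adds.
/// When `skip_removes` is true, stats-based skipping is applied to removes too.
fn replay(actions_newest_first: &[Action], threshold: i64, skip_removes: bool) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut live = Vec::new();
    for action in actions_newest_first {
        match action {
            Action::Remove { path, max_id } => {
                if skip_removes && can_skip(*max_id, threshold) {
                    continue; // the remove is dropped before it can shadow older adds
                }
                seen.insert(path.as_str());
            }
            Action::Add { path, max_id } => {
                if can_skip(*max_id, threshold) {
                    continue;
                }
                // Only the newest action for a path counts.
                if seen.insert(path.as_str()) {
                    live.push(path.clone());
                }
            }
        }
    }
    live
}
```

With a newer remove of file `f` that has `max_id = Some(5)`, an older add of `f` with no stats, and a threshold of 10, skipping removes lets the removed file resurface in the result, whereas leaving removes unfiltered yields an empty result.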
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #633      +/-   ##
==========================================
+ Coverage   83.43%   83.68%   +0.25%
==========================================
  Files          75       75
  Lines       16922    16986      +64
  Branches    16922    16986      +64
==========================================
+ Hits        14119    14215      +96
+ Misses       2146     2103      -43
- Partials      657      668      +11
Just a small nit on that note. Besides that, LGTM!
looking good! just a few comments!
a59b383 to c1bf5e1
What changes are proposed in this pull request?
This PR adds the `stat` field to the `Remove` action. There's also a minor addition to handle partition values in `Remove` actions that I have included in this PR.

How was this change tested?
- `stat` field from `Remove` actions
- `partition_values` field from `Remove` actions
- Extended `data_skipping_filter` test to verify `Remove` actions are appropriately being filtered by predicates (DataSkipping currently only filters `Add` actions)

resolves #567
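
As a rough illustration of the shape of this change, the sketch below shows a remove action that carries optional stats and partition values. The struct is a hypothetical, trimmed-down model and not the kernel's actual `Remove` definition; it assumes stats travel as an optional JSON string, and the helper `remove_with_stats` is invented for the example.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down remove action. Field names are assumptions
/// for illustration; the real type has more fields.
#[derive(Debug, Clone, Default, PartialEq)]
struct Remove {
    path: String,
    data_change: bool,
    /// Partition column name to value, when the writer recorded it.
    partition_values: Option<HashMap<String, String>>,
    /// Per-file statistics, assumed here to be an optional JSON string.
    stats: Option<String>,
}

/// Builds a remove with stats, leaving every other optional field unset,
/// in the same `..Default::default()` style the test in this PR uses.
fn remove_with_stats(path: &str, stats_json: &str) -> Remove {
    Remove {
        path: path.to_string(),
        data_change: true,
        stats: Some(stats_json.to_string()),
        ..Default::default()
    }
}
```

Keeping both fields optional lets the same struct represent removes written without stats or partition values.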