Bug Fix: Position Deletes + row_filter yields less data when the DataFile is large #1141
```diff
@@ -1238,10 +1238,13 @@ def _task_to_record_batches(
         for batch in batches:
             next_index = next_index + len(batch)
             current_index = next_index - len(batch)
+            output_batches = iter([batch])
             if positional_deletes:
                 # Create the mask of indices that we're interested in
                 indices = _combine_positional_deletes(positional_deletes, current_index, current_index + len(batch))
                 batch = batch.take(indices)
+                output_batches = iter([batch])
+
             # Apply the user filter
             if pyarrow_filter is not None:
                 # we need to switch back and forth between RecordBatch and Table
```
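The `_combine_positional_deletes(...)` call above builds the index mask that is handed to `batch.take`. As a rough illustration of what that mask must contain, here is a plain-Python sketch (the real pyiceberg implementation operates on Arrow arrays; the function name and signature below are illustrative, not the library's API): for the slice of file positions `[start, end)` covered by the current batch, it keeps every position that is not deleted and rebases it to a batch-local offset.

```python
def combine_positional_deletes_sketch(deletes, start, end):
    # Hypothetical sketch: batch-local indices of rows that survive the
    # position deletes falling inside the file-position range [start, end).
    deleted = set(deletes)
    # Rebase file-level positions to batch-local offsets, keeping survivors.
    return [pos - start for pos in range(start, end) if pos not in deleted]

# Rows 0..9 of the file map to this batch; positions 2 and 5 are deleted.
print(combine_positional_deletes_sketch({2, 5}, 0, 10))
# -> [0, 1, 3, 4, 6, 7, 8, 9]
```

The rebasing step is why the real call passes both `current_index` and `current_index + len(batch)`: positional deletes are recorded as absolute row positions in the data file, while `batch.take` expects indices relative to the batch.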
```diff
@@ -1251,10 +1254,15 @@ def _task_to_record_batches(
                 arrow_table = arrow_table.filter(pyarrow_filter)
                 if len(arrow_table) == 0:
                     continue
-                batch = arrow_table.to_batches()[0]
-            yield _to_requested_schema(
-                projected_schema, file_project_schema, batch, downcast_ns_timestamp_to_us=True, use_large_types=use_large_types
-            )
+                output_batches = arrow_table.to_batches()
+            for output_batch in output_batches:
+                yield _to_requested_schema(
+                    projected_schema,
+                    file_project_schema,
+                    output_batch,
+                    downcast_ns_timestamp_to_us=True,
+                    use_large_types=use_large_types,
+                )


 def _task_to_table(
```

Review comments on the new `for output_batch in output_batches:` loop:

Comment: I'm a bit concerned about all the nested for-loops in this function. But correctness comes first, and we can always refactor later.

Reply: You are not alone 😨 I think the long-term solution is to upgrade our minimum requirement for PyArrow to 17.0.0, but as you said, I feel that we should still have a fix that works with the lower versions while we handle the longer process of discussing the version bump, and actually enforcing it slowly and together with the community.
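The core of the bug fixed in this second hunk: after applying `pyarrow_filter`, the resulting `arrow_table` can span several record batches when the underlying DataFile is large, and the old code kept only `arrow_table.to_batches()[0]`, silently dropping every batch after the first. The following sketch models the effect in plain Python (batches modeled as lists of rows; no PyArrow dependency, names illustrative):

```python
# A filtered "table" that came back as three batches; each batch is a list of rows.
filtered_table = [[1, 2, 3], [4, 5], [6]]

def rows_old_behavior(table):
    # Old code path: batch = arrow_table.to_batches()[0]
    # Only the first batch is kept; later batches are silently dropped.
    return list(table[0])

def rows_fixed_behavior(table):
    # Fixed code path: iterate over every batch returned by to_batches()
    # and yield each one, so no rows are lost.
    out = []
    for batch in table:
        out.extend(batch)
    return out

print(len(rows_old_behavior(filtered_table)))    # 3 rows: data silently lost
print(len(rows_fixed_behavior(filtered_table)))  # 6 rows: all data preserved
```

This is also why the `output_batches = iter([batch])` assignments were added in the first hunk: whether or not positional deletes or a row filter were applied, the tail of the loop now always iterates over `output_batches` instead of assuming a single batch.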
Comment: @kevinjqliu moved this assignment here for readability.

Reply: makes sense, ty!