
Bug Fix: Position Deletes + row_filter yields less data when the DataFile is large #1141

Open · wants to merge 5 commits into main
Conversation

@sungwy (Collaborator) commented Sep 6, 2024

Fixes: #1132

The culprit was the awkward zero-copy round trip between RecordBatch and Table used to apply the pyarrow_filter:

    if pyarrow_filter is not None:
        # we need to switch back and forth between RecordBatch and Table
        # as Expression filter isn't yet supported in RecordBatch
        # https://github.com/apache/arrow/issues/39220
        arrow_table = pa.Table.from_batches([batch])
        arrow_table = arrow_table.filter(pyarrow_filter)
        if len(arrow_table) == 0:
            continue
        batch = arrow_table.to_batches()[0]

As noted in the inline comment, the round trip is necessary because filtering a RecordBatch with an Expression isn't exposed in older PyArrow versions (it is in 17.0.0, but requiring that would push the minimum PyArrow version much higher).

When a single RecordBatch is converted to a Table and back, Arrow may automatically chunk the Table into multiple RecordBatches. Because the code above takes only the first batch, the remaining RecordBatches were lost from the returned output. In short, there is no guarantee that a single RecordBatch stays a single RecordBatch across a round-trip conversion to an Arrow Table and back.
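
For illustration, here is a minimal standalone sketch of the failure mode (the column name n, the row count, and the trivial filter are made up for the example; whether Arrow actually splits the filtered Table into more than one batch depends on the data size and Arrow internals, which is exactly why this only surfaced with large DataFiles):

    import pyarrow as pa
    import pyarrow.compute as pc

    # One large RecordBatch, round-tripped through a Table to apply an Expression
    # filter -- the same pattern as the snippet above.
    batch = pa.RecordBatch.from_pydict({"n": list(range(1_000_000))})
    table = pa.Table.from_batches([batch]).filter(pc.field("n") >= 0)

    batches = table.to_batches()  # may contain MORE than one RecordBatch

    rows_kept = batches[0].num_rows                # buggy: only the first chunk
    rows_total = sum(b.num_rows for b in batches)  # correct: every chunk

    assert rows_total == table.num_rows
    # rows_kept == rows_total only holds when to_batches() returns a single batch.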

The new test covers a case where the amount of data in a DataFile is large enough that the Arrow Table yields multiple RecordBatches.

@sungwy marked this pull request as ready for review September 13, 2024 22:23
@sungwy requested a review from Fokko September 13, 2024 22:27
@sungwy changed the title from "add more tests for position deletes" to "Bug Fix: Position Deletes + row_filter yields less data when the DataFile is large" Sep 13, 2024
@kevinjqliu (Contributor) commented:
When a single RecordBatch is converted to a Table and back, Arrow may automatically chunk the Table into multiple RecordBatches ... there is no guarantee that a single RecordBatch stays a single RecordBatch across a round-trip conversion to an Arrow Table and back.

This is specifically about this piece of code?

     batch = arrow_table.to_batches()[0] 

@kevinjqliu (Contributor) left a comment:

LGTM! Thanks for taking the time to find this bug.

Comment on lines 1257 to 1258

    else:
        output_batches = iter([batch])
Contributor:

nit: redundant with output_batches already initialized to the same value

Collaborator (Author):

It is unfortunately different because of line 1245:

                batch = batch.take(indices)

Contributor:

Ah, didn't see that. Do you mind adding a comment here to mention the possible batch mutation? I feel like this might be an easy footgun later.

Collaborator (Author):

Moved the variable assignment to L1246 for readability

Comment on lines 1242 to 1248

    if positional_deletes:
        # Create the mask of indices that we're interested in
        indices = _combine_positional_deletes(positional_deletes, current_index, current_index + len(batch))
        batch = batch.take(indices)

    # Apply the user filter
    if pyarrow_filter is not None:
Contributor:

a little unrelated to this PR, but I wonder if there's ever a case where we need to filter (with pyarrow_filter) but positional_deletes is None/empty/falsey.
I can see that as a potential issue

Collaborator (Author):

Great question! That's already handled here:

    fragment_scanner = ds.Scanner.from_fragment(
        fragment=fragment,
        # With PyArrow 16.0.0 there is an issue with casting record-batches:
        # https://github.com/apache/arrow/issues/41884
        # https://github.com/apache/arrow/issues/43183
        # Would be good to remove this later on
        schema=_pyarrow_schema_ensure_large_types(physical_schema)
        if use_large_types
        else (_pyarrow_schema_ensure_small_types(physical_schema)),
        # This will push down the query to Arrow.
        # But in case there are positional deletes, we have to apply them first
        filter=pyarrow_filter if not positional_deletes else None,
        columns=[col.name for col in file_project_schema.columns],
    )

Contributor:

Thanks! So if there are positional_deletes, the filter is applied manually to the PyArrow table; if there are no positional_deletes, the filter is pushed down to the scanner.

Arrow 17 should make this a lot cleaner.
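
(For reference, a rough sketch of what that filter step might collapse to once the minimum PyArrow version is 17.0.0, where apache/arrow#39220 exposes Expression filters on RecordBatch directly; this is an assumption about a future refactor, not code from this PR:)

    # Inside the loop over the fragment's RecordBatches, PyArrow >= 17 only:
    if pyarrow_filter is not None:
        # Filter the RecordBatch directly -- no Table round trip, no re-chunking
        batch = batch.filter(pyarrow_filter)
        if batch.num_rows == 0:
            continue
    output_batches = iter([batch])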

        output_batches = arrow_table.to_batches()
    else:
        output_batches = iter([batch])
    for output_batch in output_batches:
Contributor:

I'm a bit concerned about all the nested for-loops in this function. But correctness comes first and we can always refactor later.

Collaborator (Author):

You are not alone 😨

I think the long-term solution is to upgrade our minimum PyArrow requirement to 17.0.0, but as you said, we should still have a fix that works with lower versions while we work through the longer process of discussing the version bump and rolling it out gradually with the community.

@sungwy (Collaborator, Author) commented Sep 13, 2024

When a single RecordBatch is converted to a Table and back, Arrow may automatically chunk the Table into multiple RecordBatches ... there is no guarantee that a single RecordBatch stays a single RecordBatch across a round-trip conversion to an Arrow Table and back.

This is specifically about this piece of code?

     batch = arrow_table.to_batches()[0] 

Yes, that's right. It was hard to understand and test this behavior until we found users with data large enough to actually run into the issue.

In retrospect, assuming the result will always have length 1 was erroneous, but it wasn't easy to foresee that a single RecordBatch could yield multiple RecordBatches in a zero-copy round-trip conversion to an Arrow Table and back.
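
Putting the excerpts in this thread together, the fixed per-batch handling looks roughly like the sketch below. It is assembled from the snippets quoted in this PR rather than copied from the merged diff, and it assumes it runs inside the loop over a fragment's RecordBatches (hence the continue):

    if positional_deletes:
        # Create the mask of indices that we're interested in
        indices = _combine_positional_deletes(positional_deletes, current_index, current_index + len(batch))
        batch = batch.take(indices)

    # Default: pass the (possibly trimmed) batch through as-is
    output_batches = iter([batch])

    # Apply the user filter
    if pyarrow_filter is not None:
        # Expression filters aren't supported on RecordBatch before PyArrow 17,
        # so round-trip through a Table -- and keep *all* resulting batches,
        # since the filtered Table may come back chunked into several RecordBatches.
        arrow_table = pa.Table.from_batches([batch])
        arrow_table = arrow_table.filter(pyarrow_filter)
        if len(arrow_table) == 0:
            continue
        output_batches = arrow_table.to_batches()

    for output_batch in output_batches:
        # Project each surviving batch to the requested schema and yield it
        ...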

    if positional_deletes:
        # Create the mask of indices that we're interested in
        indices = _combine_positional_deletes(positional_deletes, current_index, current_index + len(batch))
        batch = batch.take(indices)
    output_batches = iter([batch])
Collaborator (Author):

@kevinjqliu moved this assignment here for readability

Contributor:

makes sense, ty!

Successfully merging this pull request may close: Inconsistent row count across versions.