-
Couldn't load subscription status.
- Fork 0
Add incremental scan for appends and positional deletes #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has checked 363 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 296 | 2 | 65 | 0 |
Click to see the invalid file list
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
</details>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has checked 366 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 299 | 2 | 65 | 0 |
Click to see the invalid file list
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
</details>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has checked 366 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 299 | 2 | 65 | 0 |
Click to see the invalid file list
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
</details>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has checked 366 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 299 | 2 | 65 | 0 |
Click to see the invalid file list
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
</details>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has checked 366 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 299 | 2 | 65 | 0 |
Click to see the invalid file list
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
</details>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has checked 366 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 299 | 2 | 65 | 0 |
Click to see the invalid file list
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
</details>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has checked 366 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 299 | 2 | 65 | 0 |
Click to see the invalid file list
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
</details>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has checked 367 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 299 | 3 | 65 | 0 |
Click to see the invalid file list
- crates/iceberg/src/scan/incremental/tests.rs
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
</details>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Gerald, it's a big change. I went through everything, maybe the second pass I'll be able to digest the whole picture from different perspective. In the meanwhile, let me know what you think about the comments.
| .and_then(|st| st.delete_vectors.get(delete_file_path).cloned()) | ||
| } | ||
|
|
||
| pub(crate) fn with_read<F, G>(&self, f: F) -> Result<G> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the benefit of having this function (that has a callback) vs just doing everything inline? Seems like unnecessary optimization, and it's not a well-known pattern what with_read might mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allows public read access without changing the properties.
| /// Adds a `_file` column to the RecordBatch containing the file path. | ||
| /// Uses Run-End Encoding (RLE) for maximum memory efficiency when the same | ||
| /// file path is repeated across all rows. | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to allow dead_code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, otherwise clippy complains
| batch_size: Option<usize>, | ||
| ) -> Result<ArrowRecordBatchStream> { | ||
| let schema = Arc::new(ArrowSchema::new(vec![Field::new( | ||
| "pos", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what names we should give here. spec defines pos for positional deletes, but in this stream we're going to include rows from file deletes.
And if we're returning this, should we also add field ID according to the spec?
On the other hand, below when you add_file_path_column, it will be named _file, which is defined for data files, while positional deletes should have file_path for a name. Personally, I don't like the distinction, as it makes us handle these two streams differently (which is kinda already the case, but this could lead to off-by-one errors)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Field IDs are also different for positional deletes vs for metadata column on data file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestion what to do here, or leave it for the time being?
| "Failed to create RecordBatch for DeleteVector", | ||
| ) | ||
| }) | ||
| .and_then(|batch| ArrowReader::add_file_path_column(batch, &file_path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh so this is the only invocation? I don't think then that this PR handles Jira ticket for adding file path. That ticket should be closed only when we add the ability to add file paths to appends stream too, as well as to the regular (non-incremental) table scan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah thought about that, that's why I did not put it into the ticket description. We still need builder methods for this one, and I haven't implemented them here.
Closes RAI-43289.
Closes RAI-43292
Closes RAI-43688
Incremental Scan Implementation
Summary
This PR introduces Incremental Scan functionality to the Iceberg Rust implementation, enabling efficient querying of changes between table snapshots. Incremental scans return the net changes (appends and deletes) between two snapshots, which is essential for incremental data processing workflows, change data capture (CDC), and efficient data pipeline operations.
Key Features
Incremental Scan API
IncrementalScanbuilder with fluent API for configuring scans between snapshots.select()for efficient data retrieval.with_batch_size()for memory optimizationFile Path Tracking
_filecolumn to all delete record batches containing the source parquet file pathRunEndEncodedarrays for memory-efficient file path storage in non-empty batches-2048)Net Change Computation
Implementation Details
Core Components
IncrementalScanBuilder (scan/incremental/mod.rs)ArrowReaderinfrastructureStreaming Implementation (arrow/incremental.rs)
StreamsIntotrait with.stream()method for converting scan tasks to Arrow record streamsFile Path Column Addition (arrow/reader.rs)
add_file_path_column()function adds_filecolumn to record batchesRestrictions
Testing
Comprehensive test suite added in scan/incremental/tests.rs:
Test Fixture
IncrementalTestFixture- Helper for creating test tables with controlled snapshotsAddoperations with custom file names and dataDeleteoperations with position and file trackingverify_incremental_scan()for asserting expected resultsTest Coverage
test_incremental_fixture_simple- Basic append and delete operationstest_incremental_fixture_complex- Multiple snapshots with overlapping operationstest_incremental_scan_edge_cases- Edge cases across 7 snapshots and 3 data filestest_incremental_scan_builder_options- Builder API functionality.select())test_add_file_path_column- Unit tests for file path column additionAll tests passing: ✅ 4 incremental scan tests, ✅ 3 file path column tests
API Example