-
Notifications
You must be signed in to change notification settings - Fork 0
Port addition of _file column #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…erg-rust into feature/gb/file-column
…erg-rust into feature/gb/file-column
This reverts commit adf0da0.
…erg-rust into feature/gb/file-column
| // The _file column will be a RunEndEncoded array with the file path | ||
| if let Some(run_array) = file_col | ||
| .as_any() | ||
| .downcast_ref::<arrow_array::RunArray<arrow_array::types::Int32Type>>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps just downcast to the int array and iterate through values, instead of having to separate run array values from run array run ends
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also shouldn't we not have REE here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the not REE stuff: No, I think in a single PR, we should have REE, and if we want to disable it we should have a separate one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, ok, so you'd follow up with that right after this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
| { | ||
| record_batch_transformer_builder = | ||
| record_batch_transformer_builder.with_partition(partition_spec, partition_data); | ||
| record_batch_transformer_builder.with_partition(partition_spec, partition_data)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want to do this for incremental scan as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not have partition information (yet? not sure if it is needed at all) in the incremental scan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we get wrong results because of it? If table had partitions and some partition transforms for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, this partition stuff has only been added recently. We may just add the same logic to the incremental tasks, but we first need to understand what's the actual issue
|
|
||
| for column_name in column_names.iter() { | ||
| // Handle metadata columns (like "_file") | ||
| if is_metadata_column_name(column_name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it already shows that it's a burden to have to maintain same features in two places (not only code but also tests)...at some point we should maybe reconsider how to consolidate this
| /// The Arrow Field definition for the metadata column, or an error if not a metadata field | ||
| pub fn get_metadata_field(field_id: i32) -> Result<Arc<Field>> { | ||
| match field_id { | ||
| RESERVED_FIELD_ID_FILE => Ok(Arc::clone(file_field())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about other fields? Also should they have static lazy field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on what the use of these fields is. Since this is a constant arrow field we can create it. But I leave the others until they are actually used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should include it for the sake of #10 (comment). Otherwise it's really tactile and error-prone
| pub fn get_metadata_field(field_id: i32) -> Result<Arc<Field>> { | ||
| match field_id { | ||
| RESERVED_FIELD_ID_FILE => Ok(Arc::clone(file_field())), | ||
| _ if is_metadata_field(field_id) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we include other fields as suggested in the comment above, this will never fire. I think we have to extend is_metadata_field to include all metadata fields (a range), and that way we would throw this proper error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I left implementation of the other fields out intentionally, hence this fires. The functions here do not have to return actual fields for all constants, rather metadata_columns.rs is in general a module for collecting all sorts of constants/helpers. It is for example not clear whether you would ever want a field for _pos, because this comes from the arrow reader. However, we still want to collect the constants here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked offline and agreed to remove non used constants, and for those that are used we should introduce Lazy Fields. Then we can address this comment I think.
…iceberg-rust into feature/gb/file-column-inc
vustef
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Gerald.
I'm not particularly clear on metadata column fields listing, but otherwise looks good. We should follow up with disabling REE right after this PR though, unless you want to pick Arrow.jl battle to support REE there.
Closes RAI-43688
Closes RAI-43694
Upstream PR is here: apache#1824