Conversation

@gbrgr gbrgr commented Nov 4, 2025

Which issue does this PR close?

What changes are included in this PR?

Integrates virtual field handling for the _file metadata column into RecordBatchTransformer using a pre-computed constants map, eliminating post-processing and duplicate lookups.

Key Changes

New metadata_columns.rs module: Centralized utilities for metadata columns

  • Constants: RESERVED_FIELD_ID_FILE, RESERVED_COL_NAME_FILE
  • Helper functions: get_metadata_column_name(), get_metadata_field_id(), is_metadata_field(), is_metadata_column_name()

Enhanced RecordBatchTransformer:

  • Added constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)> - pre-computed during initialization
  • New with_constant() method - computes Arrow type once during setup
  • Updated to use pre-computed types and values (avoids duplicate lookups)
  • Handles DataType::RunEndEncoded for constant strings (memory efficient)

Simplified reader.rs:

  • Pass full project_field_ids (including virtual) to RecordBatchTransformer
  • Single with_constant() call to register _file column
  • Removed post-processing loop

Updated scan/mod.rs:

  • Use is_metadata_column_name() and get_metadata_field_id() instead of hardcoded checks
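The constants-map design described above can be sketched as follows. This is a simplified, self-contained illustration of the idea, not the PR's actual code: the `DataType` and `PrimitiveLiteral` enums are stand-ins for the real Arrow and Iceberg types, and the type derivation inside `with_constant()` is deliberately minimal.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real Arrow DataType and Iceberg PrimitiveLiteral.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
enum PrimitiveLiteral {
    String(String),
    Long(i64),
}

// The _file column's reserved field ID per the Iceberg spec.
const RESERVED_FIELD_ID_FILE: i32 = 2147483646;

struct RecordBatchTransformer {
    // field_id -> (arrow_type, value), pre-computed at construction so that
    // per-batch processing avoids repeated lookups and type conversions.
    constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)>,
}

impl RecordBatchTransformer {
    fn new() -> Self {
        Self { constant_fields: HashMap::new() }
    }

    // Registers a constant field; the type is derived once here, not per batch.
    fn with_constant(mut self, field_id: i32, value: PrimitiveLiteral) -> Self {
        let ty = match &value {
            PrimitiveLiteral::String(_) => DataType::Utf8,
            PrimitiveLiteral::Long(_) => DataType::Int64,
        };
        self.constant_fields.insert(field_id, (ty, value));
        self
    }
}

fn main() {
    let t = RecordBatchTransformer::new().with_constant(
        RESERVED_FIELD_ID_FILE,
        PrimitiveLiteral::String("s3://bucket/data/file.parquet".to_string()),
    );
    assert_eq!(t.constant_fields[&RESERVED_FIELD_ID_FILE].0, DataType::Utf8);
    println!("registered {} constant field(s)", t.constant_fields.len());
}
```

With this shape, the reader registers each metadata or partition constant once, and batch processing only does a `HashMap` lookup per field.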

Are these changes tested?

Yes, comprehensive tests have been added to verify the functionality:

New Tests: Table Scan API (7 tests added)

  1. test_select_with_file_column - Verifies basic functionality of selecting _file with regular columns
  2. test_select_file_column_position - Verifies column ordering is preserved
  3. test_select_file_column_only - Tests selecting only the _file column
  4. test_file_column_with_multiple_files - Tests multiple data files scenario
  5. test_file_column_at_start - Tests _file at position 0
  6. test_file_column_at_end - Tests _file at the last position
  7. test_select_with_repeated_column_names - Tests repeated column selection

/// // Select regular columns along with the file path
/// let scan = table
///     .scan()
///     .select(["id", "name", RESERVED_COL_NAME_FILE])

How do we ask for the _file column without having to explicitly list all the other columns? E.g., get me all columns plus _file. There should be some shortcut for this.

@gbrgr gbrgr changed the title Add support for _file column feat(core): Add support for _file column Nov 4, 2025
@gbrgr gbrgr marked this pull request as ready for review November 4, 2025 14:32
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @gbrgr for this PR. But I think we need to rethink how to compute the _file and _pos metadata columns. While it's somewhat trivial to compute _file, it's non-trivial to compute _pos efficiently, since when we read Parquet files we may have filtered out some row groups. I think the best way is to push reading these two columns down to arrow-rs.

pub(crate) const RESERVED_FIELD_ID_FILE: i32 = 2147483646;

/// Column name for the file path metadata column per Iceberg spec
pub(crate) const RESERVED_COL_NAME_FILE: &str = "_file";
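For context, 2147483646 is the second-highest `i32` value: the Iceberg spec carves metadata-column field IDs out of the top of the 32-bit ID range, with _file at `Integer.MAX_VALUE - 1`. A quick sanity check:

```rust
fn main() {
    // Reserved field ID for the _file metadata column per the Iceberg spec.
    const RESERVED_FIELD_ID_FILE: i32 = 2147483646;
    assert_eq!(RESERVED_FIELD_ID_FILE, i32::MAX - 1);
    println!("_file reserved id = {}", RESERVED_FIELD_ID_FILE);
}
```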
@vustef vustef commented Nov 6, 2025

Thanks @gbrgr for this PR. But I think we need to rethink how to compute the _file and _pos metadata columns. While it's somewhat trivial to compute _file, it's non-trivial to compute _pos efficiently, since when we read Parquet files we may have filtered out some row groups. I think the best way is to push reading these two columns down to arrow-rs.

@liurenjie1024 I agree for _pos, and we have a PR there: apache/arrow-rs#8715
But _file seems like something that we don't need arrow-rs to know about. Similarly, in the future, for _row_id from the V3 spec, we cannot expect arrow-rs to be responsible for computing that one.

How do we go forward with rethinking this, what would be the action items for us?

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @gbrgr for this PR, I left some comments for improvement.

/// # Ok(())
/// # }
/// ```
pub const RESERVED_COL_NAME_FILE: &str = RESERVED_COL_NAME_FILE_INTERNAL;
Contributor

We will have more metadata columns, so I would prefer to put these definitions in something like a metadata_columns module.

Author

Added a new module

if let Some(column_names) = self.column_names.as_ref() {
    for column_name in column_names {
        // Skip reserved columns that don't exist in the schema
        if column_name == RESERVED_COL_NAME_FILE_INTERNAL {
Contributor

We should have something like is_metadata_column_name() in the metadata_columns module, and use is_metadata_column_name() so that we can avoid such changes when we add more metadata columns.
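Such a helper could be as small as the sketch below. The names follow the reviewer's suggestion; `_pos` is included only as a hypothetical future addition to show how the set would grow:

```rust
/// Reserved metadata column names per the Iceberg spec.
const RESERVED_COL_NAME_FILE: &str = "_file";
const RESERVED_COL_NAME_POS: &str = "_pos"; // hypothetical future addition

/// Returns true if `name` refers to a reserved metadata column, so call
/// sites don't have to hardcode checks against individual column names.
fn is_metadata_column_name(name: &str) -> bool {
    matches!(name, RESERVED_COL_NAME_FILE | RESERVED_COL_NAME_POS)
}

fn main() {
    assert!(is_metadata_column_name("_file"));
    assert!(!is_metadata_column_name("id"));
    println!("ok");
}
```

Adding another metadata column then touches only this module, not every call site.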

Author

Done in the new module


/// Helper function to add a `_file` column to a RecordBatch at a specific position.
/// Takes the array, field to add, and position where to insert.
fn create_file_field_at_position(
Contributor

I think this approach is not extensible. I'd prefer something like the following in this PR:

  1. Add a constant_map to ArrowReader
  2. Add another variant of RecordBatchTransformer to handle constant fields like _file

Author

I sketched an approach in the record batch transformer: the transformer just stores a constant_map, which can be populated by the reader.

@liurenjie1024 (Contributor)

Thanks @gbrgr for this PR. But I think we need to rethink how to compute the _file and _pos metadata columns. While it's somewhat trivial to compute _file, it's non-trivial to compute _pos efficiently, since when we read Parquet files we may have filtered out some row groups. I think the best way is to push reading these two columns down to arrow-rs.

@liurenjie1024 I agree for _pos, and we have a PR there: apache/arrow-rs#8715. But _file seems like something that we don't need arrow-rs to know about. Similarly, in the future, for _row_id from the V3 spec, we cannot expect arrow-rs to be responsible for computing that one.

How do we go forward with rethinking this, what would be the action items for us?

Hi, @vustef I also agree that we should put _file in iceberg-rust, and I left some comments about how to proceed.

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @gbrgr for this PR, generally LGTM! Please help to resolve conflicts.

projected_iceberg_field_ids: Vec<i32>,
// Pre-computed constant field information: field_id -> (arrow_type, value)
// Avoids duplicate lookups and type conversions during batch processing
constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)>,
Contributor

We have Datum type exactly for DataType + PrimitiveLiteral.

Author

I think Datum is rather PrimitiveType + PrimitiveLiteral, but here we have the Arrow DataType.

Contributor

What's the benefit of using arrow's DataType here?

use crate::{Error, ErrorKind, Result};

/// Reserved field ID for the file path (_file) column per Iceberg spec
pub const RESERVED_FIELD_ID_FILE: i32 = 2147483646;
Contributor

Please help to create an issue for porting all fields of MetadataColumns here.


/// Reserved column name for the file path metadata column
pub const RESERVED_COL_NAME_FILE: &str = "_file";

Contributor

Please create a lazy field for FILE_PATH; you can take an example here:

static STATUS: Lazy<NestedFieldRef> = {

Also, please don't expose the static field directly; use a method to expose the field reference.

Author

I created an Arrow lazy field (not an Iceberg field) in metadata_columns.rs

Contributor

We should not create an Arrow field. Please remember that iceberg tries to be engine-independent, and such core abstractions should use iceberg's own abstractions.

        let vals: Vec<Option<f64>> = vec![None; num_rows];
        Arc::new(Float64Array::from(vals))
    }
    (DataType::RunEndEncoded(_, _), Some(PrimitiveLiteral::String(value))) => {
Contributor

Should we in general encode constant columns as REE? Or should we make this custom per field? For the file path it definitely makes sense to run-end encode.

+1. I can't come up with a reason why we don't do this.
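For intuition on why REE pays off here: a constant column of N rows collapses to a single run, so memory stays O(1) in the row count. The sketch below is a dependency-free illustration of the encoding idea; the PR itself uses arrow-rs's `RunArray`, and this struct is only for illustration:

```rust
// Illustrative run-end encoded column: each run stores its (exclusive) end
// index and one value; a constant column is therefore a single run.
struct RunEndEncoded<T> {
    run_ends: Vec<i32>,
    values: Vec<T>,
}

impl<T> RunEndEncoded<T> {
    // A constant column of `num_rows` rows needs only one run and one value.
    fn constant(value: T, num_rows: i32) -> Self {
        Self { run_ends: vec![num_rows], values: vec![value] }
    }

    // Logical value at row `i`: the first run whose end index exceeds `i`.
    fn get(&self, i: i32) -> &T {
        let run = self.run_ends.partition_point(|&end| end <= i);
        &self.values[run]
    }
}

fn main() {
    // One stored string backs a million logical rows of the _file column.
    let col = RunEndEncoded::constant("s3://bucket/data/f.parquet", 1_000_000);
    assert_eq!(col.values.len(), 1);
    assert_eq!(*col.get(999_999), "s3://bucket/data/f.parquet");
    println!("ok");
}
```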

Author

@gbrgr gbrgr commented Nov 14, 2025

@liurenjie1024 I now resolved the merge conflicts that stem from PR #1821:

  • I removed the partition information stored in the RecordBatchTransformer and instead generically store constant fields in it, which can be used to add column sources.
  • I changed all added columns to be REE-encoded, as they are all constant values.

@gbrgr gbrgr requested a review from liurenjie1024 November 17, 2025 08:18
// Helper to create REE type with the given values type
// Note: values field is nullable as Arrow expects this when building the
// final Arrow schema with `RunArray::try_new`.
let make_ree = |values_type: DataType| -> DataType {

I'd limit REEs to this method only; the others are not really related to this PR.

Author

Done

Author

@gbrgr gbrgr commented Nov 17, 2025

@vustef I changed the PR so that default values are not REE-encoded anymore; only constant fields that come from added metadata fields + partition data are.

@liurenjie1024 let us know whether that is OK. If REE is desired in general for all constant columns, I guess it is better to make a follow-up PR to keep changesets smaller.

@vustef vustef commented Nov 24, 2025

@liurenjie1024 just a friendly ping on this for the new round of your feedback, if you have time.



/// The Arrow Field definition for the metadata column, or an error if not a metadata field
pub fn get_metadata_field(field_id: i32) -> Result<Arc<Field>> {
    match field_id {
        RESERVED_FIELD_ID_FILE => Ok(Arc::clone(file_field())),
Contributor

This is fine for now, but I'm thinking maybe a better approach is to have a static map for this so that we don't need to repeat such pattern matching in different places.
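A sketch of that static-map idea using only the standard library (`OnceLock`, stable since Rust 1.70). The `String` value here is a stand-in for the real `Arc<Field>` definition, and the function names mirror the PR's but are otherwise illustrative:

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

const RESERVED_FIELD_ID_FILE: i32 = 2147483646;

// One lazily-initialized table instead of repeating `match field_id { ... }`
// pattern matching at every call site; new metadata columns are added here.
fn metadata_fields() -> &'static HashMap<i32, String> {
    static MAP: OnceLock<HashMap<i32, String>> = OnceLock::new();
    MAP.get_or_init(|| {
        let mut m = HashMap::new();
        m.insert(RESERVED_FIELD_ID_FILE, "_file".to_string());
        m
    })
}

fn get_metadata_field(field_id: i32) -> Option<&'static String> {
    metadata_fields().get(&field_id)
}

fn main() {
    assert_eq!(get_metadata_field(RESERVED_FIELD_ID_FILE).unwrap(), "_file");
    assert!(get_metadata_field(1).is_none());
    println!("ok");
}
```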

/// # Ok(())
/// # }
/// ```
pub fn with_file_column(mut self) -> Self {
Contributor

@liurenjie1024 liurenjie1024 Nov 26, 2025

It's odd to add this function; users could just select("_file").


Gerald was saying the same, and we test both. But I suggested this because without this function users need to list all the other columns they want to query, while this is like select *, _file. Let us know if we should remove it, though.

Contributor

I don't think this is a good addition. In general, the scan API is a low-level API; users could provide other easy wrappers around it.

projected_iceberg_field_ids: Vec<i32>,
// Pre-computed constant field information: field_id -> (arrow_type, value)
// Avoids duplicate lookups and type conversions during batch processing
constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)>,
Contributor

What's the benefit of using arrow's DataType here?

/// * `field_id` - The field ID to associate with the constant
/// * `value` - The constant value for this field
pub(crate) fn with_constant(mut self, field_id: i32, value: PrimitiveLiteral) -> Result<Self> {
    let arrow_type = RecordBatchTransformer::primitive_literal_to_arrow_type(&value)?;
Contributor

The primitive_literal_to_arrow_type doesn't make sense to me: PrimitiveLiteral is designed to be a literal without a type, and we should not guess its type.
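The reviewer's concern can be made concrete: in iceberg-rust, a timestamp and a plain long both carry a `PrimitiveLiteral::Long` payload, so the literal alone cannot determine a single Arrow type; only the Iceberg type (as in Datum = PrimitiveType + PrimitiveLiteral) disambiguates. A simplified illustration with stand-in enums, not the real iceberg/arrow types:

```rust
// Stand-ins for Iceberg's typeless literal and Arrow's logical types.
#[derive(Debug)]
enum PrimitiveLiteral {
    Long(i64),
}

#[derive(Debug, PartialEq)]
enum DataType {
    Int64,
    TimestampMicrosecond,
}

// Arrow types a bare Long literal could legitimately map to: an i64 payload
// may be a plain long or a microsecond timestamp, among other possibilities.
fn possible_arrow_types(lit: &PrimitiveLiteral) -> Vec<DataType> {
    match lit {
        PrimitiveLiteral::Long(_) => vec![DataType::Int64, DataType::TimestampMicrosecond],
    }
}

fn main() {
    let lit = PrimitiveLiteral::Long(1_700_000_000_000_000);
    let candidates = possible_arrow_types(&lit);
    // More than one candidate: deriving a type from the literal is guesswork.
    assert!(candidates.len() > 1);
    println!("{:?} could be any of {:?}", lit, candidates);
}
```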


// Add partition constants to constant_fields (compute REE types from literals)
for (field_id, value) in partition_constants {
    let arrow_type = RecordBatchTransformer::primitive_literal_to_arrow_type(&value)?;
Contributor

Ditto.

.ok_or(Error::new(ErrorKind::Unexpected, "field not found"))?
.0
.clone())
// Check if this is a constant field (virtual or partition)
Contributor

Suggested change:

- // Check if this is a constant field (virtual or partition)
+ // Check if this is a constant field

Successfully merging this pull request may close these issues:

Support for _file metadata column