
feat: metadata columns #14057

Open · wants to merge 7 commits into main

Conversation

chenkovsky (Author)

Which issue does this PR close?

Closes #13975.

Rationale for this change

Many databases support pseudo columns, for example file_path, file_name, file_size, and rowid. We don't want to return pseudo columns by default, but we do want to be able to select them explicitly.

In a database that supports rowid, select * from tb won't return rowid, but select rowid, * from tb will. Spark already supports metadata columns; this PR adds support for them to DataFusion.
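A hedged sketch of the intended behavior (the table tb and its ordinary columns id and name are hypothetical; only the rowid semantics come from this PR):

```rust
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

// Assumes a table "tb" has already been registered with `ctx`, backed by a
// provider that exposes a `rowid` metadata column (see the provider sketch
// below); `id` and `name` are made-up ordinary columns.
async fn demo(ctx: &SessionContext) -> Result<()> {
    // Wildcard expansion skips metadata columns ...
    let df = ctx.sql("SELECT * FROM tb").await?; // -> id, name
    df.show().await?;

    // ... but they can be referenced explicitly:
    let df = ctx.sql("SELECT rowid, * FROM tb").await?; // -> rowid, id, name
    df.show().await
}
```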

What changes are included in this PR?

  • Add an API to TableProvider that returns the metadata column schema (a provider-side sketch follows below).
  • Change DFSchema to carry metadata columns.
  • Change logical plans (e.g. TableScan) to support them.
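A minimal sketch of the provider side under the new API (the type MyTable and its fields are hypothetical; metadata_columns is the method this PR adds to TableProvider, and scan is elided):

```rust
use std::any::Any;
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use datafusion::catalog::Session;
use datafusion::datasource::{TableProvider, TableType};
use datafusion::error::Result;
use datafusion::logical_expr::Expr;
use datafusion::physical_plan::ExecutionPlan;

#[derive(Debug)]
struct MyTable {
    schema: SchemaRef, // ordinary columns, e.g. id, name
}

#[async_trait]
impl TableProvider for MyTable {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        self.schema.clone()
    }

    // New in this PR: pseudo columns hidden from `SELECT *` but
    // selectable explicitly.
    fn metadata_columns(&self) -> Option<SchemaRef> {
        Some(Arc::new(Schema::new(vec![Field::new(
            "rowid",
            DataType::UInt64,
            false,
        )])))
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    async fn scan(
        &self,
        _state: &dyn Session,
        _projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        _limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        unimplemented!("scan elided in this sketch")
    }
}
```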

Are these changes tested?

A unit test is added.

Are there any user-facing changes?

No.

For the FFI table provider API, one function that returns the metadata column schema is added.

return metadata.qualified_field(i - self.inner.len());
}
}
self.inner.qualified_field(i)
Contributor

Is it better not to mix inner fields and meta fields?

Maybe we need another method, meta_field(&self, i: usize)

Author

Actually, implementing another method was my first attempt, but I found that I would need to change a lot of code, because column indexes are used everywhere. That's why, in the current implementation, a metadata column's index is its position plus len(fields).
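For reference, a sketch reconstructing that scheme from the snippet quoted at the top of this thread (surrounding types elided): a metadata column at position j is addressed as inner.len() + j, so one usize index space covers both schemas.

```rust
// Reconstruction of the lookup shown above; `self.inner` is the ordinary
// schema and `self.metadata` the optional metadata schema.
fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) {
    if i >= self.inner.len() {
        if let Some(metadata) = &self.metadata {
            // an index beyond the inner fields addresses the metadata schema
            return metadata.qualified_field(i - self.inner.len());
        }
    }
    self.inner.qualified_field(i)
}
```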

@jayzhan211 (Contributor) · Jan 11, 2025

Isn't it only where you need meta columns that you need to change the code to use meta_field? Other code that calls field would remain the same.

The downside of the current approach is that whenever the schema changes, the indexes of the meta columns need to be adjusted too. I think this is error prone. Minimizing the dependency between the meta schema and the schema would be better.

@chenkovsky (Author) · Jan 11, 2025

I see, it's error prone. Can we change the offsets of metadata columns to e.g. (-1 as usize), (-2 as usize)? Then there's no such problem; I've seen some databases use this trick.

> Isn't it only where you need meta columns that you need to change the code to use meta_field? Other code that calls field would remain the same.

Yes, we can, but many APIs use a Vec to represent columns. I would have to change many structs and method definitions to pass extra parameters.

Contributor

(-1 as usize): how does this large offset work? We have a vector, not a map.

Author

Hi @jayzhan211, I pushed a commit, could you please review it again?

Contributor

Okay, this approach looks good to me.
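For readers following along, a self-contained sketch of how the large offset answers the question above (METADATA_OFFSET is the constant added in this PR; the decode helper is illustrative, not from the diff):

```rust
// Ordinary columns live in [0, METADATA_OFFSET); a metadata column j is
// encoded as METADATA_OFFSET + j. Since no real schema has anywhere near
// usize::MAX >> 1 columns, the two ranges cannot collide, and metadata
// indexes no longer shift when the inner schema changes.
pub const METADATA_OFFSET: usize = usize::MAX >> 1;

/// Split an encoded column index into (is_metadata, position).
fn decode(i: usize) -> (bool, usize) {
    if i >= METADATA_OFFSET {
        (true, i - METADATA_OFFSET)
    } else {
        (false, i)
    }
}

fn main() {
    assert_eq!(decode(0), (false, 0));
    assert_eq!(decode(METADATA_OFFSET + 2), (true, 2));
}
```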

.collect()
let mut fields: Vec<&Field> = self.inner.fields_with_unqualified_name(name);
if let Some(schema) = self.metadata_schema() {
fields.append(&mut schema.fields_with_unqualified_name(name));
Contributor

Suggested change:
- fields.append(&mut schema.fields_with_unqualified_name(name));
+ fields.append(schema.fields_with_unqualified_name(name));

let mut fields: Vec<(Option<&TableReference>, &Field)> =
self.inner.qualified_fields_with_unqualified_name(name);
if let Some(schema) = self.metadata_schema() {
fields.append(&mut schema.qualified_fields_with_unqualified_name(name));
Contributor

Suggested change:
- fields.append(&mut schema.qualified_fields_with_unqualified_name(name));
+ fields.append(schema.qualified_fields_with_unqualified_name(name));

return (
Some(table_name.clone()),
Arc::new(
metadata.field(*i - METADATA_OFFSET).clone(),
Contributor

Handle the case where i < METADATA_OFFSET.
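One possible shape of the guard the reviewer is asking for (a sketch; names follow the snippet above):

```rust
// Only treat `i` as a metadata index when it carries the offset;
// otherwise fall through to the ordinary field lookup.
if *i >= METADATA_OFFSET {
    return (
        Some(table_name.clone()),
        Arc::new(metadata.field(*i - METADATA_OFFSET).clone()),
    );
}
```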

@jayzhan211 (Contributor) left a comment

Great, let's wait for others to review this.

@alamb (Contributor) left a comment

Thank you @chenkovsky and @jayzhan211 -- this is a neat feature and I think it has also been asked for before 💯

Also, I think the code is well structured and tested.

Before we merge this PR I think we need:

  1. a test for more than one metadata column
  2. to ensure this doesn't slow down planning (I will run benchmarks and report back)

I would also strongly recommend the following in this PR (but it could be done as a follow-on):

  1. More documentation (to help others and our future selves use it)
  2. Changing the tests to use assert_batches_eq

&self.inner.schema
}

pub fn with_metadata_schema(
Contributor

Can we please document these APIs?

@@ -55,6 +55,11 @@ pub trait TableProvider: Debug + Sync + Send {
/// Get a reference to the schema for this table
fn schema(&self) -> SchemaRef;

/// Get metadata columns of this table.
fn metadata_columns(&self) -> Option<SchemaRef> {
Contributor

Can you please document this better -- specifically:

  1. A link to the prior art (Spark metadata columns)
  2. A brief summary of what metadata columns are used for and an example (you can copy the content from the Spark docs)
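A doc comment along the lines requested might look like this (the wording is illustrative, not from the PR):

```rust
/// Returns the schema of this table's metadata (pseudo) columns, if any.
///
/// Metadata columns such as file_path, file_size, or rowid are not part
/// of the normal table schema: SELECT * FROM t does not return them, but
/// they can be referenced explicitly, e.g. SELECT rowid, * FROM t. Spark
/// exposes a similar concept as hidden file-metadata columns.
///
/// Returns None (the default) if the table has no metadata columns.
fn metadata_columns(&self) -> Option<SchemaRef> {
    None
}
```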

metadata: Option<QualifiedSchema>,
}

pub const METADATA_OFFSET: usize = usize::MAX >> 1;
Contributor

Can you please document what this is and how it relates to DFSchema::inner

inner: QualifiedSchema,
/// Stores functional dependencies in the schema.
functional_dependencies: FunctionalDependencies,
/// metadata columns
Contributor

Can you provide more documentation here to document what these are (perhaps adding a link to the higher level description you write on TableProvider::metadata_columns)

pub const METADATA_OFFSET: usize = usize::MAX >> 1;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct QualifiedSchema {
Contributor

Please document what this struct is used for

}
}

pub fn metadata_schema(&self) -> &Option<QualifiedSchema> {
Contributor

Please add documentation -- imagine you are someone using this API who is not familiar with metadata_schema or the content of this API. I think you would want a short summary of what this is and then a link to the full details.

use datafusion_common::METADATA_OFFSET;
use itertools::Itertools;

/// A User, with an id and a bank account
Contributor

This is actually quite a cool example of using the metadata index.

Eventually I think it would be great to add an example in https://github.com/apache/datafusion/tree/main/datafusion-examples

.unwrap();
let batch = concat_batches(&all_batchs[0].schema(), &all_batchs).unwrap();
assert_eq!(batch.num_rows(), 2);
let serializer = CsvSerializer::new().with_header(false);
Contributor

To check the results, can you please use assert_batches_eq instead of converting to CSV?

That is

  1. more consistent with the rest of the codebase
  2. easier to read
  3. easier to update

For example:

let expected = vec![
    "+----+----+",
    "| c1 | c2 |",
    "+----+----+",
    "| 1  | 1  |",
    "| 1  | 2  |",
    "| 1  | 3  |",
    "| 1  | 4  |",
    "| 1  | 5  |",
    "| 1  | 6  |",
    "| 1  | 7  |",
    "| 1  | 8  |",
    "| 1  | 9  |",
    "| 1  | 10 |",
    "| 2  | 1  |",
    "| 2  | 2  |",
    "| 2  | 3  |",
    "| 2  | 4  |",
    "| 2  | 5  |",
    "| 2  | 6  |",
    "| 2  | 7  |",
    "| 2  | 8  |",
    "| 2  | 9  |",
    "| 2  | 10 |",
    "+----+----+",
];
assert_batches_sorted_eq!(expected, &results);

let all_batchs = df5.collect().await.unwrap();
let batch = concat_batches(&all_batchs[0].schema(), &all_batchs).unwrap();
let bytes = serializer.serialize(batch, true).unwrap();
assert_eq!(bytes, "1,2\n");
Contributor

Can we please also add a test for more than one metadata column?

@alamb (Contributor) commented on Jan 12, 2025

Something other people have asked for in the past (which I can't find now) is the ability to know what file a particular row came from in a listing table that combines multiple files.

@adriangb (Contributor)

We want this as well to hide "special" internal columns we create to speed up JSON columns. +1 for the feature!

@adriangb (Contributor)

My only question is whether "metadata" is the right name for these columns. Could it be "system" columns or something like that?

@Omega359 (Contributor)

Metadata column is the name I'm familiar with from other systems, for example Spark/Databricks.

@adriangb (Contributor)

I guess the naming doesn't really hurt our use case, so okay, let's go with that if it means something in the domain in general 👍🏻
