
Add coerce int96 option for Parquet to support different TimeUnits, test int96_from_spark.parquet from parquet-testing #15537


Merged
merged 19 commits into from
Apr 15, 2025

Conversation

mbutrovich
Contributor

@mbutrovich mbutrovich commented Apr 1, 2025

Which issue does this PR close?

  • N/A.

Rationale for this change

We are adding Spark-compatible int96 support to DataFusion Comet when using arrow-rs's Parquet reader. To achieve this, we first added support for arrow-rs to read int96 at resolutions other than nanosecond; previously it would produce nulls for non-null values. Next, we will add support to DataFusion to generate the schema that arrow-rs needs to read int96 at the resolution Spark expects. Finally, we will connect everything together in DataFusion Comet for accelerated Parquet reading of int96 values.
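For context, the (deprecated) int96 physical type stores a timestamp as 8 little-endian bytes of nanoseconds-within-day followed by 4 little-endian bytes of a Julian day number. A minimal decoding sketch, assuming that Impala/Spark layout (function names here are illustrative, not arrow-rs's API):

```rust
// Hypothetical sketch of int96 timestamp decoding, assuming the
// Impala/Spark layout: 8 LE bytes of nanoseconds-within-day followed
// by 4 LE bytes of Julian day number. Not arrow-rs's actual code.
const JULIAN_DAY_OF_UNIX_EPOCH: i128 = 2_440_588; // JD of 1970-01-01
const NANOS_PER_DAY: i128 = 86_400 * 1_000_000_000;

/// Decode an int96 value to nanoseconds since the Unix epoch.
/// Uses i128 internally because the full int96 range overflows i64 nanoseconds.
fn int96_to_epoch_nanos(raw: [u8; 12]) -> i128 {
    let nanos_of_day = i64::from_le_bytes(raw[0..8].try_into().unwrap()) as i128;
    let julian_day = u32::from_le_bytes(raw[8..12].try_into().unwrap()) as i128;
    (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
}

/// Coerce to a coarser resolution (here microseconds, i.e. "us"), so values
/// outside the i64 nanosecond range can still fit in a 64-bit timestamp.
fn coerce_to_micros(raw: [u8; 12]) -> Option<i64> {
    i64::try_from(int96_to_epoch_nanos(raw) / 1_000).ok()
}

fn main() {
    // 1970-01-02T00:00:00: Julian day 2440589, zero nanoseconds within the day.
    let mut raw = [0u8; 12];
    raw[8..12].copy_from_slice(&2_440_589u32.to_le_bytes());
    assert_eq!(int96_to_epoch_nanos(raw), NANOS_PER_DAY);
    assert_eq!(coerce_to_micros(raw), Some(86_400_000_000));
    println!("ok");
}
```

Decoding through i128 and then narrowing to the target unit is what makes coarser resolutions useful: some int96 values fit in an i64 at microsecond resolution even though they overflow i64 nanoseconds.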

What changes are included in this PR?

  • new option in ParquetOptions to coerce int96 resolution, with serialization support (I think I did this correctly)
  • bump parquet-testing submodule

Are these changes tested?

Added a new test that relies on new int96_from_spark.parquet in parquet-testing.

Are there any user-facing changes?

There is a new field in ParquetOptions. There is also an API change to a pub(crate) test function, which now accepts a provided table schema.

@github-actions github-actions bot added the core (Core DataFusion crate), common (Related to common crate), proto (Related to proto crate), and datasource (Changes to the datasource crate) labels Apr 3, 2025
@mbutrovich
Contributor Author

apache/parquet-testing#73 merged so I updated the parquet-testing dependency. Now waiting on an arrow-rs release and DF bumping to that version.

# Conflicts:
#	Cargo.lock
#	Cargo.toml
#	datafusion/datasource-parquet/src/opener.rs
@mbutrovich mbutrovich marked this pull request as ready for review April 14, 2025 16:15
@mbutrovich
Contributor Author

I believe all dependencies are updated, marking this as ready for review.

Member

@andygrove andygrove left a comment


LGTM although I am not familiar with the serde part. Thanks @mbutrovich.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 14, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 14, 2025
@andygrove andygrove requested review from alamb and parthchandra April 14, 2025 16:46
@andygrove andygrove requested a review from comphead April 14, 2025 16:46

@parthchandra parthchandra left a comment


lgtm.
(couple of minor nits, you may ignore)

/// which stores microsecond resolution timestamps in an int96 allowing it
/// to write values with a larger date range than 64-bit timestamps with
/// nanosecond resolution.
pub coerce_int96: Option<String>, transform = str::to_lowercase, default = None
Contributor


I wonder if there is any usecase for int96 other than timestamps.

Specifically, maybe we can simply always change the behavior and coerce int96 --> microseconds

At the very least default the option to be enabled perhaps

Contributor Author

@mbutrovich mbutrovich Apr 14, 2025


I wonder if there is any usecase for int96 other than timestamps.

Not as far as I know, but I don't think the (deprecated) int96 spec said that it had to represent a timestamp. It's just where Spark, Hive, Impala, etc. ended up.

Specifically, maybe we can simply always change the behavior and coerce int96 --> microseconds
At the very least default the option to be enabled perhaps

It's not clear to me that we should assume an int96 originated from a system that treated the originating timestamp as microseconds. While it's very likely that it did, I don't know how to treat the default in this case. Snowflake, for example, seems to use microseconds for its timestamps when dealing with Iceberg:
https://docs.snowflake.com/en/user-guide/tables-iceberg-data-types#supported-data-types-for-iceberg-tables

I'm hesitant to mess with defaults, but am open to hearing more from the community. @parthchandra

Contributor


Given how dominant spark is and how rarely used int96 is outside the spark ecosystem, I was thinking that basically if anyone had such a file it is likely we should treat the values as microseconds.

I don't have a strong preference; I was just trying to come up with a way to keep the code less complicated.


IIRC, the use of int96 originated from Impala/Parquet-cpp where it was used to store nanoseconds (The C++ implementation came from the Impala team). I think the Java implementation ended up with int96 in order to be compatible. Spark came along with its own variant and well, here we are.
(https://issues.apache.org/jira/browse/PARQUET-323)
The Parquet community assumed that this was the only usage of int96 before it was deprecated, so I feel it is safe for us to assume the same.
It can be done as a follow-up, though, I feel.

Contributor

@alamb alamb left a comment


Thanks @mbutrovich -- I am wondering what the difference is in practice.

I checked that the data seems to come out ok with datafusion 46. Can you remind me what the difference would be with this option (that the timestamp type is different?)

andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/parquet-testing$ ~/Software/datafusion-cli/datafusion-cli-46.0.0 -c "select * from 'data/int96_from_spark.parquet'";
DataFusion CLI v46.0.0
+-------------------------------+
| a                             |
+-------------------------------+
| 2024-01-01T20:34:56.123456    |
| 2024-01-01T01:00:00           |
| 1816-03-29T08:56:08.066277376 |
| 2024-12-30T23:00:00           |
| NULL                          |
| 2147-08-27T00:35:19.850745856 |
+-------------------------------+
6 row(s) fetched.
Elapsed 0.007 seconds.

andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/parquet-testing$ ~/Software/datafusion-cli/datafusion-cli-46.0.0 -c "describe 'data/int96_from_spark.parquet'";
DataFusion CLI v46.0.0
+-------------+-----------------------------+-------------+
| column_name | data_type                   | is_nullable |
+-------------+-----------------------------+-------------+
| a           | Timestamp(Nanosecond, None) | YES         |
+-------------+-----------------------------+-------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

@@ -58,6 +58,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus
| datafusion.execution.parquet.reorder_filters | false | (reading) If true, filter expressions evaluated during the parquet decoding operation will be reordered heuristically to minimize the cost of evaluation. If false, the filters are applied in the same order as written in the query |
Contributor


I recommend we add a test to

# Setup alltypes_plain table:
statement ok
CREATE EXTERNAL TABLE alltypes_plain (
id INT NOT NULL,
bool_col BOOLEAN NOT NULL,
tinyint_col TINYINT NOT NULL,
smallint_col SMALLINT NOT NULL,
int_col INT NOT NULL,
bigint_col BIGINT NOT NULL,
float_col FLOAT NOT NULL,
double_col DOUBLE NOT NULL,
date_string_col BYTEA NOT NULL,
string_col VARCHAR NOT NULL,
timestamp_col TIMESTAMP NOT NULL
)
STORED AS PARQUET
LOCATION '../../parquet-testing/data/alltypes_plain.parquet';
as well to show this working in the SQL layer

@mbutrovich
Contributor Author

I checked that the data seems to come out ok with datafusion 46. Can you remind me what the difference would be with this option (that the timestamp type is different?)

andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/parquet-testing$ ~/Software/datafusion-cli/datafusion-cli-46.0.0 -c "select * from 'data/int96_from_spark.parquet'";
DataFusion CLI v46.0.0
+-------------------------------+
| a                             |
+-------------------------------+
| 2024-01-01T20:34:56.123456    |
| 2024-01-01T01:00:00           |
| 1816-03-29T08:56:08.066277376 |
| 2024-12-30T23:00:00           |
| NULL                          |
| 2147-08-27T00:35:19.850745856 |
+-------------------------------+
6 row(s) fetched.
Elapsed 0.007 seconds.

Without coercion:

matt@Matthews-MacBook-Pro parquet-testing % ../target/debug/datafusion-cli -c "select * from 'data/int96_from_spark.parquet'";
DataFusion CLI v46.0.1
+-------------------------------+
| a                             |
+-------------------------------+
| 2024-01-01T20:34:56.123456    |
| 2024-01-01T01:00:00           |
| 1816-03-29T08:56:08.066277376 |
| 2024-12-30T23:00:00           |
| NULL                          |
| 1815-11-08T16:01:01.191053312 |
+-------------------------------+
6 row(s) fetched. 
Elapsed 0.006 seconds.

With coercion:

matt@Matthews-MacBook-Pro parquet-testing % ../target/debug/datafusion-cli -c "set datafusion.execution.parquet.coerce_int96 to 'us'; select * from 'data/int96_from_spark.parquet'";
DataFusion CLI v46.0.1
0 row(s) fetched. 
Elapsed 0.001 seconds.

+----------------------------+
| a                          |
+----------------------------+
| 2024-01-01T20:34:56.123456 |
| 2024-01-01T01:00:00        |
| NULL                       |
| 2024-12-30T23:00:00        |
| NULL                       |
| NULL                       |
+----------------------------+
6 row(s) fetched. 
Elapsed 0.005 seconds.

Frustratingly, those two new nulls aren't really nulls. They're the challenging values that we want to be able to read back in Comet, but we can't print them with current chrono behavior, which is why I didn't test at the SQL layer. The real values are in there, though, and we'll be able to do what we need to do in Comet at the SchemaAdapter level with this change.
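The reason coercion matters at all is the range tradeoff between resolutions: an i64 timestamp covers roughly ±292 years at nanosecond resolution but roughly ±292,000 years at microseconds. A small back-of-the-envelope sketch (not DataFusion code):

```rust
/// Approximate years representable on either side of the Unix epoch
/// by an i64 timestamp at the given number of ticks per second.
fn representable_years(ticks_per_second: f64) -> f64 {
    i64::MAX as f64 / (ticks_per_second * 86_400.0 * 365.25)
}

fn main() {
    // Nanoseconds cover roughly 1677..2262; microseconds cover hundreds
    // of millennia, which is why Spark's int96 range exceeds i64 nanos.
    println!("ns: ±{:.0} years", representable_years(1e9));
    println!("us: ±{:.0} years", representable_years(1e6));
}
```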

@andygrove
Member

Let's go ahead and merge this. The new config is optional, so won't affect existing users, and it meets the requirements for Comet to switch over to using DataFusion's ParquetExec by default.

@alamb It seems that we cannot add SQL tests until apache/arrow-rs#7287 is resolved, which we won't be able to do in time for the DataFusion 47.0.0 release.

@@ -296,6 +297,7 @@ datafusion.execution.parquet.bloom_filter_fpp NULL (writing) Sets bloom filter f
datafusion.execution.parquet.bloom_filter_ndv NULL (writing) Sets bloom filter number of distinct values. If NULL, uses default parquet writer setting
datafusion.execution.parquet.bloom_filter_on_read true (writing) Use any available bloom filters when reading parquet files
datafusion.execution.parquet.bloom_filter_on_write false (writing) Write bloom filters for all columns when creating parquet files
datafusion.execution.parquet.coerce_int96 NULL (reading) If true, parquet reader will read columns of physical type int96 as originating from a different resolution than nanosecond. This is useful for reading data from systems like Spark which stores microsecond resolution timestamps in an int96 allowing it to write values with a larger date range than 64-bit timestamps with nanosecond resolution.
Contributor


While reviewing this PR again, I think this text is not quite right -- the option doesn't take true; instead it takes a string value (ms, ns, us, etc.) for the timestamp resolution

External error: task 17 panicked with message "called `Result::unwrap()` on an `Err` value: Configuration(\"Unknown or unsupported parquet coerce_int96: true. Valid values are: ns, us, ms, and s.\")"

I will file a ticket
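The error above suggests the option value is validated against a fixed set of unit strings. A hedged sketch of what such validation might look like (illustrative names, not DataFusion's actual implementation):

```rust
// Illustrative sketch of mapping a coerce_int96 string to a timestamp
// unit; type and function names are hypothetical, not DataFusion's API.
#[derive(Debug, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Parse a unit string, rejecting anything outside ns/us/ms/s,
/// matching the error text quoted above.
fn parse_coerce_int96(s: &str) -> Result<TimeUnit, String> {
    match s.to_lowercase().as_str() {
        "s" => Ok(TimeUnit::Second),
        "ms" => Ok(TimeUnit::Millisecond),
        "us" => Ok(TimeUnit::Microsecond),
        "ns" => Ok(TimeUnit::Nanosecond),
        other => Err(format!(
            "Unknown or unsupported parquet coerce_int96: {other}. \
             Valid values are: ns, us, ms, and s."
        )),
    }
}

fn main() {
    assert_eq!(parse_coerce_int96("US"), Ok(TimeUnit::Microsecond));
    // A boolean-looking value like "true" would be rejected here.
    assert!(parse_coerce_int96("true").is_err());
    println!("ok");
}
```

This would explain the panic in the SLT run: the harness passed true where a unit string was expected.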

Contributor


Strangely setting it to true seems to work in datafusion-cli 🤔

DataFusion CLI v46.0.1
> set datafusion.execution.parquet.coerce_int96 = true;
0 row(s) fetched.
Elapsed 0.000 seconds.
> show all;
...

| datafusion.execution.parquet.coerce_int96                               | true                      |

@alamb
Contributor

alamb commented Apr 15, 2025

I created a PR with some SLT tests for this feature
