Conversation

@dqkqd dqkqd commented Sep 26, 2025

Which issue does this PR close?

Rationale for this change

DataFusion cannot infer types correctly for CSV files when one of their chunks contains only NULLs.

Example: Consider the file below

a       // 1st chunk
1       // 1st chunk: Int64
2       // 1st chunk: Int64
<null>  // 2nd chunk: null
<null>  // 2nd chunk: null

Because all the records in the second chunk are nulls, DataFusion sees the possible data types for `a` as [Int64, Null], and thus infers the data type as Utf8:

} else {
    // default to Utf8 for conflicting datatypes (e.g bool and int)
    Field::new(field_name, DataType::Utf8, true)
}

What changes are included in this PR?

Ignore DataType::Null when inferring data type from possible data types.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels Sep 26, 2025

@alamb alamb left a comment


Thank you so much @dqkqd for debugging this issue. It is a great find.

I left a few suggestions on how to improve the code, but given this solves the issue I think we can do them as a follow on (or never)

I verified that this PR does fix the original report in #17517:

> select count(*), "Service:Type" from 'services-parquet' GROUP BY 2 order by 1 desc;
+----------+----------------------+
| count(*) | Service:Type         |
+----------+----------------------+
| 64689377 | Sprinter             |
| 33813656 | Stoptrein            |
| 31252046 | Intercity            |
| 3714068  | Sneltrein            |
| 2284824  | Stopbus i.p.v. trein |
| 1642463  | Intercity direct     |
| 1532087  | stoptrein            |
| 1235170  | Stopbus ipv trein    |
| 772315   | Snelbus i.p.v. trein |
| 515543   | Snelbus ipv trein    |
...
| 83       | Train Charter        |
| 37       | Krokus Express       |
| 12       | Niet instappen       |
| 4        | InnovationXpress     |
| 2        | Tram i.p.v. trein    |
+----------+----------------------+
37 row(s) fetched.
Elapsed 0.170 seconds.

However, I couldn't figure out how it worked. 🤔 It seems like this PR has changed the CSV type inference so that it correctly resolves "Stop:Departure time" to a timestamp in 'services-parquet/services-2020.parquet'.

Can you explain why that is?

>  describe 'services-parquet/services-2020.parquet';
+------------------------------+-------------------------+-------------+
| column_name                  | data_type               | is_nullable |
+------------------------------+-------------------------+-------------+
| Service:RDT-ID               | Int64                   | YES         |
| Service:Date                 | Date32                  | YES         |
| Service:Type                 | Utf8View                | YES         |
| Service:Company              | Utf8View                | YES         |
| Service:Train number         | Int64                   | YES         |
| Service:Completely cancelled | Boolean                 | YES         |
| Service:Partly cancelled     | Boolean                 | YES         |
| Service:Maximum delay        | Int64                   | YES         |
| Stop:RDT-ID                  | Int64                   | YES         |
| Stop:Station code            | Utf8View                | YES         |
| Stop:Station name            | Utf8View                | YES         |
| Stop:Arrival time            | Timestamp(Second, None) | YES         |
| Stop:Arrival delay           | Int64                   | YES         |
| Stop:Arrival cancelled       | Boolean                 | YES         |
| Stop:Departure time          | Timestamp(Second, None) | YES         | <-- this column is now correct 
| Stop:Departure delay         | Int64                   | YES         |
| Stop:Departure cancelled     | Boolean                 | YES         |
+------------------------------+-------------------------+-------------+
17 row(s) fetched.
Elapsed 0.008 seconds.

Though for some reason I still can't `select *` from the original directory of CSV files, which I will file a follow-on ticket for.

// determine data type based on possible types, ignoring DataType::Null,
// if there are incompatible types, use DataType::Utf8
match data_type_possibilities.len() {
    1 => Field::new(

Thank you @dqkqd -- this looks like it would work well. I played around with it and I think we might be able to make this simpler and more efficient by passing in the HashSet and removing the null. Something like this seemed to work locally:

// changed signature to take Vec<HashSet<DataType>> ----v
fn build_schema_helper(names: Vec<String>, types: Vec<HashSet<DataType>>) -> Schema {
    let fields = names
        .into_iter()
        .zip(types)
        .map(|(field_name, mut data_type_possibilities)| {
            // ripped from arrow::csv::reader::infer_reader_schema_with_csv_options
            // determine data type based on possible types

            // Remove Null (missing column) from possibilities
            data_type_possibilities.remove(&DataType::Null); // <-- changed this

            // if there are incompatible types, use DataType::Utf8
            match data_type_possibilities.len() {
...


Though when I did this I found some test failures

---- datasource::file_format::csv::tests::infer_schema stdout ----

thread 'datasource::file_format::csv::tests::infer_schema' panicked at datafusion/core/src/datasource/file_format/csv.rs:273:9:
assertion `left == right` failed
  left: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Null", "c15: Utf8"]
 right: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Utf8", "c15: Utf8"]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- datasource::file_format::csv::tests::infer_schema_with_null_regex stdout ----

thread 'datasource::file_format::csv::tests::infer_schema_with_null_regex' panicked at datafusion/core/src/datasource/file_format/csv.rs:324:9:
assertion `left == right` failed
  left: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Null", "c15: Null"]
 right: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Utf8", "c15: Utf8"]


failures:
    datasource::file_format::csv::tests::infer_schema
    datasource::file_format::csv::tests::infer_schema_with_null_regex

@dqkqd commented Sep 26, 2025

Thanks for pointing this out. I simplified the code as suggested.

These tests failed because some columns contained only nulls: after removing DataType::Null, the match arm fell through to the Utf8 case, but the tests asserted that all-null columns should keep the data type DataType::Null.

}

#[tokio::test]
async fn test_infer_schema_stream_separated_chunks_with_nulls() -> Result<()> {

I don't understand how this test covers the new code.

However, I did verify that without the code change, the new test fails like this:

thread 'datasource::file_format::csv::tests::test_infer_schema_stream_separated_chunks_with_nulls' panicked at datafusion/core/src/datasource/file_format/csv.rs:511:9:
assertion `left == right` failed
  left: ["c1: Int64", "c2: Float64"]
 right: ["c1: Utf8", "c2: Utf8"]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


DataFusion infers the data type of each chunk separately, then combines all the possible types.

This test creates a ChunkedStore that reads each line as a separate chunk (one of which contains only nulls), then ensures the type inference isn't skewed by the null-only chunk.

I should have added comments to make the test clearer.


dqkqd commented Sep 27, 2025

The CI test failed. It asserts that an empty table's columns are inferred as Utf8.
DuckDB does the same, so I think this is correct.

D CREATE TABLE empty AS
  SELECT * FROM read_csv_auto('empty.csv');
D select * from empty;
┌─────────┬─────────┬─────────┐
│   c1    │   c2    │   c3    │
│ varchar │ varchar │ varchar │
├─────────┴─────────┴─────────┤
│           0 rows            │
└─────────────────────────────┘

When I checked how DuckDB handles a table with null-only columns, it inferred those columns as VARCHAR.

D CREATE TABLE has_nulls_column AS
  SELECT * FROM read_csv_auto('has_nulls_column.csv');
D select * from has_nulls_column;
┌───────┬───────┬─────────┐
│  c1   │  c2   │   c3    │
│ int64 │ int64 │ varchar │
├───────┼───────┼─────────┤
│     1 │     2 │ NULL    │
│     3 │     4 │ NULL    │
└───────┴───────┴─────────┘

However, DataFusion infers those as Null. I think we should change them to Utf8.

> CREATE EXTERNAL TABLE has_nulls_column STORED AS CSV LOCATION 'has_nulls_column.csv' OPTIONS ('format.has_header' 'true');
0 row(s) fetched.
Elapsed 0.025 seconds

> select column_name, data_type, ordinal_position from information_schema.columns where table_name='has_nulls_column';
+-------------+-----------+------------------+
| column_name | data_type | ordinal_position |
+-------------+-----------+------------------+
| c1          | Int64     | 0                |
| c2          | Int64     | 1                |
| c3          | Null      | 2                |
+-------------+-----------+------------------+
3 row(s) fetched.
Elapsed 0.010 seconds.

I don't think this is hard to do: just fall back to Utf8 when a column is all nulls,
then add some test cases for mixed-null columns, all-null columns, and the null regex (maybe rewriting the tests from #13228).

@alamb Would you like me to handle these cases in this PR?
Or should I just handle the empty table by returning Utf8 and cover the remaining cases in another PR?

@dqkqd dqkqd force-pushed the csv-format-incorrect-type-inference branch from 20f8e51 to 8f1929a Compare September 27, 2025 03:19
@dqkqd dqkqd marked this pull request as draft September 27, 2025 03:26
@dqkqd dqkqd force-pushed the csv-format-incorrect-type-inference branch from 8f1929a to 9913e62 Compare September 27, 2025 03:33
@dqkqd dqkqd marked this pull request as ready for review September 27, 2025 03:34

dqkqd commented Sep 29, 2025

I've just realized that returning Utf8 for columns with only nulls (or for empty files) causes a schema mismatch when reading a directory containing such files alongside normal files.
So returning DataType::Null is the better choice. I'll revert the code and update the test case.

> cat test_data/a.csv
c1,c2,c3

> cat test_data/b.csv
c1,c2,c3
1,1,1
2,2,2

This fails on main:

DataFusion CLI v50.0.0
> select * from 'test_data';
Arrow error: Schema error: Fail to merge schema field 'c1' because the from data_type = Utf8 does not equal Int64

@dqkqd dqkqd force-pushed the csv-format-incorrect-type-inference branch from 9913e62 to 45e1027 Compare September 29, 2025 12:53
@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label Sep 29, 2025

// a stream where each line is read as a separate chunk,
// data type for each chunk is inferred separately.
// +----+-----+----+

thank you for these comments

@alamb left a comment

Thank you @dqkqd -- this now looks really nice 🏅 🏆 ❤️

CREATE EXTERNAL TABLE empty STORED AS CSV LOCATION '../core/tests/data/empty.csv' OPTIONS ('format.has_header' 'true');

query TTI
select column_name, data_type, ordinal_position from information_schema.columns where table_name='empty';;

this makes sense to me


alamb commented Sep 29, 2025

Sadly, I tried this branch with the reproducer in

And it seems that CSV type inference is still not working correctly 😢

To be clear, I don't think anything needs to change in this PR; I was just hoping it also fixed something else, which it did not.

Update: @EeshanBembi actually already has a fix for it!

@alamb alamb added this pull request to the merge queue Oct 1, 2025

alamb commented Oct 1, 2025

🦾 🚀

Merged via the queue into apache:main with commit 6a61304 Oct 1, 2025
28 checks passed

alamb commented Oct 1, 2025

Thanks again @dqkqd -- this is a very nice first PR


Labels

core (Core DataFusion crate), datasource (Changes to the datasource crate), sqllogictest (SQL Logic Tests (.slt))


Development

Successfully merging this pull request may close these issues.

Can't read a directory of parquet files: 'Stop:Arrival time' because the from data_type = Timestamp(Second, None) does not equal Utf8
