Conversation

@dqkqd dqkqd commented Sep 26, 2025

Which issue does this PR close?

Rationale for this change

DataFusion cannot infer types correctly for CSV files when one of their chunks contains only NULLs.

Example: Consider the file below

a       // 1st chunk
1       // 1st chunk: Int64
2       // 1st chunk: Int64
<null>  // 2nd chunk: null
<null>  // 2nd chunk: null

Because all the records in the second chunk are nulls, DataFusion sees the possible data types for `a` as [Int64, Null], and thus infers the data type as Utf8:

} else {
    // default to Utf8 for conflicting datatypes (e.g bool and int)
    Field::new(field_name, DataType::Utf8, true)
}

What changes are included in this PR?

Ignore DataType::Null when inferring data type from possible data types.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels Sep 26, 2025

@alamb alamb left a comment


Thank you so much @dqkqd for debugging this issue. It is a great find.

I left a few suggestions on how to improve the code, but given this solves the issue I think we can do them as a follow on (or never)

I verified that this PR does fix the original report in #17517:

> select count(*), "Service:Type" from 'services-parquet' GROUP BY 2 order by 1 desc;
+----------+----------------------+
| count(*) | Service:Type         |
+----------+----------------------+
| 64689377 | Sprinter             |
| 33813656 | Stoptrein            |
| 31252046 | Intercity            |
| 3714068  | Sneltrein            |
| 2284824  | Stopbus i.p.v. trein |
| 1642463  | Intercity direct     |
| 1532087  | stoptrein            |
| 1235170  | Stopbus ipv trein    |
| 772315   | Snelbus i.p.v. trein |
| 515543   | Snelbus ipv trein    |
...
| 83       | Train Charter        |
| 37       | Krokus Express       |
| 12       | Niet instappen       |
| 4        | InnovationXpress     |
| 2        | Tram i.p.v. trein    |
+----------+----------------------+
37 row(s) fetched.
Elapsed 0.170 seconds.

However, I couldn't figure out how it worked. 🤔 It seems like this PR has changed the CSV type inference so that it correctly resolves "Stop:Departure time" to a timestamp in 'services-parquet/services-2020.parquet'.

Can you explain why that is?

>  describe 'services-parquet/services-2020.parquet';
+------------------------------+-------------------------+-------------+
| column_name                  | data_type               | is_nullable |
+------------------------------+-------------------------+-------------+
| Service:RDT-ID               | Int64                   | YES         |
| Service:Date                 | Date32                  | YES         |
| Service:Type                 | Utf8View                | YES         |
| Service:Company              | Utf8View                | YES         |
| Service:Train number         | Int64                   | YES         |
| Service:Completely cancelled | Boolean                 | YES         |
| Service:Partly cancelled     | Boolean                 | YES         |
| Service:Maximum delay        | Int64                   | YES         |
| Stop:RDT-ID                  | Int64                   | YES         |
| Stop:Station code            | Utf8View                | YES         |
| Stop:Station name            | Utf8View                | YES         |
| Stop:Arrival time            | Timestamp(Second, None) | YES         |
| Stop:Arrival delay           | Int64                   | YES         |
| Stop:Arrival cancelled       | Boolean                 | YES         |
| Stop:Departure time          | Timestamp(Second, None) | YES         | <-- this column is now correct 
| Stop:Departure delay         | Int64                   | YES         |
| Stop:Departure cancelled     | Boolean                 | YES         |
+------------------------------+-------------------------+-------------+
17 row(s) fetched.
Elapsed 0.008 seconds.

Though for some reason I still can't `select *` from the original directory of CSV files, which I will file a follow-on ticket for.

// determine data type based on possible types, ignoring DataType::Null,
// if there are incompatible types, use DataType::Utf8
match data_type_possibilities.len() {
    1 => Field::new(

Thank you @dqkqd -- this looks like it would work well. I played around with it and I think we might be able to make this simpler and more efficient by passing in the HashSet and removing the null. Something like this seemed to work locally:

// changed signature to take Vec<HashSet<DataType>> ----v
fn build_schema_helper(names: Vec<String>, types: Vec<HashSet<DataType>>) -> Schema {
    let fields = names
        .into_iter()
        .zip(types)
        .map(|(field_name, mut data_type_possibilities)| {
            // ripped from arrow::csv::reader::infer_reader_schema_with_csv_options
            // determine data type based on possible types

            // Remove Null (missing column) from possibilities
            data_type_possibilities.remove(&DataType::Null); // <-- changed this

            // if there are incompatible types, use DataType::Utf8
            match data_type_possibilities.len() {
...


Though when I did this I found some test failures

---- datasource::file_format::csv::tests::infer_schema stdout ----

thread 'datasource::file_format::csv::tests::infer_schema' panicked at datafusion/core/src/datasource/file_format/csv.rs:273:9:
assertion `left == right` failed
  left: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Null", "c15: Utf8"]
 right: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Utf8", "c15: Utf8"]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- datasource::file_format::csv::tests::infer_schema_with_null_regex stdout ----

thread 'datasource::file_format::csv::tests::infer_schema_with_null_regex' panicked at datafusion/core/src/datasource/file_format/csv.rs:324:9:
assertion `left == right` failed
  left: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Null", "c15: Null"]
 right: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Utf8", "c15: Utf8"]


failures:
    datasource::file_format::csv::tests::infer_schema
    datasource::file_format::csv::tests::infer_schema_with_null_regex

@dqkqd commented Sep 26, 2025

Thanks for pointing this out. I simplified the code as suggested.

These tests failed because some columns contained only nulls: after removing DataType::Null, the match arm fell through to the Utf8 case, but the tests asserted that all-null columns should keep the data type DataType::Null.

}

#[tokio::test]
async fn test_infer_schema_stream_separated_chunks_with_nulls() -> Result<()> {

I don't understand how this test covers the new code.

However, I did verify that without the code change, the new test fails like this:

thread 'datasource::file_format::csv::tests::test_infer_schema_stream_separated_chunks_with_nulls' panicked at datafusion/core/src/datasource/file_format/csv.rs:511:9:
assertion `left == right` failed
  left: ["c1: Int64", "c2: Float64"]
 right: ["c1: Utf8", "c2: Utf8"]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


DataFusion infers the data type of each chunk separately, then combines all the possible types.

This test creates a ChunkedStore that reads each line as a separate chunk (one of which contains only nulls), then ensures the type inference isn't skewed by the null-only chunk.

I should have added comments to make the test clearer.


dqkqd commented Sep 27, 2025

The CI test failed. It asserts that an empty table's columns are inferred as Utf8.
DuckDB does the same, so I think this is correct.

D CREATE TABLE empty AS
  SELECT * FROM read_csv_auto('empty.csv');
D select * from empty;
┌─────────┬─────────┬─────────┐
│   c1    │   c2    │   c3    │
│ varchar │ varchar │ varchar │
├─────────┴─────────┴─────────┤
│           0 rows            │
└─────────────────────────────┘

When I checked how DuckDB handles a table with null-only columns, it inferred those columns as VARCHAR.

D CREATE TABLE has_nulls_column AS
  SELECT * FROM read_csv_auto('has_nulls_column.csv');
D select * from has_nulls_column;
┌───────┬───────┬─────────┐
│  c1   │  c2   │   c3    │
│ int64 │ int64 │ varchar │
├───────┼───────┼─────────┤
│     1 │     2 │ NULL    │
│     3 │     4 │ NULL    │
└───────┴───────┴─────────┘

However, DataFusion infers those as Null. I think we should change them to Utf8.

> CREATE EXTERNAL TABLE has_nulls_column STORED AS CSV LOCATION 'has_nulls_column.csv' OPTIONS ('format.has_header' 'true');
0 row(s) fetched.
Elapsed 0.025 seconds

> select column_name, data_type, ordinal_position from information_schema.columns where table_name='has_nulls_column';
+-------------+-----------+------------------+
| column_name | data_type | ordinal_position |
+-------------+-----------+------------------+
| c1          | Int64     | 0                |
| c2          | Int64     | 1                |
| c3          | Null      | 2                |
+-------------+-----------+------------------+
3 row(s) fetched.
Elapsed 0.010 seconds.

I don't think this is hard to do: just fall back to Utf8 when a column is all nulls,
then add some test cases for mixed-null columns, all-null columns, and the null regex (maybe rewriting the tests from #13228).

@alamb Would you like me to handle these cases in this PR?
Or should I just handle the empty table by returning Utf8 and cover the remaining cases in another PR?

@dqkqd dqkqd force-pushed the csv-format-incorrect-type-inference branch from 20f8e51 to 8f1929a Compare September 27, 2025 03:19
@dqkqd dqkqd marked this pull request as draft September 27, 2025 03:26
@dqkqd dqkqd force-pushed the csv-format-incorrect-type-inference branch from 8f1929a to 9913e62 Compare September 27, 2025 03:33
@dqkqd dqkqd marked this pull request as ready for review September 27, 2025 03:34

dqkqd commented Sep 29, 2025

I've just realized that returning Utf8 for columns with only nulls (or for empty files) causes a schema mismatch when reading a directory containing such files alongside normal files.
So returning DataType::Null is the better choice. I'll revert the code and update the test case.

> cat test_data/a.csv
c1,c2,c3

> cat test_data/b.csv
c1,c2,c3
1,1,1
2,2,2

This fails on main:

DataFusion CLI v50.0.0
> select * from 'test_data';
Arrow error: Schema error: Fail to merge schema field 'c1' because the from data_type = Utf8 does not equal Int64

@dqkqd dqkqd force-pushed the csv-format-incorrect-type-inference branch from 9913e62 to 45e1027 Compare September 29, 2025 12:53
@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label Sep 29, 2025

// a stream where each line is read as a separate chunk,
// data type for each chunk is inferred separately.
// +----+-----+----+

thank you for these comments

@alamb left a comment

Thank you @dqkqd -- this now looks really nice 🏅 🏆 ❤️

CREATE EXTERNAL TABLE empty STORED AS CSV LOCATION '../core/tests/data/empty.csv' OPTIONS ('format.has_header' 'true');

query TTI
select column_name, data_type, ordinal_position from information_schema.columns where table_name='empty';;

this makes sense to me


alamb commented Sep 29, 2025

Sadly, I tried this branch with the reproducer in

And it seems that CSV type inference is still not working correctly 😢

To be clear, I don't think anything needs to change in this PR; I was just hoping it also fixed something else, which it did not.

Update: @EeshanBembi actually already has a fix for it!

@alamb alamb added this pull request to the merge queue Oct 1, 2025

alamb commented Oct 1, 2025

🦾 🚀

Merged via the queue into apache:main with commit 6a61304 Oct 1, 2025
28 checks passed

alamb commented Oct 1, 2025

Thanks again @dqkqd -- this is a very nice first PR


Labels

core (Core DataFusion crate), datasource (Changes to the datasource crate), sqllogictest (SQL Logic Tests (.slt))


Development

Successfully merging this pull request may close these issues.

Can't read a directory of parquet files: 'Stop:Arrival time' because the from data_type = Timestamp(Second, None) does not equal Utf8
