fix: ignore DataType::Null in possible types during csv type inference
#17796
Conversation
Thank you so much @dqkqd for debugging this issue. It is a great find.
I left a few suggestions on how to improve the code, but given this solves the issue I think we can do them as a follow-on (or never).
I verified that this PR does fix the original report in #17517:
> select count(*), "Service:Type" from 'services-parquet' GROUP BY 2 order by 1 desc;
+----------+----------------------+
| count(*) | Service:Type |
+----------+----------------------+
| 64689377 | Sprinter |
| 33813656 | Stoptrein |
| 31252046 | Intercity |
| 3714068 | Sneltrein |
| 2284824 | Stopbus i.p.v. trein |
| 1642463 | Intercity direct |
| 1532087 | stoptrein |
| 1235170 | Stopbus ipv trein |
| 772315 | Snelbus i.p.v. trein |
| 515543 | Snelbus ipv trein |
...
| 83 | Train Charter |
| 37 | Krokus Express |
| 12 | Niet instappen |
| 4 | InnovationXpress |
| 2 | Tram i.p.v. trein |
+----------+----------------------+
37 row(s) fetched.
Elapsed 0.170 seconds.

However, I couldn't figure out how it worked. 🤔 It seems like this PR has changed the CSV type inference so that it correctly resolves "Stop:Departure time" to a timestamp in 'services-parquet/services-2020.parquet'.
Can you explain why that is?
> describe 'services-parquet/services-2020.parquet';
+------------------------------+-------------------------+-------------+
| column_name | data_type | is_nullable |
+------------------------------+-------------------------+-------------+
| Service:RDT-ID | Int64 | YES |
| Service:Date | Date32 | YES |
| Service:Type | Utf8View | YES |
| Service:Company | Utf8View | YES |
| Service:Train number | Int64 | YES |
| Service:Completely cancelled | Boolean | YES |
| Service:Partly cancelled | Boolean | YES |
| Service:Maximum delay | Int64 | YES |
| Stop:RDT-ID | Int64 | YES |
| Stop:Station code | Utf8View | YES |
| Stop:Station name | Utf8View | YES |
| Stop:Arrival time | Timestamp(Second, None) | YES |
| Stop:Arrival delay | Int64 | YES |
| Stop:Arrival cancelled | Boolean | YES |
| Stop:Departure time | Timestamp(Second, None) | YES | <-- this column is now correct
| Stop:Departure delay | Int64 | YES |
| Stop:Departure cancelled | Boolean | YES |
+------------------------------+-------------------------+-------------+
17 row(s) fetched.
Elapsed 0.008 seconds.
Though for some reason I still can't select * from the original directory of CSV files; I will file a follow-on ticket for that.
// determine data type based on possible types, ignoring DataType::Null,
// if there are incompatible types, use DataType::Utf8
match data_type_possibilities.len() {
    1 => Field::new(
Thank you @dqkqd -- this looks like it would work well. I played around with it and I think we might be able to make this simpler and more efficient by passing in the HashSet and removing the null. Something like this seemed to work locally:
// changed signature to take Vec<HashSet<DataType>>. ----v
fn build_schema_helper(names: Vec<String>, types: Vec<HashSet<DataType>>) -> Schema {
    let fields = names
        .into_iter()
        .zip(types)
        .map(|(field_name, mut data_type_possibilities)| {
            // ripped from arrow::csv::reader::infer_reader_schema_with_csv_options
            // determine data type based on possible types
            // Remove Null (missing column) from possibilities
            data_type_possibilities.remove(&DataType::Null); // <--- changed this
            // if there are incompatible types, use DataType::Utf8
            match data_type_possibilities.len() {
            ...
Though when I did this I found some test failures:
---- datasource::file_format::csv::tests::infer_schema stdout ----
thread 'datasource::file_format::csv::tests::infer_schema' panicked at datafusion/core/src/datasource/file_format/csv.rs:273:9:
assertion `left == right` failed
left: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Null", "c15: Utf8"]
right: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Utf8", "c15: Utf8"]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---- datasource::file_format::csv::tests::infer_schema_with_null_regex stdout ----
thread 'datasource::file_format::csv::tests::infer_schema_with_null_regex' panicked at datafusion/core/src/datasource/file_format/csv.rs:324:9:
assertion `left == right` failed
left: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Null", "c15: Null"]
right: ["c1: Utf8", "c2: Int64", "c3: Int64", "c4: Int64", "c5: Int64", "c6: Int64", "c7: Int64", "c8: Int64", "c9: Int64", "c10: Utf8", "c11: Float64", "c12: Float64", "c13: Utf8", "c14: Utf8", "c15: Utf8"]
failures:
datasource::file_format::csv::tests::infer_schema
datasource::file_format::csv::tests::infer_schema_with_null_regex
Thanks for pointing that out. I simplified the code as suggested.

These tests failed because there were columns containing only nulls: after removing `DataType::Null` from the possibilities, the match arm fell through to the `Utf8` case. But the tests asserted that null-only columns should have data type `DataType::Null`.
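For reference, here is a minimal sketch of how the per-column resolution could look once `DataType::Null` is removed. This is an assumption based on the discussion, not the exact PR code, and it omits the `{Int64, Float64}` -> `Float64` promotion that the upstream arrow helper additionally performs:

use std::collections::HashSet;
use arrow::datatypes::{DataType, Field};

// Hypothetical standalone version of the per-column resolution step.
fn resolve_column(field_name: String, mut possibilities: HashSet<DataType>) -> Field {
    // Drop Null contributed by chunks that contained only nulls.
    possibilities.remove(&DataType::Null);
    match possibilities.len() {
        // The column was null in every chunk; fall back to Utf8,
        // mirroring DuckDB's varchar fallback discussed below.
        0 => Field::new(field_name, DataType::Utf8, true),
        // Exactly one candidate type: use it.
        1 => Field::new(
            field_name,
            possibilities.iter().next().unwrap().clone(),
            true,
        ),
        // Incompatible candidate types: fall back to Utf8.
        _ => Field::new(field_name, DataType::Utf8, true),
    }
}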
}

#[tokio::test]
async fn test_infer_schema_stream_separated_chunks_with_nulls() -> Result<()> {
I don't understand how this test covers the new code.
However, I did verify that without this code change, this PR fails like this:
thread 'datasource::file_format::csv::tests::test_infer_schema_stream_separated_chunks_with_nulls' panicked at datafusion/core/src/datasource/file_format/csv.rs:511:9:
assertion `left == right` failed
left: ["c1: Int64", "c2: Float64"]
right: ["c1: Utf8", "c2: Utf8"]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
DataFusion infers a data type from each chunk separately, then combines all the possible types. This test creates a ChunkedStore that reads each line as a separate chunk (one of the chunks contains only nulls), then ensures the type inference isn't skewed by the null-only chunk.

I should have added comments to make the test clearer.
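To illustrate the failure mode this guards against, here is a small self-contained snippet (hypothetical, not the test itself) showing how a null-only chunk used to poison the merged possibilities:

use std::collections::HashSet;
use arrow::datatypes::DataType;

fn main() {
    // Possibilities inferred from a chunk containing real values...
    let chunk1 = HashSet::from([DataType::Int64]);
    // ...and from a chunk containing only nulls.
    let chunk2 = HashSet::from([DataType::Null]);

    // Per-column possibilities from all chunks are merged.
    let merged: HashSet<DataType> = chunk1.union(&chunk2).cloned().collect();

    // Before the fix: {Int64, Null} has two entries, so the column
    // fell through to the Utf8 fallback. After the fix, Null is
    // ignored, leaving {Int64}, and the column stays Int64.
    assert_eq!(merged.len(), 2);
}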
The test failed. It ensures an empty table should have its columns inferred as …. For comparison, this is how DuckDB handles an empty file:

D CREATE TABLE empty AS
  SELECT * FROM read_csv_auto('empty.csv');
D select * from empty;
┌─────────┬─────────┬─────────┐
│   c1    │   c2    │   c3    │
│ varchar │ varchar │ varchar │
├─────────┴─────────┴─────────┤
│           0 rows            │
└─────────────────────────────┘

When I checked how DuckDB handles a table with null-only columns, it infers those columns as varchar:

D CREATE TABLE has_nulls_column AS
  SELECT * FROM read_csv_auto('has_nulls_column.csv');
D select * from has_nulls_column;
┌───────┬───────┬─────────┐
│  c1   │  c2   │   c3    │
│ int64 │ int64 │ varchar │
├───────┼───────┼─────────┤
│     1 │     2 │ NULL    │
│     3 │     4 │ NULL    │
└───────┴───────┴─────────┘

However, DataFusion infers those as Null:

> CREATE EXTERNAL TABLE has_nulls_column STORED AS CSV LOCATION 'has_nulls_column.csv' OPTIONS ('format.has_header' 'true');
0 row(s) fetched.
Elapsed 0.025 seconds

> select column_name, data_type, ordinal_position from information_schema.columns where table_name='has_nulls_column';
+-------------+-----------+------------------+
| column_name | data_type | ordinal_position |
+-------------+-----------+------------------+
| c1          | Int64     | 0                |
| c2          | Int64     | 1                |
| c3          | Null      | 2                |
+-------------+-----------+------------------+
3 row(s) fetched.
Elapsed 0.010 seconds.

I don't think this is hard to do, just fall back to `Utf8` for these columns. @alamb Would you like me to handle these cases in this PR?
I've just realized that returning … This fails in …
// a stream where each line is read as a separate chunk,
// data type for each chunk is inferred separately.
// +----+-----+----+
thank you for these comments
Thank you @dqkqd -- this now looks really nice 🏅 🏆 ❤️
CREATE EXTERNAL TABLE empty STORED AS CSV LOCATION '../core/tests/data/empty.csv' OPTIONS ('format.has_header' 'true');

query TTI
select column_name, data_type, ordinal_position from information_schema.columns where table_name='empty';
this makes sense to me
Sadly, I tried this branch with the reproducer in … and it seems that CSV type inference is still not working correctly 😢

To be clear, I don't think anything needs to change in this PR, I was just hoping it also fixed something else, which it did not.

Update: @EeshanBembi actually already has a fix for it!
🦾 🚀
Thanks again @dqkqd -- this is a very nice first PR
Which issue does this PR close?

- Closes #17517.
Rationale for this change
DataFusion cannot infer types correctly for CSV files where one of their chunks contains only NULLs.
Example: consider a file like the one sketched below.
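A hypothetical file with this shape, assuming one column `a` whose later rows are all empty (null), might look like:

a
1
2
3
<enough empty rows that the second inference chunk contains only nulls>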
Because all the records in the second chunk are nulls, DataFusion sees the possible data types for `a` as `[Int64, Null]`, and thus infers the data type as `Utf8`.

datafusion/datasource-csv/src/file_format.rs, lines 610 to 613 in 691dd47
What changes are included in this PR?
Ignore `DataType::Null` when inferring data type from possible data types.

Are these changes tested?
Yes.
Are there any user-facing changes?
No.