-
Notifications
You must be signed in to change notification settings - Fork 1.7k
fix: correct edge case where haystack with null element returns false instead of null #17818
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
I tried this in SQL on main (without this PR) and I couldn't make a reproducer The current answer matches duckdb > select array_has([null], 'foo');
+-----------------------------------------+
| array_has(make_array(NULL),Utf8("foo")) |
+-----------------------------------------+
| false |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
> select array_has([null], null);
+----------------------------------+
| array_has(make_array(NULL),NULL) |
+----------------------------------+
| NULL |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds. And duckdb andrewlamb@Andrews-MacBook-Pro-3 ~ % duckdb
DuckDB v1.4.0 (Andium) b8a06e4a22
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select array_has([null], 'foo');
┌─────────────────────────────────────────┐
│ array_has(main.list_value(NULL), 'foo') │
│ boolean │
├─────────────────────────────────────────┤
│ false │
└─────────────────────────────────────────┘
D select array_has([null], null);
┌────────────────────────────────────────┐
│ array_has(main.list_value(NULL), NULL) │
│ boolean │
├────────────────────────────────────────┤
│ NULL │
└────────────────────────────────────────┘ |
Hmm maybe I confused myself on the expected behaviour. Running these tests on main (without this fix): #[test]
fn test_array_has() -> Result<(), DataFusionError> {
let haystack_field = Arc::new(Field::new_list(
"haystack",
Field::new_list("", Field::new("", DataType::Int32, true), true),
true,
));
let needle_field = Arc::new(Field::new("needle", DataType::Int32, true));
let return_field = Arc::new(Field::new_list(
"return",
Field::new("", DataType::Boolean, true),
true,
));
let haystack = ListArray::new(
Field::new_list_field(DataType::Int32, true).into(),
OffsetBuffer::new(vec![0, 1].into()),
Arc::new(Int32Array::from(vec![1])) as ArrayRef,
Some(vec![false].into()),
);
dbg!(&haystack);
let haystack = ColumnarValue::Array(Arc::new(haystack));
let needle = ColumnarValue::Scalar(ScalarValue::Int32(Some(1)));
let result = ArrayHas::new().invoke_with_args(ScalarFunctionArgs {
args: vec![haystack, needle],
arg_fields: vec![haystack_field, needle_field],
number_rows: 1,
return_field,
config_options: Arc::new(ConfigOptions::default()),
})?;
dbg!(result);
Ok(())
} Output:
So But if we modify the haystack slightly: #[test]
fn test_array_has2() -> Result<(), DataFusionError> {
let haystack_field = Arc::new(Field::new_list(
"haystack",
Field::new_list("", Field::new("", DataType::Int32, true), true),
true,
));
let needle_field = Arc::new(Field::new("needle", DataType::Int32, true));
let return_field = Arc::new(Field::new_list(
"return",
Field::new("", DataType::Boolean, true),
true,
));
let haystack = ListArray::new(
Field::new_list_field(DataType::Int32, true).into(),
OffsetBuffer::new(vec![0, 0].into()), // HERE - zero length
Arc::new(Int32Array::from(Vec::<i32>::new())) as ArrayRef, // HERE - child array empty
Some(vec![false].into()),
);
dbg!(&haystack);
let haystack = ColumnarValue::Array(Arc::new(haystack));
let needle = ColumnarValue::Scalar(ScalarValue::Int32(Some(1)));
let result = ArrayHas::new().invoke_with_args(ScalarFunctionArgs {
args: vec![haystack, needle],
arg_fields: vec![haystack_field, needle_field],
number_rows: 1,
return_field,
config_options: Arc::new(ConfigOptions::default()),
})?;
dbg!(result);
Ok(())
} Then the output becomes:
Can see the haystack is equivalent in value, but now the output changes to As for reproducing in SQL, it's possible some other optimization rule is kicking in which further complicates trying to keep things consistent; I'll try look into it more 🤔 Edit: ah might have confused myself. So it's more equivalent to something like this in SQL: select array_has(arrow_cast(null, 'List(Int32)'), 1); That is, we pass a null list itself not a list containing a null. |
Thanks @Jefffrey should we put this PR to draft for now? |
Which issue does this PR close?
array_has
returnsfalse
instead ofnull
for[null]
list #17817Rationale for this change
If input of
[null]
is passed as haystack thenarray_has
should returnnull
; there was edge case where if haystack was[null]
but had an empty child array it would return[false]
.What changes are included in this PR?
Look at haystack length not child length for the early exit case.
Are these changes tested?
Yes added unit test (easier to replicate in rust test than SLT).
Are there any user-facing changes?
No.