Skip to content

Conversation

Jefffrey
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

If input of [null] is passed as haystack then array_has should return null; there was edge case where if haystack was [null] but had an empty child array it would return [false].

What changes are included in this PR?

Look at haystack length not child length for the early exit case.

Are these changes tested?

Yes added unit test (easier to replicate in rust test than SLT).

Are there any user-facing changes?

No.

@Jefffrey Jefffrey marked this pull request as ready for review September 29, 2025 08:07
@alamb alamb added the api change Changes the API exposed to users of the crate label Sep 29, 2025
@alamb
Copy link
Contributor

alamb commented Sep 29, 2025

I tried this in SQL on main (without this PR) and I couldn't make a reproducer

The current answer matches duckdb

> select array_has([null], 'foo');
+-----------------------------------------+
| array_has(make_array(NULL),Utf8("foo")) |
+-----------------------------------------+
| false                                   |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> select array_has([null], null);
+----------------------------------+
| array_has(make_array(NULL),NULL) |
+----------------------------------+
| NULL                             |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

And duckdb

andrewlamb@Andrews-MacBook-Pro-3 ~ % duckdb
DuckDB v1.4.0 (Andium) b8a06e4a22
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select array_has([null], 'foo');
┌─────────────────────────────────────────┐
│ array_has(main.list_value(NULL), 'foo') │
│                 boolean                 │
├─────────────────────────────────────────┤
│ false                                   │
└─────────────────────────────────────────┘
D select array_has([null], null);
┌────────────────────────────────────────┐
│ array_has(main.list_value(NULL), NULL) │
│                boolean                 │
├────────────────────────────────────────┤
│ NULL                                   │
└────────────────────────────────────────┘

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Sep 30, 2025

Hmm maybe I confused myself on the expected behaviour. Running these tests on main (without this fix):

#[test]
fn test_array_has() -> Result<(), DataFusionError> {
    let haystack_field = Arc::new(Field::new_list(
        "haystack",
        Field::new_list("", Field::new("", DataType::Int32, true), true),
        true,
    ));
    let needle_field = Arc::new(Field::new("needle", DataType::Int32, true));
    let return_field = Arc::new(Field::new_list(
        "return",
        Field::new("", DataType::Boolean, true),
        true,
    ));

    let haystack = ListArray::new(
        Field::new_list_field(DataType::Int32, true).into(),
        OffsetBuffer::new(vec![0, 1].into()),
        Arc::new(Int32Array::from(vec![1])) as ArrayRef,
        Some(vec![false].into()),
    );

    dbg!(&haystack);

    let haystack = ColumnarValue::Array(Arc::new(haystack));
    let needle = ColumnarValue::Scalar(ScalarValue::Int32(Some(1)));

    let result = ArrayHas::new().invoke_with_args(ScalarFunctionArgs {
        args: vec![haystack, needle],
        arg_fields: vec![haystack_field, needle_field],
        number_rows: 1,
        return_field,
        config_options: Arc::new(ConfigOptions::default()),
    })?;

    dbg!(result);

    Ok(())
}

Output:

[datafusion/functions-nested/src/array_has.rs:771:9] &haystack = ListArray
[
  null,
]
[datafusion/functions-nested/src/array_has.rs:784:9] result = Array(
    BooleanArray
    [
      null,
    ],
)
test array_has::tests::test_array_has ... ok

So [null] haystack returns null output.

But if we modify the haystack slightly:

#[test]
fn test_array_has2() -> Result<(), DataFusionError> {
    let haystack_field = Arc::new(Field::new_list(
        "haystack",
        Field::new_list("", Field::new("", DataType::Int32, true), true),
        true,
    ));
    let needle_field = Arc::new(Field::new("needle", DataType::Int32, true));
    let return_field = Arc::new(Field::new_list(
        "return",
        Field::new("", DataType::Boolean, true),
        true,
    ));

    let haystack = ListArray::new(
        Field::new_list_field(DataType::Int32, true).into(),
        OffsetBuffer::new(vec![0, 0].into()), // HERE - zero length
        Arc::new(Int32Array::from(Vec::<i32>::new())) as ArrayRef, // HERE - child array empty
        Some(vec![false].into()),
    );

    dbg!(&haystack);

    let haystack = ColumnarValue::Array(Arc::new(haystack));
    let needle = ColumnarValue::Scalar(ScalarValue::Int32(Some(1)));

    let result = ArrayHas::new().invoke_with_args(ScalarFunctionArgs {
        args: vec![haystack, needle],
        arg_fields: vec![haystack_field, needle_field],
        number_rows: 1,
        return_field,
        config_options: Arc::new(ConfigOptions::default()),
    })?;

    dbg!(result);

    Ok(())
}

Then the output becomes:

[datafusion/functions-nested/src/array_has.rs:810:9] &haystack = ListArray
[
  null,
]
[datafusion/functions-nested/src/array_has.rs:823:9] result = Array(
    BooleanArray
    [
      false,
    ],
)
test array_has::tests::test_array_has2 ... ok

Can see the haystack is equivalent in value, but now the output changes to [false]. This discrepancy is what the PR is trying to address but I might have to think more about what the correct behaviour for both cases is (this PR assumes the output should be [null] but it seems that might not be the case).

As for reproducing in SQL, it's possible some other optimization rule is kicking in which further complicates trying to keep things consistent; I'll try look into it more 🤔

Edit: ah might have confused myself. So it's more equivalent to something like this in SQL:

select array_has(arrow_cast(null, 'List(Int32)'), 1);

That is, we pass a null list itself not a list containing a null.

@comphead
Copy link
Contributor

Thanks @Jefffrey should we put this PR to draft for now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

array_has returns false instead of null for [null] list
3 participants