awkward-array doesn't work with pyarrow 19.0.0 #3402

Open
pfackeldey opened this issue Feb 20, 2025 · 10 comments

@pfackeldey
Collaborator

Version of Awkward Array

2.7.4

Description and code to reproduce

awkward-array doesn't work with pyarrow 19.0.0, although it does work with 18 and earlier.

Running the test suite with version 19 yields:

FAILED tests/test_1125_to_arrow_from_arrow.py::test_recordarray[False-False] - pyarrow.lib.ArrowInvalid: Column 'x' is declared non-nullable but contains nulls
FAILED tests/test_1125_to_arrow_from_arrow.py::test_recordarray[True-False] - pyarrow.lib.ArrowInvalid: Column 'x' is declared non-nullable but contains nulls
FAILED tests/test_1125_to_arrow_from_arrow.py::test_recordarray[True-True] - pyarrow.lib.ArrowInvalid: Column '0' is declared non-nullable but contains nulls
FAILED tests/test_1294_to_and_from_parquet.py::test_recordarray[False-through_parquet-True] - pyarrow.lib.ArrowInvalid: Column '0' is declared non-nullable but contains nulls
FAILED tests/test_1294_to_and_from_parquet.py::test_recordarray[True-through_parquet-True] - pyarrow.lib.ArrowInvalid: Column '0' is declared non-nullable but contains nulls
FAILED tests/test_1440_start_v2_to_parquet.py::test_recordarray[True-True] - RuntimeError: file metadata is only available after writer close

We should make sure that awkward's dependencies are pinned correctly, and also update the pyarrow version dependency for the test-suite here: https://github.com/scikit-hep/awkward/blob/main/requirements-test-full.txt#L6
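
Until the pin is settled, a guard along these lines could flag an untested pyarrow at import or test time. This is only a sketch: the helper name `pyarrow_is_known_good` and the cutoff of 18 are assumptions taken from the statement above that 18 and earlier work, not an agreed policy.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption from this issue: pyarrow 18 and earlier pass, 19 fails.
LAST_KNOWN_GOOD_MAJOR = 18


def pyarrow_is_known_good(version_string):
    """Return True if the major version is at or below the last known-good one."""
    major = int(version_string.split(".")[0])
    return major <= LAST_KNOWN_GOOD_MAJOR


try:
    installed = version("pyarrow")  # version string of the installed distribution
except PackageNotFoundError:
    installed = None  # pyarrow is not installed at all

if installed is not None and not pyarrow_is_known_good(installed):
    print(f"pyarrow {installed} is newer than the last version known to work")
```

A real pin would live in the packaging metadata and in requirements-test-full.txt rather than in a runtime check.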

@pfackeldey added the "bug (unverified)" label (the problem described would be a bug, but needs to be triaged) on Feb 20, 2025
@bnavigator
Contributor

Can confirm the same error on openSUSE with apache-arrow 19.0.1

Arrow reference: apache/arrow#41667 apache/arrow#44921

@ianna added the "bug" label (the problem described is something that must be fixed) and removed the "bug (unverified)" label on Mar 11, 2025
@ianna
Collaborator

ianna commented Mar 20, 2025

@martindurant - I need some help here.

PR #3416 fixes the issue by no longer declaring columns that contain nulls as non-nullable. However, the column type changes as well. I'm not sure whether that is what we want and, if not, how to avoid it. Thanks!

    def parquet_round_trip(
        akarray, extensionarray, tmp_path, categorical_as_dictionary=False
    ):
        filename = os.path.join(tmp_path, "whatever.parquet")
        ak.to_parquet(
            akarray,
            filename,
            extensionarray=extensionarray,
            categorical_as_dictionary=categorical_as_dictionary,
        )
        akarray2 = ak.from_parquet(filename)
    
        assert to_list(akarray2) == to_list(akarray)
        str_type2 = io.StringIO()
        str_type = io.StringIO()
        if extensionarray:
            akarray2.type.show(stream=str_type2)
            akarray.type.show(stream=str_type)
>           assert str_type.getvalue() == str_type2.getvalue()
E           assert '3 * tuple[[\n    float64[parameters={"which": "inner1"}],\n    var * float64[parameters={"which": "inner2"}]\n], parameters={"which": "outer"}]\n' == '3 * tuple[[\n    ?float64[parameters={"which": "inner1"}],\n    option[var * float64[parameters={"which": "inner2"}]]\n], parameters={"which": "outer"}]\n'
E             
E               3 * tuple[[
E             -     ?float64[parameters={"which": "inner1"}],
E             ?     -
E             +     float64[parameters={"which": "inner1"}],
E             -     option[var * float64[parameters={"which": "inner2"}]]
E             ?     -------                                             -
E             +     var * float64[parameters={"which": "inner2"}]
E               ], parameters={"which": "outer"}]

akarray    = <Array [(1.1, [1.1, ...]), ..., (3.3, ...)] type='3 * tuple[[float64[parame...'>
akarray2   = <Array [(1.1, [1.1, ...]), ..., (3.3, ...)] type='3 * tuple[[?float64[param...'>
categorical_as_dictionary = False
extensionarray = True
filename   = '/tmp/pytest-of-runner/pytest-0/test_recordarray_True_True_0/whatever.parquet'
str_type   = <_io.StringIO object at 0x7fcd31cdef70>
str_type2  = <_io.StringIO object at 0x7fcd388efd30>
tmp_path   = PosixPath('/tmp/pytest-of-runner/pytest-0/test_recordarray_True_True_0')

tests/test_1440_start_v2_to_parquet.py:37: AssertionError

Here is a small reproducer:

import awkward as ak
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pyarrow_parquet

is_tuple = False
extensionarray = True

akarray = ak.contents.IndexedOptionArray(
    ak.index.Index64(np.array([2, 0, -1, 0, 1], dtype=np.int64)),
    ak.contents.RecordArray(
        [
            ak.contents.NumpyArray(
                np.array([1.1, 2.2, 3.3]), parameters={"which": "inner1"}
            ),
            ak.contents.ListOffsetArray(
                ak.index.Index32(np.array([0, 3, 3, 5], dtype=np.int32)),
                ak.contents.NumpyArray(
                    np.array([1.1, 2.2, 3.3, 4.4, 5.5]),
                    parameters={"which": "inner2"},
                ),
            ),
        ],
        None if is_tuple else ["x", "y"],
        parameters={"which": "outer"},
    ),
)

paarray = akarray.to_arrow(extensionarray=extensionarray)

pa.parquet.write_table(pa.table({"": paarray}), "dummy.filename")

@martindurant
Contributor

Quick question: is this an issue only when using extensionarray?

I ran the code snippet and got no error with pyarrow 19.0.1 (released Feb 18).

@ianna
Collaborator

ianna commented Mar 20, 2025

> Quick question: is this an issue only when using extensionarray?
>
> I ran the code snippet and got no error with pyarrow 19.0.1 (released Feb 18).

Oh, sorry, I missed the line:

pa.parquet.write_table(pa.table({"": paarray}), "dummy.filename")

@martindurant
Contributor

OK, I can confirm it, and indeed the value of extensionarray doesn't matter. How odd that making the arrow object is fine, but it errors when converting to parquet.

@ianna
Collaborator

ianna commented Mar 20, 2025

> OK, I can confirm it, and indeed the value of extensionarray doesn't matter. How odd that making the arrow object is fine, but it errors when converting to parquet.

IMHO, the pyarrow array is correct: it has a validity mask, but its children do not have nulls, so their not null type is correct.

>>> paarray
<awkward._connect.pyarrow.extn_types.AwkwardArrowArray object at 0x147969160>
-- is_valid:
  [
    true,
    true,
    false,
    true,
    true
  ]
-- child 0 type: extension<awkward<AwkwardArrowType>>
  [
    3.3,
    1.1,
    null,
    1.1,
    2.2
  ]
-- child 1 type: extension<awkward<AwkwardArrowType>>
  [
    [
      4.4,
      5.5
    ],
    [
      1.1,
      2.2,
      3.3
    ],
    null,
    [
      1.1,
      2.2,
      3.3
    ],
    []
  ]
>>> paarray[0]
<pyarrow.ExtensionScalar: {'x': 3.3, 'y': [4.4, 5.5]}>
>>> paarray.type
awkward<StructType(struct<x: extension<awkward<AwkwardArrowType>> not null, y: extension<awkward<AwkwardArrowType>> not null>)>

@martindurant
Contributor

> IMHO, the pyarrow array is correct

I don't think so. Removing the inners:

StructType(struct<...> not null>)

but the struct can be null.
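
The mismatch can be illustrated with a toy model in which plain Python lists stand in for Arrow buffers. It is not pyarrow's implementation, and the helper names `flatten_struct` and `check_columns` are hypothetical. The assumption is that writing a struct as separate columns pushes the parent's validity mask into each child, at which point a child declared "not null" is seen to hold a null.

```python
# Toy model only: plain lists stand in for Arrow buffers (not pyarrow code).

def flatten_struct(is_valid, children):
    """Push the parent's validity mask down into every child column."""
    return {
        name: [value if ok else None for value, ok in zip(values, is_valid)]
        for name, values in children.items()
    }


def check_columns(columns, nullable):
    """Reject any column declared non-nullable that still holds a null."""
    for name, values in columns.items():
        if not nullable[name] and any(v is None for v in values):
            raise ValueError(
                f"Column '{name}' is declared non-nullable but contains nulls"
            )


def outcome(columns, nullable):
    """Return "ok" or the error message produced by check_columns."""
    try:
        check_columns(columns, nullable)
        return "ok"
    except ValueError as err:
        return str(err)


# Same shape as the reproducer: the struct at index 2 is null.
is_valid = [True, True, False, True, True]
children = {
    "x": [3.3, 1.1, 0.0, 1.1, 2.2],  # 0.0 is a placeholder under the masked slot
    "y": [[4.4, 5.5], [1.1, 2.2, 3.3], [], [1.1, 2.2, 3.3], []],
}
columns = flatten_struct(is_valid, children)

strict = outcome(columns, {"x": False, "y": False})   # children declared "not null"
relaxed = outcome(columns, {"x": True, "y": True})    # children declared nullable
print(strict)
print(relaxed)
```

In this model the children themselves hold no nulls, yet the flattened columns do, which is why a "not null" declaration on the fields of a nullable struct can be rejected.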

@martindurant
Contributor

Also, I'm pretty sure that arrow just defaults to nullable in ALL relevant cases, even when the mask attribute is None (num_nulls=0). Maybe that's what we need to do. If that's what _arrow_needs_option_type() does, then #3416 is correct.

@ianna
Collaborator

ianna commented Mar 20, 2025

> Also, I'm pretty sure that arrow just defaults to nullable in ALL relevant cases, even when the mask attribute is None (num_nulls=0). Maybe that's what we need to do. If that's what _arrow_needs_option_type() does, then #3416 is correct.

Yes, it does allow writing it, but the round-trip has an issue now: the types of all children become optional.

import awkward as ak
import numpy as np

x_content = ak.highlevel.Array([1.1, 2.2, 3.3, 4.4, 5.5]).layout
z_content = ak.highlevel.Array([1, 2, 3, None, 5]).layout

# Record with a non-optional field (x), an unmasked option field (y),
# and an option field that really contains a null (z).
original = ak.contents.RecordArray(
    [
        x_content,
        ak.contents.UnmaskedArray(x_content),
        z_content,
    ],
    ["x", "y", "z"],
)
pa_array = original.to_arrow()
reconstituted = ak.from_arrow(pa_array, highlevel=False)

# Wrap the record in an option type; with valid_when=False, only index 1 is masked out.
original = ak.contents.ByteMaskedArray(
    ak.index.Index8(np.array([False, True, False, False, False], np.int8)),
    original,
    valid_when=False,
)
pa_array = original.to_arrow()
reconstituted = ak.from_arrow(pa_array, highlevel=False)

gives different types:

>>> reconstituted.form.type
OptionType(RecordType([OptionType(NumpyType('float64')), OptionType(NumpyType('float64')), OptionType(NumpyType('int64'))], ['x', 'y', 'z']))
>>> original.form.type
OptionType(RecordType([NumpyType('float64'), OptionType(NumpyType('float64')), OptionType(NumpyType('int64'))], ['x', 'y', 'z']))
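
The difference between the two types can be reproduced with a small model in which nested tuples stand in for Awkward types. This is a sketch, not Awkward's code, and the helper `declare_all_fields_nullable` is hypothetical. It assumes that declaring every record field nullable amounts to wrapping each field in an option unless it already is one.

```python
# Toy type model: tuples stand in for Awkward types (not Awkward's own code).
F64 = ("float64",)
I64 = ("int64",)


def option(t):
    """Wrap a type in an option unless it is already one."""
    return t if t[0] == "option" else ("option", t)


def record(fields):
    """Build a record type from a name -> type mapping."""
    return ("record", tuple(fields.items()))


def declare_all_fields_nullable(t):
    """Model the effect of declaring every record field nullable."""
    kind = t[0]
    if kind == "option":
        return ("option", declare_all_fields_nullable(t[1]))
    if kind == "record":
        return (
            "record",
            tuple(
                (name, option(declare_all_fields_nullable(field_type)))
                for name, field_type in t[1]
            ),
        )
    return t  # primitive types are left unchanged


original_type = option(record({"x": F64, "y": option(F64), "z": option(I64)}))
reconstituted_type = option(
    record({"x": option(F64), "y": option(F64), "z": option(I64)})
)

print(original_type == reconstituted_type)
print(declare_all_fields_nullable(original_type) == reconstituted_type)
```

In this model the fields x and y end up with the same type, so the original cannot be recovered from the nullability flags alone; any round trip that preserves it would need the extra information to be carried some other way.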

@martindurant
Contributor

> the types of all children become optional

That's OK for extensionarray=False. I'm not sure how reconstituting the ak array happens with the extension.
