awkward-array doesn't work with pyarrow 19.0.0
#3402
Comments
Can confirm the same error on openSUSE with apache-arrow 19.0.1. Arrow references: apache/arrow#41667, apache/arrow#44921 |
@martindurant - I need some help here. The PR #3416 fixes the issue by removing the declaration that the columns are
def parquet_round_trip(
akarray, extensionarray, tmp_path, categorical_as_dictionary=False
):
filename = os.path.join(tmp_path, "whatever.parquet")
ak.to_parquet(
akarray,
filename,
extensionarray=extensionarray,
categorical_as_dictionary=categorical_as_dictionary,
)
akarray2 = ak.from_parquet(filename)
assert to_list(akarray2) == to_list(akarray)
str_type2 = io.StringIO()
str_type = io.StringIO()
if extensionarray:
akarray2.type.show(stream=str_type2)
akarray.type.show(stream=str_type)
> assert str_type.getvalue() == str_type2.getvalue()
E assert '3 * tuple[[\n float64[parameters={"which": "inner1"}],\n var * float64[parameters={"which": "inner2"}]\n], parameters={"which": "outer"}]\n' == '3 * tuple[[\n ?float64[parameters={"which": "inner1"}],\n option[var * float64[parameters={"which": "inner2"}]]\n], parameters={"which": "outer"}]\n'
E
E 3 * tuple[[
E - ?float64[parameters={"which": "inner1"}],
E ? -
E + float64[parameters={"which": "inner1"}],
E - option[var * float64[parameters={"which": "inner2"}]]
E ? ------- -
E + var * float64[parameters={"which": "inner2"}]
E ], parameters={"which": "outer"}]
akarray = <Array [(1.1, [1.1, ...]), ..., (3.3, ...)] type='3 * tuple[[float64[parame...'>
akarray2 = <Array [(1.1, [1.1, ...]), ..., (3.3, ...)] type='3 * tuple[[?float64[param...'>
categorical_as_dictionary = False
extensionarray = True
filename = '/tmp/pytest-of-runner/pytest-0/test_recordarray_True_True_0/whatever.parquet'
str_type = <_io.StringIO object at 0x7fcd31cdef70>
str_type2 = <_io.StringIO object at 0x7fcd388efd30>
tmp_path = PosixPath('/tmp/pytest-of-runner/pytest-0/test_recordarray_True_True_0')
tests/test_1440_start_v2_to_parquet.py:37: AssertionError
Here is a small reproducer:
import awkward as ak
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pyarrow_parquet
is_tuple = False
extensionarray = True
akarray = ak.contents.IndexedOptionArray(
ak.index.Index64(np.array([2, 0, -1, 0, 1], dtype=np.int64)),
ak.contents.RecordArray(
[
ak.contents.NumpyArray(
np.array([1.1, 2.2, 3.3]), parameters={"which": "inner1"}
),
ak.contents.ListOffsetArray(
ak.index.Index32(np.array([0, 3, 3, 5], dtype=np.int32)),
ak.contents.NumpyArray(
np.array([1.1, 2.2, 3.3, 4.4, 5.5]),
parameters={"which": "inner2"},
),
),
],
None if is_tuple else ["x", "y"],
parameters={"which": "outer"},
),
)
paarray = akarray.to_arrow(extensionarray=extensionarray)
pa.parquet.write_table(pa.table({"": paarray}), "dummy.filename") |
Quick question: is this an issue only when using extensionarray? I ran the code snippet and got no error with pyarrow 19.0.1 (released Feb 18). |
Oh, sorry, I missed the line: pa.parquet.write_table(pa.table({"": paarray}), "dummy.filename") |
OK, I can confirm it, and indeed the value of extensionarray doesn't matter. How odd that making the arrow object is fine, but it errors when converting to parquet. |
IMHO, the pyarrow array is correct: it has a validity mask, but its children do not have one:
>>> paarray
<awkward._connect.pyarrow.extn_types.AwkwardArrowArray object at 0x147969160>
-- is_valid:
[
true,
true,
false,
true,
true
]
-- child 0 type: extension<awkward<AwkwardArrowType>>
[
3.3,
1.1,
null,
1.1,
2.2
]
-- child 1 type: extension<awkward<AwkwardArrowType>>
[
[
4.4,
5.5
],
[
1.1,
2.2,
3.3
],
null,
[
1.1,
2.2,
3.3
],
[]
]
>>> paarray[0]
<pyarrow.ExtensionScalar: {'x': 3.3, 'y': [4.4, 5.5]}>
>>> paarray.type
awkward<StructType(struct<x: extension<awkward<AwkwardArrowType>> not null, y: extension<awkward<AwkwardArrowType>> not null>)> |
I don't think so. Removing the inners:
but the struct can be null. |
Also, I'm pretty sure that arrow just defaults to nullable in ALL relevant cases, even when the mask attribute is None (num_nulls=0). Maybe that's what we need to do. If that's what _arrow_needs_option_type() does, then #3416 is correct. |
Yes, it does allow writing it, but the round-trip has an issue now: the types of all children become optional.
x_content = ak.highlevel.Array([1.1, 2.2, 3.3, 4.4, 5.5]).layout
z_content = ak.highlevel.Array([1, 2, 3, None, 5]).layout
original = ak.contents.RecordArray(
[
x_content,
ak.contents.UnmaskedArray(x_content),
z_content,
],
["x", "y", "z"],
)
pa_array = original.to_arrow()
reconstituted = ak.from_arrow(pa_array, highlevel=False)
original = ak.contents.ByteMaskedArray(
ak.index.Index8(np.array([False, True, False, False, False], np.int8)),
original,
valid_when=False,
)
pa_array = original.to_arrow()
reconstituted = ak.from_arrow(pa_array, highlevel=False)
gives different types:
>>> reconstituted.form.type
OptionType(RecordType([OptionType(NumpyType('float64')), OptionType(NumpyType('float64')), OptionType(NumpyType('int64'))], ['x', 'y', 'z']))
>>> original.form.type
OptionType(RecordType([NumpyType('float64'), OptionType(NumpyType('float64')), OptionType(NumpyType('int64'))], ['x', 'y', 'z'])) |
That's OK for extensionarray=False. I'm not sure how reconstituting the ak array happens with the extension. |
Version of Awkward Array
2.7.4
Description and code to reproduce
awkward-array doesn't work with pyarrow 19.0.0; however, it does work with 18 and earlier. Running the test suite with version 19 yields:
We should make sure that awkward's dependencies are pinned correctly, and also update the pyarrow version dependency for the test suite here: https://github.com/scikit-hep/awkward/blob/main/requirements-test-full.txt#L6