GH-32609: [Python] Add type annotations to PyArrow #47609
Conversation
Hey @rok, I come bearing unsolicited suggestions 😉
A lot of this comes from two recent PRs that have had me battling the current stubs.
python/pyarrow-stubs/compute.pyi:

```python
def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...
def scalar(value: bool | float | str) -> Expression: ...
```
Based on `arrow/python/pyarrow/_compute.pyx` (lines 2859 to 2869 at 13c2615):

```python
@staticmethod
def _scalar(value):
    cdef:
        Scalar scalar

    if isinstance(value, Scalar):
        scalar = value
    else:
        scalar = lib.scalar(value)

    return Expression.wrap(CMakeScalarExpression(scalar.unwrap()))
```
The Expression version (`pc.scalar`) should accept the same types as `pa.scalar`, right?
I ran into this the other day and needed to add a cast.
I'm not sure what you're suggesting. Do you mean:

```diff
diff --git i/python/pyarrow-stubs/compute.pyi w/python/pyarrow-stubs/compute.pyi
index df660e0c0c..f005c5f552 100644
--- i/python/pyarrow-stubs/compute.pyi
+++ w/python/pyarrow-stubs/compute.pyi
@@ -84,7 +84,7 @@ _R = TypeVar("_R")
 def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...
-def scalar(value: bool | float | str) -> Expression: ...
+def scalar(value: Any) -> Expression: ...
```
Hmm, yeah, I guess `Any` is what you have there, so that could work.
But I think it would be more helpful to start with something like this:
https://github.com/rok/arrow/blob/6a310149ed305d7e2606066f5d0915e9c23310f4/python/pyarrow-stubs/_stubs_typing.pyi#L50
```python
PyScalar: TypeAlias = (bool | int | float | Decimal | str | bytes |
                       dt.date | dt.datetime | dt.time | dt.timedelta)
```

Then the snippet from (#47609 (comment)) seems to imply `pa.Scalar` is valid as well. So maybe this would document it more clearly?

```python
def scalar(value: PyScalar | lib.Scalar[Any] | None) -> Expression: ...
```
python/pyarrow-stubs/compute.pyi:

```python
def name(self) -> str: ...
@property
def num_kernels(self) -> int: ...
```
I wonder if the overloads could be generated instead of being written out and maintained manually.
It took me a while to discover this without it being in the stubs 😅
```python
@property
def kernels(self) -> list[ScalarKernel | VectorKernel | ScalarAggregateKernel | HashAggregateKernel]: ...
```
I know this isn't accurate for `Function` itself, but it's the type returned by `FunctionRegistry.get_function`.
If you wanted to be a bit fancier, maybe add some Generics into the mix?
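A hedged sketch of what that could look like, parameterizing `Function` over its kernel type. All names here are illustrative stand-ins, not the PR's actual stubs:

```python
from typing import Generic, TypeVar

# Illustrative stand-ins; the real kernel classes live in pyarrow's
# compute module and there are four of them.
class ScalarKernel: ...
class VectorKernel: ...

_K = TypeVar("_K")

class Function(Generic[_K]):
    # Each concrete function class pins down its kernel type, so
    # `get_function(...).kernels` could be precisely typed.
    @property
    def kernels(self) -> list[_K]:
        return []  # stub body for this sketch only

class ScalarFunction(Function[ScalarKernel]): ...
class VectorFunction(Function[VectorKernel]): ...
```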
> look at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time).

That would probably be more useful than the route I was going for here.
In Python there's only the repr to work with, but there is quite a lot of information encoded in it:
```python
>>> import pyarrow.compute as pc
>>> pc.get_function("array_take").kernels[:10]
[VectorKernel<(primitive, integer) -> computed>,
 VectorKernel<(binary-like, integer) -> computed>,
 VectorKernel<(large-binary-like, integer) -> computed>,
 VectorKernel<(fixed-size-binary-like, integer) -> computed>,
 VectorKernel<(null, integer) -> computed>,
 VectorKernel<(Type::DICTIONARY, integer) -> computed>,
 VectorKernel<(Type::EXTENSION, integer) -> computed>,
 VectorKernel<(Type::LIST, integer) -> computed>,
 VectorKernel<(Type::LARGE_LIST, integer) -> computed>,
 VectorKernel<(Type::LIST_VIEW, integer) -> computed>]
>>> pc.get_function("min_element_wise").kernels[:10]
[ScalarKernel<varargs[uint8*] -> uint8>,
 ScalarKernel<varargs[uint16*] -> uint16>,
 ScalarKernel<varargs[uint32*] -> uint32>,
 ScalarKernel<varargs[uint64*] -> uint64>,
 ScalarKernel<varargs[int8*] -> int8>,
 ScalarKernel<varargs[int16*] -> int16>,
 ScalarKernel<varargs[int32*] -> int32>,
 ScalarKernel<varargs[int64*] -> int64>,
 ScalarKernel<varargs[float*] -> float>,
 ScalarKernel<varargs[double*] -> double>]
>>> pc.get_function("approximate_median").kernels
[ScalarAggregateKernel<(any) -> double>]
```
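Since those reprs are fairly regular, a rough extraction pass could parse them. A sketch under the assumption that the format stays `Kind<(inputs) -> output>`; `parse_kernel` is a hypothetical helper, not an Arrow API:

```python
import re

# Hypothetical: extract kernel signatures from the repr strings shown above.
# Handles both parenthesized input lists and bare varargs forms.
KERNEL_RE = re.compile(
    r"^(?P<kind>\w+)<\(?(?P<inputs>[^)>]*?)\)? -> (?P<output>[^>]+)>$"
)

def parse_kernel(text: str) -> dict:
    m = KERNEL_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognized kernel repr: {text!r}")
    inputs = [p.strip() for p in m["inputs"].split(",")] if m["inputs"] else []
    return {"kind": m["kind"], "inputs": inputs, "output": m["output"]}
```

Something like this could feed a generator for the overloads, though extracting the signatures at C++ registration time would be more robust than scraping reprs.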
Oh awesome! Thank you @dangotbanned, I love unsolicited suggestions like these! I'm at PyData Paris right now, so I probably can't reply properly until Monday, but given your experience I'm sure these will be very useful!

Just a mental note: @pitrou suggested looking at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time).
@rok first, thanks for the huge amount of work here!! You rock! (see what I did there?)
I've just taken a look at the build/CI part of it.
As suggested yesterday, I think we should add documentation to the Python docs development guide covering both how to run type checking and what is expected from PyArrow developers.
What is the expected workflow when working on it?
```yaml
- name: Type check with mypy and pyright
  run: |-
    python -m pip install mypy pyright ty griffe libcst pytest hypothesis fsspec scipy-stubs pandas-stubs types-python-dateutil types-psutil types-requests griffe libcst sphinx types-cffi
    pip install -i https://pypi.anaconda.org/scientific-python-nightly-wheels/simple pyarrow
```
This currently installs pyarrow from the nightlies; as part of CI we should test the pyarrow version from the PR / commit / branch.
On this job, pyarrow is built via archery in a Docker container, so it won't be available outside the container. We might want to add a new job just for type checking, where the Docker container also runs the type checks, or add it to this job's archery docker run. We can probably drive the check with an environment variable and run it directly in ci/scripts/python_test.sh.
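If we went the environment-variable route, the gating in ci/scripts/python_test.sh could look roughly like this. A sketch only; the variable name `PYARROW_TYPE_CHECK` is an assumption, not existing CI config:

```shell
# Hypothetical gating for ci/scripts/python_test.sh. PYARROW_TYPE_CHECK
# is an assumed variable name, not part of the current CI setup.
should_type_check() {
  [ "${PYARROW_TYPE_CHECK:-0}" = "1" ]
}

if should_type_check; then
  echo "type checking enabled"
  # pyright && ty check   # the actual checkers would run here
else
  echo "type checking skipped"
fi
```

The Docker job could then export the variable only in the dedicated type-checking run, leaving the existing test matrix unchanged.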
```shell
pyright
ty check
cd ..
python ./dev/update_stub_docstrings.py -f ./python/pyarrow-stubs
```
Is `update_stub_docstrings.py` something:
- devs should run before pushing or via commit hook
- CI should run on every PR
- something we should run when building sdist/wheels
- part of the release process
It's not clear to me.
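For context on what such a script does, the core transformation can be sketched roughly like this. A hedged sketch only; the real `dev/update_stub_docstrings.py` presumably relies on the griffe/libcst dependencies installed in CI and is more involved, and `splice_docstring` is a hypothetical helper:

```python
import textwrap

# Hypothetical helper, not the actual dev/update_stub_docstrings.py logic:
# splice a runtime docstring into a stub signature line at build time.
def splice_docstring(stub_line: str, doc: str) -> str:
    # Turn "def f(...) -> T: ..." into "def f(...) -> T:" plus an
    # indented body containing the docstring followed by "...".
    head = stub_line.rstrip().removesuffix("...").rstrip()
    body = textwrap.indent(f'"""{doc}"""\n...', "    ")
    return f"{head}\n{body}"
```

At wheel-build time the script would apply something like this to every signature in the stub files, pulling each `doc` from the built pyarrow module, which is why the question of when it runs matters.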
This proposes adding type annotations to PyArrow by adopting pyarrow-stubs into pyarrow. To do so we copy pyarrow-stubs's stub files into `arrow/python/pyarrow-stubs/`, restructure them somewhat, and add more annotations. We remove docstrings from the annotations and provide a script that includes docstrings in the stub files at wheel-build time. We also remove overloads from the annotations to simplify this PR. We then add annotation checks for all project files and introduce a CI check to make sure all `mypy`, `pyright` and `ty` annotation checks pass (see `python/pyproject.toml` for any exceptions).

The PR introduces:
- `arrow/python/pyarrow-stubs/`