-
-
Notifications
You must be signed in to change notification settings - Fork 404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Support pandas ExtensionArray ordering #6481
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6481 +/- ##
==========================================
- Coverage 88.76% 88.76% -0.01%
==========================================
Files 323 323
Lines 68681 68698 +17
==========================================
+ Hits 60963 60978 +15
- Misses 7718 7720 +2 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left one comment, but this PR looks good otherwise.
if TYPE_CHECKING: | ||
from typing import TypeVar | ||
|
||
Array = TypeVar("Array", np.ndarray, pd.api.extensions.ExtensionArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type-annotation does not seem right to me. The function where it is used can take more than these two, e.g., pd.Series
or a cupy.Array
. I think we should use something like np.typing.ArrayLike
, but I haven't done any testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dataset.dimension_values()
is documented as:
Returns:
NumPy array of values along the requested dimension
This documentation isn’t quite correct, e.g. my code relies on it returning a pd.ExtensionArray
, but a lot of your code relies on dimension_values
’s output e.g.
- having a
.dtype
attribute - having the methods
.astype
,.min
,.max
, … - being iterable
- being sliceable
I’m pretty sure it can’t be a pd.Series
, for these your code passes on the inner np.ndarray | pd.ExtensionArray
, that’s what I’m relying on here!
np.typing.ArrayLike
also includes types like list
or int
, which I’m mostly sure isn’t correct. I mean, some parts of your code base check isinstance(list, vals)
, but others access the methods above without checking.
So can we figure out which types you actually support and actually fix the typing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the original issue, much of the code was written before the invention of typing, so it is a huge lift.
Also worth mentioning is that when written, most of the underlying data structure was numpy. This is no longer the case, as it can be something else, like here with pd.ExtensionArray
.
I tried to see if I could get out some other types than the array type and couldn't.
Your changes LGTM. Thank you for the PR!
Fixes #6452