Skip to content

Conversation

@koskampt
Copy link

@koskampt koskampt commented Nov 23, 2025

…62501)

@koskampt koskampt requested a review from rhshadrach as a code owner November 23, 2025 15:13
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Please always add tests. Does this also handle the tuple case on L667?

for name in names
)

elif any(isna(k) for k in self.indices.keys()):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is expensive - this function is only ever called currently with names a list of length 1, and the rest of the method is O(1) in terms of self.indices. It's called from the inner loop of DataFramGroupBy.fitler as we're iterating over each group. This seems avoidable.

I believe we could change this function to just accept a single name (rather than a list) and then have a special case:

if isna(name):
    return self.indices.get(np.nan, [])

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think self.indices.get(np.nan, []) won't work as the Nan value in the self.indices can not be accessed reliable before changing the keys from Nan to np.nan. I think I have a working solution though. Will supply the updated version of the PR tomorrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think self.indices.get(np.nan, []) won't work as the Nan value in the self.indices can not be accessed reliable before changing the keys from Nan to np.nan.

Isn't this what I suggested to do in #63178 (comment)

Copy link
Author

@koskampt koskampt Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misread your first comment from two days ago. To make sure we are on the same page, we can change the function _get_indices(self, names) to _get_indices(self, name). Changing the list for a single name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! It is only ever used with a single name today.

names = (converter(name) for name in names)

return [self.indices.get(name, []) for name in names]
indices = {np.nan if isna(k) else k: v for k, v in self.indices.items()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems better to do this on indices cached property directly, and only in the case where there is a NaN value with if not self.dropna and self.result_index.hasnans.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will adjust.

@koskampt koskampt force-pushed the bug-fix-grouby-with-none-values-with-filter branch from edd8a1f to 8d2126a Compare November 25, 2025 19:57
@rhshadrach
Copy link
Member

rhshadrach commented Nov 25, 2025

@koskampt - I opened #63202 to give some idea of what I'm thinking. If you like that, can incorporate it here. But still open to alternative solutions that do not iterate through indices within _get_indices for the reasons provided.

Even with such a solution, will still want to see the result of running the groupby ASVs to evaluate performance impact. I can also help assist here if desired.

@koskampt
Copy link
Author

@rhshadrach I had a look at your pull request and incorporated your suggestions in mine. I also made the change _get_indices(self, names) to _get_indices(self, name).

I am not familiar with the (groupby) ASVs, but I guess it referring to this: https://pandas.pydata.org/community/benchmarks.html. Help would be greatly appreciated, although I will through the docs by myself first.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with the (groupby) ASVs, but I guess it referring to this: https://pandas.pydata.org/community/benchmarks.html. Help would be greatly appreciated, although I will through the docs by myself first.

Correct - if you're using conda for your virtual environment, then this should be sufficient:

asv continuous -f 1.1 upstream/main HEAD -b ^groupby


@final
def _get_indices(self, names):
def _get_indices(self, name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this _get_index and remove the other _get_index entirely.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

elif has_mi:
# MultiIndex has no efficient way to tell if there are NAs
result = {
tuple(np.nan if isna(comp) else comp for comp in key): value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tuple(np.nan if isna(comp) else comp for comp in key): value
# error: "Hashable" has no attribute "__iter__" (not iterable)
tuple(np.nan if isna(comp) else comp for comp in key): value # type: ignore[attr-defined]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the change.

@koskampt koskampt force-pushed the bug-fix-grouby-with-none-values-with-filter branch from 2ef342b to f7c5e23 Compare November 29, 2025 16:04
@koskampt
Copy link
Author

I was able to get the asv up and running (a couple of days ago). I will run the benchmark with the below command and report back the results.

asv continuous -f 1.1 upstream/main HEAD -b ^groupby

@koskampt
Copy link
Author

Just checking, I also went through asv_bench/benchmarks/groupby.py file but couldn't see a specific benchmark test that cover the case where dropna = False, there are no values and indices is called. Am I missing something or should we add a new benchmark in order to test the performance impact of the change in this pull request?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Inconsistent behavior of Groupby with None values with filter

2 participants