Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

patch: group by n_unique #917

Merged
merged 13 commits into from
Sep 6, 2024
Merged

patch: group by n_unique #917

merged 13 commits into from
Sep 6, 2024

Conversation

FBruzzesi
Copy link
Member

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

Attempt for #915, comments in the code

Comment on lines 20 to 21
"std": "stddev",
"var": "variance",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found these other two cases while reading pyarrow grouped table aggregations (https://arrow.apache.org/docs/python/compute.html#py-grouped-aggrs)

Comment on lines 20 to 28
def n_unique() -> dd.Aggregation:
import dask.dataframe as dd # ignore-banned-import

return dd.Aggregation(
name="nunique",
chunk=lambda s: s.apply(lambda x: list(set(x))),
agg=lambda s0: s0.obj.groupby(level=list(range(s0.obj.index.nlevels))).sum(),
finalize=lambda s1: s1.apply(lambda final: len(set(final))),
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from the dask documentation itself.
For some reason, nunique kw only works in

df_dd.groupby("a").b.nunique()

but not in agg context

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for digging this up in the dask docs!

I think this is way more complex than it needs to be - i've pushed something much simpler which seems to work

@FBruzzesi FBruzzesi mentioned this pull request Sep 6, 2024
10 tasks
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @FBruzzesi !

the main issue missing was the handling of null values, but i've added some commits to take care of that

@MarcoGorelli MarcoGorelli merged commit ad5616a into main Sep 6, 2024
24 checks passed
@FBruzzesi FBruzzesi deleted the patch/group-by-n-unique branch September 7, 2024 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: n_unique in groupby-agg raises
2 participants