groupby apply is limited to 5000 rows #217

Open
davhin opened this issue Jul 11, 2023 · 3 comments
Labels
bug Something isn't working

davhin commented Jul 11, 2023

Apologies if this counts as a duplicate of #202.

Setting up the data:

import numpy as np
import pandas as pd
import swifter  # registers the .swifter accessor

# NOTE: `index` is not defined in this snippet
data = {f'col{l}': [np.array([i, j, k, l]) for i in range(11) for j in range(31) for k in range(15)] for l in range(4)}
df = pd.DataFrame(data, index=index)

Now this fails:

df.iloc[:5001].swifter.groupby(level=0, group_keys=False).apply(lambda x: x)

but it succeeds with only 5000 rows.
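For anyone trying to reproduce this, here is a minimal sketch of the setup with the missing piece filled in. The post does not show how `index` was built, so the three-level `MultiIndex` below (and its level names `i`, `j`, `k`) is an assumption chosen only to match the 11 × 31 × 15 = 5115 rows of data. The swifter call is left as a comment because it needs swifter and ray installed; the plain-pandas equivalent is run instead to show that level-based grouping itself is fine at this size.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: the original post does not define `index`. A three-level
# MultiIndex over the same ranges as the data is used here for illustration.
index = pd.MultiIndex.from_product(
    [range(11), range(31), range(15)], names=["i", "j", "k"]
)

# Same construction as in the post: each cell holds a small NumPy array.
data = {
    f"col{l}": [
        np.array([i, j, k, l])
        for i in range(11)
        for j in range(31)
        for k in range(15)
    ]
    for l in range(4)
}
df = pd.DataFrame(data, index=index)

subset = df.iloc[:5001]

# Plain pandas groups on the first index level without trouble.
result = subset.groupby(level=0, group_keys=False).apply(lambda x: x)

# The failing call from the post (requires swifter and ray):
# subset.swifter.groupby(level=0, group_keys=False).apply(lambda x: x)
```

If the real index has a different shape, only the `index = ...` line should need to change.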


```
File ~/projects/backend/.venv/lib/python3.11/site-packages/swifter/swifter.py:638, in GroupBy.apply(self, func, *args, **kwds)
    635     return self._obj_pd.groupby(self._by, axis=self._axis, **self._grpby_kwargs).apply(func, *args, **kwds)
    637 # Swifter logic can't accurately estimate groupby applies, so always parallelize
--> 638 return self._ray_apply(func, *args, **kwds)

File ~/projects/backend/.venv/lib/python3.11/site-packages/swifter/swifter.py:622, in GroupBy._ray_apply(self, func, *args, **kwds)
    619 def _ray_apply(self, func, *args, **kwds):
    620     import ray
--> 622     chunks = self._get_chunks()
    623     ray_submit_apply = partial(self._ray_submit_apply, chunks=chunks, func=func, *args, **kwds)
    624     apply_chunks = (
    625         self._ray_progress_apply(ray_submit_apply, len(chunks)) if self._progress_bar else ray_submit_apply()
    626     )

File ~/projects/backend/.venv/lib/python3.11/site-packages/swifter/swifter.py:591, in GroupBy._get_chunks(self)
    590 def _get_chunks(self):
--> 591     subset_df = self._obj_pd.index if self._grpby_index else self._obj_pd[self._by[0]]
    592     unique_groups = subset_df.unique()
    593     n_splits = min(len(unique_groups), self._npartitions)

TypeError: 'NoneType' object is not subscriptable
```

Any insight appreciated :)

jmcarpenter2 (Owner) commented Jul 20, 2023

Hey @davhin, thanks for raising this issue and providing a reproducible example. This is an oversight in my implementation of the groupby apply: I failed to incorporate the `level` parameter appropriately and only ensured that the `by` parameter worked. I really appreciate you finding this. I will work on a patch shortly.
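To illustrate the failure mode described here: when only `level` is passed, `by` stays `None`, and indexing it with `[0]` raises exactly the `TypeError` from the traceback. The sketch below is not swifter's code. `group_keys_for_chunking` is a hypothetical helper that shows one way the unique group keys could be looked up for either `by` or `level`, using `Index.get_level_values`.

```python
import pandas as pd


def group_keys_for_chunking(df, by=None, level=None):
    """Hypothetical helper (not part of swifter): unique group keys for by= or level=."""
    if level is not None:
        # Grouping on an index level: read the keys from the index.
        return df.index.get_level_values(level).unique()
    if by is None:
        raise ValueError("either `by` or `level` must be given")
    by = [by] if isinstance(by, str) else list(by)
    # Like the line in the traceback, only the first `by` key is considered.
    return pd.Index(df[by[0]].unique())


idx = pd.MultiIndex.from_product([["a", "b"], [0, 1, 2]], names=["outer", "inner"])
df = pd.DataFrame({"x": range(6)}, index=idx)

# What happens when `by` is None and the code still indexes into it:
by = None
try:
    by[0]
except TypeError as exc:
    message = str(exc)

# A level-aware lookup finds the keys without touching `by`.
level_keys = list(group_keys_for_chunking(df, level=0))

# The existing by-column path still works on a flat frame.
by_keys = list(group_keys_for_chunking(df.reset_index(), by="outer"))
```

Both paths return the same keys here, which is the property a level-aware fix would need before splitting the groups into chunks.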

jmcarpenter2 self-assigned this Jul 20, 2023
jmcarpenter2 added the bug (Something isn't working) label Jul 20, 2023

davhin (Author) commented Jul 20, 2023

Oh, thank you so much! Glad the example was of service

KeremAslan commented

Any updates on this? I found that this still applies with 1.4.0.
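Until a patch lands, one possible workaround is to turn the index level into a column and group with `by` instead, since the owner's comment says the `by` path was the one verified to work. The sketch below only demonstrates with plain pandas that the two groupings are equivalent. The frame and the level names `outer` and `inner` are made up for illustration, and the swifter call in the trailing comment is untested.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["a", "b"], [0, 1, 2]], names=["outer", "inner"])
df = pd.DataFrame({"x": range(6)}, index=idx)

# Move the grouping level out of the index so it can be passed as `by`.
flat = df.reset_index(level="outer")

by_result = flat.groupby("outer")["x"].sum()
level_result = df.groupby(level="outer")["x"].sum()

# Untested assumption: the same reshaping applied before the swifter call,
# e.g. flat.swifter.groupby("outer").apply(func)
```

For an index with unnamed levels, `reset_index(level=0)` names the new column `level_0`, so that is the name to pass as `by`.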
