
added bottleneck for nan calculations #306

Merged: 13 commits merged into neurodata:main from bottleneck-for-nans on Aug 2, 2024

Conversation

ryanhausen
Contributor

Reference Issues/PRs

None

What does this implement/fix? Explain

Changes nan-related calculations in treeple.stats.utils to use bottleneck rather than numpy. In my testing, this seems to offer a significant performance speedup; see the attached doc:
profiling.pdf

I have also attached the cProfile outputs for the benchmarking in the above PDF.
profiles.zip
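For context, bottleneck provides drop-in C implementations of numpy's nan-aware reductions (e.g. `nanmean`, `nansum`, `anynan`). A minimal sketch of the guarded-import pattern such a change typically uses, assuming bottleneck may or may not be installed:

```python
import numpy as np

try:
    import bottleneck as bn  # optional, faster C implementations
    nanmean = bn.nanmean
except ImportError:
    nanmean = np.nanmean  # pure-numpy fallback

x = np.array([1.0, np.nan, 3.0])
print(nanmean(x))  # → 2.0; NaN entries are ignored
```

Either branch yields the same numeric result, so callers do not need to know which backend is active.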

Any other comments?

I am opening this as a draft because I am not sure whether you want to make bottleneck a required dependency or an optional one, and if optional, how you want to categorize it.

Looking forward to your feedback!

@adam2392
Collaborator

I would categorize bottleneck as an optional dependency within a new category called 'extra'.

Feel free to add the relevant entry here: https://github.com/neurodata/treeple/blob/main/pyproject.toml

In addition, can you add a CHANGELOG entry?
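For illustration, an optional-dependency group in pyproject.toml might look like the following. This is a sketch: the group name 'extra' follows the comment above, but the surrounding layout of treeple's actual pyproject.toml is an assumption.

```toml
[project.optional-dependencies]
extra = [
    "bottleneck",
]
```

With this in place, `pip install treeple[extra]` would pull in bottleneck, while a plain install would not.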


codecov bot commented Jul 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.66%. Comparing base (dd28c41) to head (7244b98).
Report is 4 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #306      +/-   ##
==========================================
+ Coverage   78.53%   78.66%   +0.13%
==========================================
  Files          24       24
  Lines        2250     2264      +14
  Branches      413      417       +4
==========================================
+ Hits         1767     1781      +14
  Misses        352      352
  Partials      131      131
```


@adam2392
Collaborator

Do you know how other packages test optional dependencies? It would be great to add a unit-test that runs the stuff w/ and w/o bottleneck to assert the answer is the same.

@ryanhausen ryanhausen marked this pull request as ready for review August 1, 2024 13:18
@ryanhausen
Contributor Author

ryanhausen commented Aug 1, 2024

EDIT: Sorry, I didn't notice your comment above. Yeah, I wanted to add a test too. Right now bottleneck only touches two things, _non_nan_samples and _parallel_build_null_forests. I added a test for _non_nan_samples, but _parallel_build_null_forests does quite a bit and doesn't seem to have its own test. It could be refactored into multiple more modular functions that could each be tested separately, but to follow the existing testing structure, I modified one of the other tests to turn bottleneck off/on.

@adam2392 Would you mind taking a look again? I made a couple of changes to make it testable, added a test, and changed an existing test. I am failing codecov, but I'm not sure where; the link from codecov isn't showing me the file. I am sure it's me.
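A common pattern for testing an optional dependency is to route the computation through a flag and assert that both code paths agree. A hedged sketch of that idea: the `non_nan_samples` helper below is hypothetical, loosely modeled on what `_non_nan_samples` does, and is not treeple's actual implementation.

```python
import numpy as np

try:
    import bottleneck as bn
    HAS_BOTTLENECK = True
except ImportError:
    HAS_BOTTLENECK = False

def non_nan_samples(arr, use_bn=True):
    # Hypothetical helper: return indices of rows containing no NaNs.
    if use_bn and HAS_BOTTLENECK:
        row_has_nan = bn.anynan(arr, axis=1)        # bottleneck's C loop
    else:
        row_has_nan = np.isnan(arr).any(axis=1)     # numpy fallback
    return np.nonzero(~row_has_nan)[0]

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
# The test: both backends must return identical results.
assert np.array_equal(non_nan_samples(X, use_bn=True),
                      non_nan_samples(X, use_bn=False))
```

Parametrizing a test over `use_bn` (e.g. with `pytest.mark.parametrize`) exercises both branches in one run when bottleneck is installed, and degrades gracefully when it is not.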

@adam2392
Collaborator

adam2392 commented Aug 1, 2024

It shows this: [Screenshot 2024-08-01 at 9:28:44 AM]

@ryanhausen
Contributor Author

ryanhausen commented Aug 1, 2024

Interesting, ok thanks. I'll fix it.

@ryanhausen
Contributor Author

@adam2392 I fixed coverage locally. However, by design, bottleneck isn't included in the build because it's optional. I can add it to the build or test requirements files and that would force it into the CI process. Do you have a preference for which one?

@adam2392
Collaborator

adam2392 commented Aug 1, 2024

Yes in the CI that tests coverage, we should add the extra group of installs.

@ryanhausen
Contributor Author

@adam2392 I got it added to the build step by overriding pip install in spin. It looks like the mac builds for Python 3.9/3.10 aren't happy about something, but the others are ok. I don't have a mac to test it. Have you seen this behavior before?

Collaborator

@adam2392 adam2392 left a comment


I would remove the changes made in the spin commands. In the GH actions workflow for build_and_test_slow, add a pip install for 'extras' where the installation occurs:

`pip install .[extras]`

@ryanhausen
Contributor Author

build_and_test_slow uses spin for the install too, right? Would installing the package via pip, rather than using spin, not create other issues?

@ryanhausen
Contributor Author

Would you mind rerunning those actions? If the failure doesn't go away, then maybe there is an issue with bottleneck that we need to know about.

@adam2392
Collaborator

adam2392 commented Aug 1, 2024

> build_and_test_slow uses spin for the install too right? Would installing the package via pip not create other issues rather than using spin?

No; for some unfortunate reasons, we use pip install from a requirements file:

```yaml
- name: Install Python packages
  run: |
    pip install spin
    spin setup-submodule
    pip install compilers
    pip install -r build_requirements.txt
    pip install -r test_requirements.txt
```

So I would just add a `pip install .[extra]` step there, revert the changes in .spin/cmds.py, and see if that addresses the coverage issue.

Note that `./spin install` will not install anything besides the treeple package and its required dependencies, so all extra installations, such as those for docs, testing, and style checks, are separate.
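Concretely, the amended install step might look like the following. This is a sketch: the exact layout of the build_and_test_slow workflow is assumed from the snippet above.

```yaml
- name: Install Python packages
  run: |
    pip install spin
    spin setup-submodule
    pip install compilers
    pip install -r build_requirements.txt
    pip install -r test_requirements.txt
    pip install .[extra]   # pull in optional extras such as bottleneck
```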

@adam2392
Collaborator

adam2392 commented Aug 1, 2024

Now that I think about it, you could also add bottleneck to the test group, since we always want it for unit testing. If you go this route, you would just add it to the test_requirements.txt file as well.

@ryanhausen
Contributor Author

@adam2392 Would you mind taking a look when you have a chance?

Collaborator

@adam2392 adam2392 left a comment


Thanks for the PR! @ryanhausen LGTM once you fix the minor comment.

Collaborator


I think we're on v0.10 now, so you can just move this diff to that file.

@ryanhausen
Contributor Author

@adam2392 Sorry, I missed your comment. I merged with the latest from main and updated the version in the docs. It looks like there was a weird error with one of the Mac builds, but it doesn't seem related to the code.

@adam2392 adam2392 merged commit e8c7de5 into neurodata:main Aug 2, 2024
34 of 35 checks passed
@adam2392
Collaborator

adam2392 commented Aug 2, 2024

Cool. Thanks for the PR @ryanhausen! Can you take a look at what the differential is when using jovo's suggestion?

I.e., pre-allocate fewer NaNs than needed.

@ryanhausen
Contributor Author

Yep!

@ryanhausen ryanhausen deleted the bottleneck-for-nans branch August 14, 2024 13:14