
added bottleneck for nan calculations #306

Merged: 13 commits merged into neurodata:main from bottleneck-for-nans on Aug 2, 2024

Conversation

ryanhausen
Contributor

Reference Issues/PRs

None

What does this implement/fix? Explain

Changes nan-related calculations in treeple.stats.utils to use bottleneck rather than numpy. In my testing, this seems to offer a significant performance speedup; see the attached doc:
profiling.pdf

I have also attached the cProfile outputs for the benchmarking in the above PDF.
profiles.zip
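For context, bottleneck provides drop-in C implementations of numpy's nan-aware reductions (e.g. `nanmean`, `nansum`, `anynan`). A minimal sketch of the guarded-import pattern such a change typically uses, assuming bottleneck may or may not be installed:

```python
import numpy as np

try:
    import bottleneck as bn  # optional, faster C implementations
    nanmean = bn.nanmean
except ImportError:
    nanmean = np.nanmean  # pure-numpy fallback

x = np.array([1.0, np.nan, 3.0])
print(nanmean(x))  # → 2.0; NaN entries are ignored
```

Either branch yields the same numeric result, so callers do not need to know which backend is active.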

Any other comments?

I am opening this as a draft because I am not sure whether you want to make bottleneck a required dependency or an optional one, and if optional, how you want to categorize it.

Looking forward to your feedback!

@adam2392
Collaborator

I would categorize bottleneck as an optional dependency within a new category called 'extra'.

Feel free to add the relevant entry here: https://github.com/neurodata/treeple/blob/main/pyproject.toml

In addition, can you add a CHANGELOG entry?
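For illustration, an optional-dependency group in pyproject.toml might look like the following. This is a sketch: the group name 'extra' follows the comment above, but the surrounding layout of treeple's actual pyproject.toml is an assumption.

```toml
[project.optional-dependencies]
extra = [
    "bottleneck",
]
```

With this in place, `pip install treeple[extra]` would pull in bottleneck, while a plain install would not.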


codecov bot commented Jul 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.66%. Comparing base (dd28c41) to head (7244b98).
Report is 4 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #306      +/-   ##
==========================================
+ Coverage   78.53%   78.66%   +0.13%
==========================================
  Files          24       24
  Lines        2250     2264      +14
  Branches      413      417       +4
==========================================
+ Hits         1767     1781      +14
  Misses        352      352
  Partials      131      131
```


@adam2392
Collaborator

Do you know how other packages test optional dependencies? It would be great to add a unit-test that runs the stuff w/ and w/o bottleneck to assert the answer is the same.

@ryanhausen ryanhausen marked this pull request as ready for review August 1, 2024 13:18
@ryanhausen
Contributor Author

ryanhausen commented Aug 1, 2024

EDIT: Sorry, I didn't notice your comment above. Yeah, I wanted to add a test too. Right now bottleneck only touches two things, _non_nan_samples and _parallel_build_null_forests. I added a test for _non_nan_samples, but _parallel_build_null_forests does quite a bit and doesn't seem to have its own test. It could be refactored into multiple more modular functions that could each be tested separately, but to follow the existing testing structure, I modified one of the other tests to turn bottleneck off/on.

@adam2392 Would you mind taking a look again? I made a couple of changes to make it testable, added a test, and changed an existing test. I am failing codecov, but I'm not sure where; the link from codecov isn't showing me the file. I am sure it's me.
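A common pattern for testing an optional dependency is to route the computation through a flag and assert that both code paths agree. A hedged sketch of that idea: the `non_nan_samples` helper below is hypothetical, loosely modeled on what `_non_nan_samples` does, and is not treeple's actual implementation.

```python
import numpy as np

try:
    import bottleneck as bn
    HAS_BOTTLENECK = True
except ImportError:
    HAS_BOTTLENECK = False

def non_nan_samples(arr, use_bn=True):
    # Hypothetical helper: return indices of rows containing no NaNs.
    if use_bn and HAS_BOTTLENECK:
        row_has_nan = bn.anynan(arr, axis=1)        # bottleneck's C loop
    else:
        row_has_nan = np.isnan(arr).any(axis=1)     # numpy fallback
    return np.nonzero(~row_has_nan)[0]

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
# The test: both backends must return identical results.
assert np.array_equal(non_nan_samples(X, use_bn=True),
                      non_nan_samples(X, use_bn=False))
```

Parametrizing a test over `use_bn` (e.g. with `pytest.mark.parametrize`) exercises both branches in one run when bottleneck is installed, and degrades gracefully when it is not.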

@adam2392
Collaborator

adam2392 commented Aug 1, 2024

It shows this: [Screenshot 2024-08-01 at 9:28:44 AM]

@ryanhausen
Contributor Author

ryanhausen commented Aug 1, 2024

Interesting, ok thanks. I'll fix it.

@ryanhausen
Contributor Author

@adam2392 I fixed coverage locally. However, by design, bottleneck isn't included in the build because it's optional. I can add it to the build or test requirements files and that would force it into the CI process. Do you have a preference for which one?

@adam2392
Collaborator

adam2392 commented Aug 1, 2024

Yes in the CI that tests coverage, we should add the extra group of installs.

@ryanhausen
Contributor Author

@adam2392 I got it added to the build step by overriding pip install in spin. It looks like the mac builds for Python 3.9/3.10 aren't happy about something, but the others are ok. I don't have a mac to test it. Have you seen this behavior before?

Collaborator

@adam2392 adam2392 left a comment


I would remove the changes made in the spin commands. In the GH actions workflow for build_and_test_slow, add a pip install for 'extras' where the installation occurs:

`pip install .[extras]`

@ryanhausen
Contributor Author

build_and_test_slow uses spin for the install too, right? Would installing the package via pip, rather than using spin, not create other issues?

@ryanhausen
Contributor Author

Would you mind rerunning those actions? If the failure doesn't go away, then maybe there is an issue with bottleneck that we need to know about.

@adam2392
Collaborator

adam2392 commented Aug 1, 2024

> build_and_test_slow uses spin for the install too right? Would installing the package via pip not create other issues rather than using spin?

No; for some unfortunate reasons, we use pip install from a requirements file:

```yaml
- name: Install Python packages
  run: |
    pip install spin
    spin setup-submodule
    pip install compilers
    pip install -r build_requirements.txt
    pip install -r test_requirements.txt
```

So I would just add a `pip install .[extra]` step there, revert the changes in .spin/cmds.py, and see if that addresses the coverage issue.

Note that `./spin install` will not install anything besides the treeple package and its required dependencies, so all extra installations, such as those for docs, testing, and style checks, are separate.
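Concretely, the amended install step might look like the following. This is a sketch: the exact layout of the build_and_test_slow workflow is assumed from the snippet above.

```yaml
- name: Install Python packages
  run: |
    pip install spin
    spin setup-submodule
    pip install compilers
    pip install -r build_requirements.txt
    pip install -r test_requirements.txt
    pip install .[extra]   # pull in optional extras such as bottleneck
```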

@adam2392
Collaborator

adam2392 commented Aug 1, 2024

Now that I think about it, you could also add bottleneck to the test group, since we always want it for unit testing. If you go this route, you would just add it to the test_requirements.txt file as well.

@ryanhausen
Contributor Author

@adam2392 Would you mind taking a look when you have a chance?

Collaborator

@adam2392 adam2392 left a comment


Thanks for the PR! @ryanhausen LGTM once you fix the minor comment.

Collaborator


I think we're on v0.10 now, so you can just move this diff to that file.

@ryanhausen
Contributor Author

@adam2392 Sorry, I missed your comment. I merged with the latest from main and updated the version in the docs. It looks like there was a weird error with one of the Mac builds, but it doesn't seem related to the code.

@adam2392 adam2392 merged commit e8c7de5 into neurodata:main Aug 2, 2024
34 of 35 checks passed
@adam2392
Collaborator

adam2392 commented Aug 2, 2024

Cool. Thanks for the PR @ryanhausen! Can you take a look at what the differential is when using jovo's suggestion?

I.e., pre-allocate fewer NaNs than needed.

@ryanhausen
Contributor Author

Yep!

@ryanhausen ryanhausen deleted the bottleneck-for-nans branch August 14, 2024 13:14