-
Notifications
You must be signed in to change notification settings - Fork 876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FPGrowth/FPMax and Association Rules with the existence of missing values (#1004) #1106
Conversation
e39df31
to
7e1aab4
Compare
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
a671f6a
to
9cae476
Compare
@rasbt In metrics such as Zhang's and Jaccard, if |
7d83053
to
260517e
Compare
Just wanted to say thanks for the PR! I first bookmarked it 2 weeks ago, but then I got sick and am just slowly catching up. |
ab9e6e4
to
99dad90
Compare
No worries at all, my pleasure to contribute on your project! No worries at all, I am glad you are feeling better now. I resolved and made the changes you mentioned above, feel free to have another look and let me know if there's anything else! |
Thanks for addressing these! I just had 2 follow-up comments above, but no rush. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work! And thanks so much for addressing all the minor concerns I have.
No worries at all, my pleasure in contributing. Until next time! |
fixed #1004 |
Memory usage regression after PR #1106 We noticed significant memory usage increase in our production pipeline using fpgrowth algorithm after recent updates. Details:
This suggests potential memory regression after PR #1106 (merged Oct 23, 2024). Could you please investigate if there are any unintended memory usage increases from these changes? Environment:
|
Thanks for letting us know. I will take a look at this in the next few weeks and see what's causing this. In the meantime, I recommend sticking to If you have any ideas @zazass8 let me know |
I investigated the issue. It seems that in Furthermore, on line 40 where I used Finally, in @rasbt if you have any ideas, let me know as well. Thank you for pointing this out @jraymartinez. |
Thanks for the analysis @zazass8! I would say we should maybe investigate why |
Why do we add num_itemsets as required input for association_rules? It can be inferred from df size so at least we should make it optional IMO. |
Firstly, just to make it clearer, The reason it was added is because if we use the algorithm when the input dataset includes missing values the metrics that are generated along with the association rules are updated differently (please have a look at the updated formulae on |
…lues (rasbt#1004) (rasbt#1106) * Updated FPGrowth/FPMax and Association Rules with the existence of missing values * Re-structure and document code * Update unit tests * Update CHANGELOG.md * Modify the corresponding documentation in Jupyter notebooks * Final modifications
Description
This update intends to implement the FP-Growth and FP-Max algorithms from frequent_patterns with the possibility of missing values in the input dataset. The code implements the same structure and logic of the algorithms, while computing the support metric as in "ignoring" the missing values in the data. That gives a more realistic indication of the frequency of existence in the items/itemsets that are generated from the algorithm. Given the output of the algorithm, the corresponding association rules and metrics are also updated taking into account the existence of missing values.
The input accepted for this implementation is a
pandas.DataFrame
that accepts binary input as 0/1 or True/False (as it was originally), as well as 0/1/NaN or True/False/NaN, where NaN isnumpy.nan
.Please also find attached the corresponding paper conducting this research of association rule mining with the existence of missing values.
ragel1998.pdf
Related issues or pull requests
Please use this link for the corresponding discussion of this new feature in the issue tracker.
Pull Request Checklist
./docs/sources/CHANGELOG.md
file (if applicable)./mlxtend/*/tests
directories (if applicable)mlxtend/docs/sources/
(if applicable)PYTHONPATH='.' pytest ./mlxtend -sv
and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g.,PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv
)flake8 ./mlxtend