feat: optimizing the prune function at the apriori_algorithm.py archive #12992
I hadn't realized that the itemset could be a list of lists. As a result, hashing these data structures was not possible, so I switched to using tuples, which are immutable, as keys for the Counter. After this change, I noticed a slight overhead, since each item now needs to be converted into a tuple to be checked within the Counter structure. Nonetheless, there is a significant efficiency gain in the worst-case scenario, and I believe it will also improve performance in average cases. I have not yet tested these other scenarios or generated their corresponding graphs. Below is the graph reflecting the new modification.
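The hashing problem described above can be reproduced in a few lines. This is a minimal sketch with made-up data (the `itemset` contents are hypothetical, not taken from the PR): it shows that a `Counter` rejects list elements and accepts them once each element is converted to a tuple.

```python
from collections import Counter

# Hypothetical data: an itemset whose elements are themselves lists.
itemset = [["milk", "bread"], ["milk", "eggs"], ["milk", "bread"]]

try:
    Counter(itemset)  # lists are mutable, so they cannot be dict keys
    list_error = None
except TypeError as exc:
    list_error = exc

# Converting each element to a tuple makes it hashable.
itemset_counter = Counter(tuple(x) for x in itemset)

print(type(list_error).__name__)
print(itemset_counter[("milk", "bread")])
```

The per-lookup overhead mentioned in the comment comes from the matching `tuple(item)` call that is needed each time a candidate item is checked against the counter.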
Pull Request Overview
This PR optimizes the `prune` function in the Apriori algorithm implementation to improve performance when checking candidate itemsets. The optimization uses `Counter` to precompute item frequencies instead of repeatedly counting occurrences during candidate validation.
Key changes:
- Replaces linear search and counting with hash-based lookup using `Counter`
- Reduces time complexity from O(n * c * i) to O(n + c * i)
- Updates function documentation to reflect the optimization
    >>> prune(itemset, candidates, 3)
    []
    """
    itemset_counter = Counter(tuple(x) for x in itemset)
The tuple conversion is performed twice for the same data - once when creating the Counter and again when checking each item. Consider converting items to tuples consistently or using a different approach to avoid this duplication.
            tupla = tuple(item)
            if tupla not in itemset_counter or itemset_counter[tupla] < length - 1:
The tuple conversion is performed twice for the same data - once when creating the Counter and again when checking each item. Consider converting items to tuples consistently or using a different approach to avoid this duplication.
Describe your change:
Added an optimized version of the `prune` function using `Counter` to improve performance
when checking candidate itemsets for frequent items.
As a benchmark, I gradually increased the size of the `itemset` list to demonstrate
the inefficiency of the original algorithm, which had a complexity of O(n * c * i),
where n is the size of `itemset`, c is the number of candidates, and i is the number of
items in each candidate.
The new solution reduces the complexity to O(n + c * i). Previously, for every item of
every candidate, the algorithm would scan `itemset` once for the membership test (O(n))
and scan it again to count occurrences (O(n)), repeating the same costly work each time it
needed to check a candidate.
To optimize this, I used an auxiliary dictionary (via `Counter`) where each key is an
item and its value is the number of occurrences in `itemset`. This allows both the check
and the count to be performed in constant time, O(1) on average.
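A minimal sketch of the `Counter`-based approach, assuming items are converted to tuples as in the diff fragments in this PR. The function name `prune_with_counter` and the sample inputs are hypothetical; only the `Counter(tuple(x) for x in itemset)` line and the `not in ... or ... < length - 1` condition come from the diff.

```python
from collections import Counter


def prune_with_counter(itemset, candidates, length):
    """Same filtering rule as the original, with hash-based lookups."""
    # One pass over itemset: O(n). Tuples are used because lists are unhashable.
    itemset_counter = Counter(tuple(x) for x in itemset)
    pruned = []
    for candidate in candidates:
        keep = True
        for item in candidate:
            key = tuple(item)  # convert once, reuse for both checks
            if key not in itemset_counter or itemset_counter[key] < length - 1:
                keep = False
                break
        if keep:
            pruned.append(candidate)
    return pruned


# Assumed sample inputs (not confirmed by the PR text).
kept_all = prune_with_counter(["X", "Y", "Z"], [["X", "Y"], ["X", "Z"], ["Y", "Z"]], 2)
kept_none = prune_with_counter(["1", "2", "3", "4"], ["1", "2", "4"], 3)
print(kept_all, kept_none)
```

The explicit `key not in itemset_counter` test is kept on purpose: a `Counter` returns 0 for missing keys, so dropping the membership test would change the result when `length - 1` is 0.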
As a result, the performance improvement is significant, at the cost of a small amount of
additional memory, which is a worthwhile trade-off. This improvement can be observed by
comparing the execution of both algorithms (as shown in the attached image).
Here is the graph comparing both functions:
pruneOptimized_prune_algoritm_results.pdf
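For readers who cannot open the attachment, a comparison of this kind could be run roughly as follows. This is only a sketch: the function names, input sizes, and data layout are assumptions and do not reproduce the author's benchmark, and no timings are claimed here.

```python
import timeit
from collections import Counter


def prune_naive(itemset, candidates, length):
    # Hypothetical reconstruction of the original: two list scans per item.
    return [
        cand
        for cand in candidates
        if all(item in itemset and itemset.count(item) >= length - 1 for item in cand)
    ]


def prune_with_counter(itemset, candidates, length):
    # Hypothetical reconstruction of the optimized version.
    counts = Counter(tuple(x) for x in itemset)

    def frequent(item):
        key = tuple(item)
        return key in counts and counts[key] >= length - 1

    return [cand for cand in candidates if all(frequent(item) for item in cand)]


# Assumed worst case: every candidate item sits at the end of itemset,
# so the naive membership test scans the whole list before succeeding.
n = 2000
itemset = [[i, i + 1] for i in range(n)]
candidates = [[itemset[-1], itemset[-2], itemset[-3]] for _ in range(50)]

naive_result = prune_naive(itemset, candidates, 2)
counter_result = prune_with_counter(itemset, candidates, 2)

naive_time = timeit.timeit(lambda: prune_naive(itemset, candidates, 2), number=3)
counter_time = timeit.timeit(lambda: prune_with_counter(itemset, candidates, 2), number=3)
print(f"naive: {naive_time:.4f}s  counter: {counter_time:.4f}s")
```

Repeating the run for increasing values of `n` and plotting the two timings would give a curve comparable in spirit to the attached graph.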
Unit tests were also conducted on my local machine to ensure the consistency of results between the two methods, but they are not included in this PR.
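Since those tests are not part of the PR, the following is only a hypothetical sketch of what such a consistency check might look like: it feeds random inputs to reconstructions of both versions (names and data ranges are assumptions) and counts disagreements.

```python
import random
from collections import Counter


def prune_naive(itemset, candidates, length):
    # Hypothetical reconstruction of the original list-scanning version.
    return [
        cand
        for cand in candidates
        if all(item in itemset and itemset.count(item) >= length - 1 for item in cand)
    ]


def prune_with_counter(itemset, candidates, length):
    # Hypothetical reconstruction of the Counter-based version.
    counts = Counter(tuple(x) for x in itemset)

    def frequent(item):
        key = tuple(item)
        return key in counts and counts[key] >= length - 1

    return [cand for cand in candidates if all(frequent(item) for item in cand)]


def random_item(rng):
    # Small value range so that repeated items are common.
    return [rng.randint(0, 4), rng.randint(0, 4)]


rng = random.Random(0)  # fixed seed for reproducibility
mismatches = 0
for _ in range(200):
    itemset = [random_item(rng) for _ in range(rng.randint(0, 12))]
    candidates = [
        [random_item(rng) for _ in range(rng.randint(1, 3))]
        for _ in range(rng.randint(0, 6))
    ]
    length = rng.randint(1, 4)
    if prune_naive(itemset, candidates, length) != prune_with_counter(
        itemset, candidates, length
    ):
        mismatches += 1

print("mismatches:", mismatches)
```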
Checklist: