
Conversation

joaoneto9

Describe your change:

Added an optimized version of the prune function using Counter to improve performance
when checking candidate itemsets for frequent items.

As a test base, I used itemset lists of gradually increasing size to demonstrate
the inefficiency of the original algorithm, which had a complexity of O(n * c * i),
where n is the size of the itemset, c is the number of candidates, and i is the
number of items in each candidate.

The new solution reduces the complexity to O(n + c * i). Previously, the algorithm would
iterate over the itemset (O(n)) and count occurrences of each item (another O(n) scan)
every time it needed to check a candidate, repeating these costly operations for every
candidate (roughly the loop sketched below).
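
For reference, this is roughly what the pre-PR check looks like; it is a reconstruction
from the description above, so the exact code in the repository may differ in detail:

    # Rough sketch of the pre-PR check (reconstruction, not the exact repository code):
    # every candidate item triggers a linear scan of the itemset via "in" and .count().
    def prune_original(itemset: list, candidates: list, length: int) -> list:
        pruned = []
        for candidate in candidates:          # c candidates
            is_subsequence = True
            for item in candidate:            # i items per candidate
                # "in" and .count() each walk the whole itemset: O(n) work per item
                if item not in itemset or itemset.count(item) < length - 1:
                    is_subsequence = False
                    break
            if is_subsequence:
                pruned.append(candidate)
        return pruned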

To optimize this, I used an auxiliary dictionary (via Counter) where each key is an
item and its value is the number of occurrences in the itemset. This allows both the
membership check and the count lookup to be performed in constant time, O(1) (see the
sketch below).
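
A minimal sketch of the Counter-based variant described above (the function name is mine,
and the tuple keys reflect the adjustment for unhashable items discussed later in this
thread):

    from collections import Counter

    def prune_optimized(itemset: list, candidates: list, length: int) -> list:
        itemset_counter = Counter(tuple(x) for x in itemset)   # built once: O(n)
        pruned = []
        for candidate in candidates:          # c candidates
            is_subsequence = True
            for item in candidate:            # i items per candidate
                key = tuple(item)
                # membership test and count lookup are both O(1) on average
                if key not in itemset_counter or itemset_counter[key] < length - 1:
                    is_subsequence = False
                    break
            if is_subsequence:
                pruned.append(candidate)
        return pruned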

As a result, the performance improvement is significant, at the cost of a small amount of
additional memory, which is a worthwhile trade-off. The improvement can be observed by
comparing the execution times of both algorithms (see the attached graph).

Here is the graph comparing both functions:
pruneOptimized_prune_algoritm_results.pdf

Unit tests were also conducted on my local machine to ensure the consistency of results between the two methods, but they are not included in this PR.

  • Add an algorithm?
  • Fix a bug or typo in an existing algorithm?
  • Add or change doctests? -- Note: Please avoid changing both code and tests in a single pull request.
  • Documentation change?

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation.
  • If this pull request resolves one or more open issues then the description above includes the issue number(s) with a closing keyword: "Fixes #ISSUE-NUMBER".

@algorithms-keeper algorithms-keeper bot added the tests are failing Do not merge until tests pass label Sep 24, 2025
@algorithms-keeper algorithms-keeper bot added the awaiting reviews This PR is ready to be reviewed label Sep 24, 2025
@algorithms-keeper algorithms-keeper bot removed the tests are failing Do not merge until tests pass label Sep 24, 2025
@joaoneto9
Author

I hadn't realized that the itemset could be a list of lists. Since lists are not hashable, I switched to using tuples, which are immutable and therefore hashable, as keys for the Counter. After this change I noticed a slight overhead, because each item now has to be converted into a tuple before it can be looked up in the Counter.

Nonetheless, there is still a significant efficiency gain in the worst-case scenario, and I believe it will also improve performance in average cases; I have not yet tested those scenarios or generated the corresponding graphs. Below is the graph reflecting the new modification, and after it a small example of the hashing issue.

pruneOptimized_prune_algoritm_results.pdf
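
To make the hashing issue concrete, here is a small standalone example (not code from
the PR): lists cannot be used as Counter keys, while their tuple equivalents can.

    from collections import Counter

    itemset = [["A", "B"], ["A", "C"], ["A", "B"]]
    # Counter(itemset) would raise TypeError: unhashable type: 'list'
    counter = Counter(tuple(x) for x in itemset)
    print(counter)                       # Counter({('A', 'B'): 2, ('A', 'C'): 1})
    print(counter[tuple(["A", "B"])])    # 2 -- each lookup costs one tuple() conversion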

@algorithms-keeper algorithms-keeper bot removed the awaiting reviews This PR is ready to be reviewed label Oct 1, 2025

@Copilot Copilot AI left a comment


Pull Request Overview

This PR optimizes the prune function in the Apriori algorithm implementation to improve performance when checking candidate itemsets. The optimization uses Counter to precompute item frequencies instead of repeatedly counting occurrences during candidate validation.

Key changes:

  • Replaces linear search and counting with hash-based lookup using Counter
  • Reduces time complexity from O(n * c * i) to O(n + c * i)
  • Updates function documentation to reflect the optimization


    >>> prune(itemset, candidates, 3)
    []
    """
    itemset_counter = Counter(tuple(x) for x in itemset)

Copilot AI Oct 1, 2025


The tuple conversion is performed twice for the same data - once when creating the Counter and again when checking each item. Consider converting items to tuples consistently or using a different approach to avoid this duplication.


Comment on lines +54 to +55
    tupla = tuple(item)
    if tupla not in itemset_counter or itemset_counter[tupla] < length - 1:

Copilot AI Oct 1, 2025


The tuple conversion is performed twice for the same data - once when creating the Counter and again when checking each item. Consider converting items to tuples consistently or using a different approach to avoid this duplication.
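
One way to keep the conversions in a single place, along the lines of this suggestion,
is sketched below (a possible follow-up, not code from the PR; the function name is
hypothetical). A deeper fix would be for the caller to build the itemset and candidates
out of tuples from the start, so prune would not need to convert anything, but that
would touch more of the file than this PR intends.

    from collections import Counter

    # Hypothetical variant (not part of the PR): all tuple conversions happen in one
    # place, so the membership/count check only ever sees pre-converted, hashable data.
    def prune_converted_once(itemset: list, candidates: list, length: int) -> list:
        itemset_counter = Counter(tuple(x) for x in itemset)
        pruned = []
        for candidate in candidates:
            converted = [tuple(item) for item in candidate]   # one conversion per item
            if all(item in itemset_counter and itemset_counter[item] >= length - 1
                   for item in converted):
                pruned.append(candidate)
        return pruned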


@algorithms-keeper algorithms-keeper bot added the awaiting reviews This PR is ready to be reviewed label Oct 2, 2025