- This project implements the novel CVM counting algorithm described here.
- The paper itself can be found here
- This is a probabilistic algorithm optimized for memory usage, making it ideal for large streams of data.
- Pick a desired accuracy (ε)
- This is how close you want your estimate to be to the true number of distinct elements. A larger ε means you want a more precise estimate. For example, ε = 0.95 means you want your estimate to be within 5% of the actual value.
- Pick a confidence (1-δ)
- This is your level of certainty that the algorithm's estimate will fall within the desired accuracy range. A higher confidence (e.g., 99.9%) means you're very sure the estimate will be accurate, while a lower confidence (e.g., 90%) means there's a higher chance the estimate might be outside the desired range.
- Estimate the total number of words in the stream (m)
- Input a file path
- Put a .txt file in the
input-files
directory (Shakespeare's Hamlet is provided already in that directory as a sample)
- Put a .txt file in the
Then run with
<yarn | npm | pnpm> run calc <ε> <1-δ> <m> <file path>
For example:
yarn run calc .99 .90 32000 ./input-files/hamlet-shakespeare.txt
- This is a probabilistic algorithm, so accuracy will never be 100% unless the buffer size exceeds the number of unique words in the input file.