Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Force counts_per_func to nearest prime > counts_per_func ? #66

Open
pik opened this issue Mar 18, 2015 · 0 comments
Open

Force counts_per_func to nearest prime > counts_per_func ? #66

pik opened this issue Mar 18, 2015 · 0 comments

Comments

@pik
Copy link

pik commented Mar 18, 2015

In the paper referenced for the bloom filter hashing algorithm used (https://github.com/bitly/dablooms/blob/master/src/dablooms.c#L185) (Kirsch, Mitzenmacher [2006]) - the recommended algorithm is given by:

gi(x)=h1(x)+ih2(x)(mod p)

From my reading it seems that their theoretical estimate for collision probabilities will only hold provided that counts_per_func is prime. It is also notable that 64 bits of entropy from MurmurHash3_x64_128 is discarded without being used. I'm not sure how significant the difference is on average, but it may make sense to either:
a) force counts_per_func to the nearest prime greater than counts_per_func
or
b) use a triple hashing algorithm taking advantage of the unused entropy provided by murmur hash

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant