Add dictionary filter statistics #48

Merged
merged 3 commits into from
Feb 4, 2017

@sadikovi sadikovi commented Feb 4, 2017

This PR adds dictionary filter statistics based on java.util.HashSet. Users can now choose between bloom (Bloom filter statistics) and dict (dictionary filter statistics) when creating an index.
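
To illustrate the idea, here is a minimal sketch of what a HashSet-backed filter could look like. The class name `DictionaryFilterSketch` and its methods are hypothetical and are not taken from this PR's code. It only shows why a dictionary answers membership exactly, whereas a Bloom filter can return false positives.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a per-file dictionary filter; not the PR's actual class.
final class DictionaryFilterSketch {
    // Exact set of distinct values seen in one file's indexed column.
    private final Set<Object> values = new HashSet<>();

    // Record a column value while the index is being built.
    void update(Object value) {
        values.add(value);
    }

    // Exact membership: false means the file definitely lacks the value,
    // true means it definitely contains it (no false positives).
    boolean mightContain(Object value) {
        return values.contains(value);
    }

    public static void main(String[] args) {
        DictionaryFilterSketch filter = new DictionaryFilterSketch();
        filter.update(2);
        filter.update(7);
        System.out.println(filter.mightContain(2));   // true
        System.out.println(filter.mightContain(120)); // false
    }
}
```

The trade-off is that the set stores every distinct value, so its size grows with column cardinality, while a Bloom filter stays at a fixed size.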

Also added/updated filter statistics tests and read correctness tests.

A benchmark on a test dataset of 1,000,000 records with a simple Or(EqualTo(code,2),EqualTo(code,120)) query shows that with Bloom filter statistics Spark scans ~8 files. On the same dataset with the same filter, dictionary filter statistics lead to reading only 2 files (the two codes are in different files).
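
The pruning behind that result can be sketched roughly as follows, assuming each file has a dictionary of its distinct `code` values. The class `FilePruningSketch`, the method `filesToScan`, and the file names are hypothetical and chosen only for illustration. A file is kept when at least one of the OR-ed values appears in its dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of file pruning for Or(EqualTo(code, a), EqualTo(code, b)).
final class FilePruningSketch {
    // Keep a file only if its dictionary contains at least one of the OR-ed values.
    static List<String> filesToScan(Map<String, Set<Integer>> dictByFile,
                                    List<Integer> orValues) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> entry : dictByFile.entrySet()) {
            for (Integer value : orValues) {
                if (entry.getValue().contains(value)) {
                    selected.add(entry.getKey());
                    break; // one match is enough for an OR filter
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up file names and codes, in insertion order.
        Map<String, Set<Integer>> dictByFile = new LinkedHashMap<>();
        dictByFile.put("part-0", new HashSet<>(Arrays.asList(1, 2, 3)));
        dictByFile.put("part-1", new HashSet<>(Arrays.asList(50, 51)));
        dictByFile.put("part-2", new HashSet<>(Arrays.asList(119, 120)));
        System.out.println(filesToScan(dictByFile, Arrays.asList(2, 120)));
        // prints [part-0, part-2]
    }
}
```

With a Bloom filter in place of the exact set, the same loop could also select files that do not contain either code, which is consistent with the extra files scanned in the benchmark.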

Dictionary filters are about 2x slower to create than Bloom filters, and they take more space. Stats for the sample dataset:

bloom filters:
total: 6.6M
metadata: 300K
filter file: 2.2K

dict filters:
total: 16M
metadata: 300K
filter file: 9.9K - 19K

coveralls commented Feb 4, 2017

Coverage increased (+0.1%) to 95.341% when pulling 003aafd on dictionary-filter-2 into 32f4a6c on master.

@sadikovi sadikovi merged commit 4fadbc9 into master Feb 4, 2017
@sadikovi sadikovi deleted the dictionary-filter-2 branch February 4, 2017 04:59