Add dictionary filter statistics #48

sadikovi · 2017-02-04T04:21:21Z

This PR adds dictionary filter statistics based on java.util.HashSet. Users can now choose between bloom (Bloom filter statistics) and dict (dictionary filter statistics) when creating an index.

Also added/updated filter statistics tests and read correctness tests.

A benchmark on a test dataset of 1,000,000 records with a simple Or(EqualTo(code,2),EqualTo(code,120)) query shows that with Bloom filter statistics Spark scans ~8 files, since Bloom filters can report false positives. With the same dataset and the same filter, dictionary filter statistics lead to reading only 2 files (the two codes are in different files).
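To illustrate why an exact dictionary prunes files so precisely, here is a minimal sketch of per-file lookup with java.util.HashSet. It is not the code from this PR: the class name DictFilterSketch, the methods filesToScan and sampleIndex, the file names, and the code ranges are all hypothetical and chosen only for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: one exact set of column values per file.
public class DictFilterSketch {

    // Returns the files whose dictionary contains at least one requested code,
    // which mirrors an Or(EqualTo(code, a), EqualTo(code, b)) predicate.
    static List<String> filesToScan(Map<String, Set<Integer>> dictPerFile, int... codes) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> entry : dictPerFile.entrySet()) {
            for (int code : codes) {
                if (entry.getValue().contains(code)) {
                    result.add(entry.getKey());
                    break; // one match is enough to keep the file
                }
            }
        }
        return result;
    }

    // Builds an assumed layout: three files, each holding 50 consecutive codes.
    static Map<String, Set<Integer>> sampleIndex() {
        Map<String, Set<Integer>> index = new LinkedHashMap<>();
        for (int file = 0; file < 3; file++) {
            Set<Integer> values = new HashSet<>();
            for (int code = file * 50; code < (file + 1) * 50; code++) {
                values.add(code);
            }
            index.put("part-" + file, values);
        }
        return index;
    }

    public static void main(String[] args) {
        // Codes 2 and 120 live in different files, so exactly two files are kept.
        System.out.println(filesToScan(sampleIndex(), 2, 120)); // prints [part-0, part-2]
    }
}
```

Because a HashSet answers membership exactly, a file is kept only when it really contains one of the values. A Bloom filter can also answer "maybe" for values that are absent, which is consistent with the extra files scanned in the benchmark above.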

Dictionary filters are about 2x slower to create than Bloom filters, and they take more space. Stats for the sample dataset:

bloom filters:
total: 6.6M
metadata: 300K
filter file: 2.2K

dict filters:
total: 16M
metadata: 300K
filter file: 9.9K - 19K

coveralls · 2017-02-04T04:27:11Z

Coverage increased (+0.1%) to 95.341% when pulling 003aafd on dictionary-filter-2 into 32f4a6c on master.

sadikovi added 3 commits February 4, 2017 16:28

add dict filter and tests

c04cef6

add read tests

4daa498

update docs

003aafd

sadikovi merged commit 4fadbc9 into master Feb 4, 2017

sadikovi deleted the dictionary-filter-2 branch February 4, 2017 04:59

sadikovi mentioned this pull request Feb 4, 2017

Add dictionary filter statistics #37

Closed
