Try quick select algorithm for KthGreatest implementation #543
Comments
I've partially implemented this in #603. I based much of the quickselect implementation on this excellent gist: https://gist.github.com/unnikked/14c19ba13f6a4bfd00a3 My latest iteration at time of writing is here: elastiknn/elastiknn-lucene/src/main/java/com/klibisz/elastiknn/search/QuickSelect.java Lines 1 to 46 in e6208a5
The benchmark is here: Lines 41 to 50 in e6208a5
Unfortunately, this particular quickselect implementation is actually slower than just sorting. I suspect much of the cost comes from having to make a full copy of the array on every iteration. The copy is necessary because the quickselect method modifies the array (swapping values around) to compute its result, whereas the ArrayHitCounter expects those values to be immutable.
Quickselect is about 30% faster when I switch from a fixed pivot to a random pivot, line 19: elastiknn/elastiknn-lucene/src/main/java/com/klibisz/elastiknn/search/QuickSelect.java Lines 1 to 50 in eb51d04
But it still doesn't come close to the existing kthGreatest implementation.
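For anyone following along, here is a minimal sketch of the approach described in the comments above: quickselect with a random pivot, run on a defensive copy so the caller's array is left untouched. This is not the QuickSelect.java from the repository; the class name `QuickSelectSketch`, the 1-based `k`, and the Lomuto partition scheme are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical class for illustration; not the repository's QuickSelect.java.
final class QuickSelectSketch {

    // Returns the k-th greatest value (k is 1-based, an assumption of this sketch).
    // The input array is not modified.
    static int kthGreatest(int[] counts, int k) {
        if (k < 1 || k > counts.length) {
            throw new IllegalArgumentException("k must be in [1, counts.length]");
        }
        // Defensive copy: partitioning swaps elements in place.
        int[] a = Arrays.copyOf(counts, counts.length);
        int target = k - 1; // index of the answer if the array were sorted descending
        int lo = 0;
        int hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == target) {
                break;
            } else if (p < target) {
                lo = p + 1;
            } else {
                hi = p - 1;
            }
        }
        return a[target];
    }

    // Lomuto partition in descending order around a randomly chosen pivot.
    // Returns the pivot's final index.
    private static int partition(int[] a, int lo, int hi) {
        int r = ThreadLocalRandom.current().nextInt(lo, hi + 1);
        swap(a, r, hi);
        int pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] > pivot) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

The `Arrays.copyOf` call is the full-array copy mentioned earlier in the thread; it costs O(n) per call regardless of how quickly the selection itself finishes.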
This feels similar to using hashmaps instead of arrays to count hits, summarized in this comment: #160 (comment) Right now I'm benchmarking w/ a dataset of 60k vectors (Fashion-MNIST). Optimizations like quickselect and primitive hashmaps might have a positive impact when I'm working with far more vectors. But Fashion-MNIST is the benchmark I'm trying to optimize for now.
Closing this for now. Might re-open if/when I'm benchmarking on a larger dataset.
Background
I think there could be an opportunity to speed up approximate queries by re-implementing the kthGreatest method using the quickselect algorithm.
At a high level, the kthGreatest method is used to find the kth greatest document frequency. We give it an array of counts, each one representing the number of times a distinct document was matched against a set of query terms. It returns the kth greatest count. Then we perform exact similarity scoring on each of the documents that match or exceed this kth greatest count.
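To make that flow concrete, here is a rough sketch of a sort-based baseline followed by the candidate-selection step. It is not the project's actual kthGreatest code: the class and method names are hypothetical, `k` is assumed to be 1-based, and the array index is assumed to stand in for a document id.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical helper for illustration; names and signatures are assumptions.
final class KthGreatestBaseline {

    // Finds the k-th greatest count (k is 1-based) by sorting a copy of the input.
    static int kthGreatestBySorting(int[] counts, int k) {
        if (k < 1 || k > counts.length) {
            throw new IllegalArgumentException("k must be in [1, counts.length]");
        }
        int[] sorted = Arrays.copyOf(counts, counts.length);
        Arrays.sort(sorted); // ascending order
        return sorted[sorted.length - k];
    }

    // Returns the indices (assumed here to be document ids) whose count matches
    // or exceeds the k-th greatest count. These are the documents that would go
    // on to exact similarity scoring.
    static int[] candidates(int[] counts, int k) {
        int threshold = kthGreatestBySorting(counts, k);
        return IntStream.range(0, counts.length)
                .filter(i -> counts[i] >= threshold)
                .toArray();
    }
}
```

Because of ties, `candidates` can return more than `k` documents: every document whose count equals the threshold is kept.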
There are some good example implementations of quickselect on LeetCode:
Deliverables
Related Issues
Blocked by #525