Prefilter + Approximate Nearest Neighbor #374
Replies: 8 comments
-
Hi @gosha1128 . No there is no re-indexing after the pre-filter stage. Elastiknn basically runs the query as it normally would, but is only given access to vectors in the docs which matched the original query. |
Beta Was this translation helpful? Give feedback.
-
Thanks for your prompt response. Let's take the LSH model. Since the LSH index was built with the original dataset, how do you use that index with the pre-filtered set? |
Beta Was this translation helpful? Give feedback.
-
IIRC, Elastiknn is implemented as a custom query that receives a reader for each segment (an index has >= 1 shard, shard has >= segment). When you use pre-filtering, the reader only exposes the docs which matched the pre-filtering to Elastiknn. So then it runs an LSH query as usual but it can only consider the docs which matched the filter as candidates. |
Beta Was this translation helpful? Give feedback.
-
OK. I think I'm closer to understanding - thanks for your patience.
When you build the LSH model, it creates a bunch of hash buckets, each
containing a bunch of the original vectors. The idea, of course, is to
hash in such a way that avoids having to do an exhaustive search ( at the
expense of a small loss in accuracy ). As you know, that's the benefit of
ANN.
Now, if you first pre-filter and then perform this ANN type of search,
isn't it possible that the hash buckets still contain references to
documents that were filtered?
Do you remove those documents from the hash buckets before the search?
…On Tue, Apr 27, 2021 at 12:22 PM Alex Klibisz ***@***.***> wrote:
IIRC, Elastiknn is implemented as a custom query that receives a reader
for each segment (an index has >= 1 shard, shard has >= segment). When you
use pre-filtering, the reader only exposes the docs which matched the
pre-filtering to Elastiknn. So then it runs an LSH query as usual but it
can only consider the docs which matched the filter as candidates.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#257 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADL6CJMH5XPJASZTJI4C7TTK4FF7ANCNFSM43VKA3OQ>
.
|
Beta Was this translation helpful? Give feedback.
-
Your understanding is mostly correct. To be clear, the only time buckets get computed for indexed vectors is when they are originally indexed. After that the only buckets that get computed are for the individual query vector.
That's pretty much correct. "references" is probably not the word I would use. The query vector is hashed and those hash values might be equivalent to the hash values of vectors that were filtered out.
Not quite. Elasticsearch basically just makes them invisible to any query that runs after a filter. So Elastiknn fortunately doesn't have to do anything different than it would on a standard non-pre-filtered query. |
Beta Was this translation helpful? Give feedback.
-
Great thanks. I think I just need to understand a bit more about how Elasticsearch/Lucene works. I think of LSH as a bunch of hash tables. Each hash table is essentially a bunch of references to a subset of the original documents. And the documents in one hash table are all related via the hash function for that table. It seems that somehow the pre-filtered documents are removed from hash tables by Elasticsearch before candidates are generated (?). |
Beta Was this translation helpful? Give feedback.
-
This section in the docs explains a bit more on the LSH strategy used in Elastiknn: https://elastiknn.com/api/#lsh-search-strategy I can add a bit more color here. The short answer is that the hashes are, from Elasticsearch/Lucene's perspective, exactly the same thing as words in a text document. So when you do a filter followed by an approximate LSH query, it's no different then doing a filter followed by a standard term query. Take Angular LSH as an example. There are So say Then at query time, you pick the same |
Beta Was this translation helpful? Give feedback.
-
Closing this to keep things tidy. LMK if you have any other questions. |
Beta Was this translation helpful? Give feedback.
-
Your documentation suggests that you support an approximate query after pre-filter.
"If you determine you need an approximate query for re-scoring, you should ensure that candidates = window_size > size. Ideally candidates is 10x-100x larger than size. "
Do you re-index the documents after the pre-filter stage?
Beta Was this translation helpful? Give feedback.
All reactions