Prefilter + Approximate Nearest Neighbor #374

sourcesync · 2021-04-27T16:17:36Z


Your documentation suggests that you support an approximate query after pre-filter.

"If you determine you need an approximate query for re-scoring, you should ensure that candidates = window_size > size. Ideally candidates is 10x-100x larger than size."

Do you re-index the documents after the pre-filter stage?

alexklibisz (Maintainer) · 2021-04-27T19:01:02Z

Hi @gosha1128. No, there is no re-indexing after the pre-filter stage. Elastiknn basically runs the query as it normally would, but it is only given access to the vectors in the docs that matched the original filter query.


sourcesync (Author) · 2021-04-27T19:17:34Z

Thanks for your prompt response.

Let's take the LSH model. Since the LSH index was built with the original dataset, how do you use that index with the pre-filtered set?


alexklibisz (Maintainer) · 2021-04-27T19:21:47Z

IIRC, Elastiknn is implemented as a custom query that receives a reader for each segment (an index has >= 1 shard, and a shard has >= 1 segment). When you use pre-filtering, the reader only exposes the docs which matched the pre-filter to Elastiknn. So it runs an LSH query as usual, but it can only consider the docs which matched the filter as candidates.


sourcesync (Author) · 2021-04-27T20:14:18Z

OK. I think I'm closer to understanding - thanks for your patience. When you build the LSH model, it creates a bunch of hash buckets, each containing a bunch of the original vectors. The idea, of course, is to hash in a way that avoids an exhaustive search (at the expense of a small loss in accuracy). As you know, that's the benefit of ANN. Now, if you first pre-filter and then perform this ANN type of search, isn't it possible that the hash buckets still contain references to documents that were filtered out? Do you remove those documents from the hash buckets before the search?



alexklibisz (Maintainer) · 2021-04-27T20:19:36Z

Your understanding is mostly correct. To be clear, the only time buckets get computed for indexed vectors is when they are originally indexed. After that, the only buckets that get computed are the ones for the individual query vector.

"Now, if you first pre-filter and then perform this ANN type of search, isn't it possible that the hash buckets still contain references to documents that were filtered?"

That's pretty much correct, although "references" is probably not the word I would use. The query vector is hashed, and those hash values might be equal to the hash values of vectors that were filtered out.

"Do you remove those documents from the hash buckets before the search?"

Not quite. Elasticsearch basically just makes them invisible to any query that runs after a filter. So Elastiknn fortunately doesn't have to do anything different than it would on a standard non-pre-filtered query.
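
To make that concrete, here is a toy sketch in Python. It is not Elastiknn or Lucene code: the `postings` dict, the term strings, and `lsh_candidates` are all made up for illustration. It only shows the idea that the hash-to-doc mapping is built once and never changes, while the filter decides which docs the query is allowed to see.

```python
from collections import Counter

# Hypothetical toy inverted index: hash term -> ids of docs whose vector
# produced that term. Built once at index time, never rebuilt per query.
postings = {
    "t0:0100": {1, 2, 5},
    "t1:1101": {2, 3},
    "t2:0011": {1, 4, 5},
}

def lsh_candidates(query_terms, visible_docs, num_candidates):
    """Count matching hash terms per doc, skipping docs hidden by the filter."""
    counts = Counter()
    for term in query_terms:
        for doc_id in postings.get(term, ()):
            if doc_id in visible_docs:  # filtered-out docs are simply invisible
                counts[doc_id] += 1
    # Most matching terms first; ties broken by doc id to keep this deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:num_candidates]]

query_terms = ["t0:0100", "t1:1101", "t2:0011"]
all_docs = {1, 2, 3, 4, 5}
matched_filter = {2, 3, 4}  # docs that matched the pre-filter

unfiltered_result = lsh_candidates(query_terms, all_docs, 3)
filtered_result = lsh_candidates(query_terms, matched_filter, 3)
```

Docs 1 and 5 still sit in the postings for the query's terms, but with the filter applied they never become candidates, and nothing had to be removed or re-indexed.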


sourcesync (Author) · 2021-04-27T21:05:29Z

Great, thanks. I think I just need to understand a bit more about how Elasticsearch/Lucene works.

I think of LSH as a bunch of hash tables. Each hash table is essentially a bunch of references to a subset of the original documents. And the documents in one hash table are all related via the hash function for that table. It seems that somehow the pre-filtered documents are removed from hash tables by Elasticsearch before candidates are generated (?).


alexklibisz (Maintainer) · 2021-04-28T01:30:41Z

This section in the docs explains a bit more on the LSH strategy used in Elastiknn: https://elastiknn.com/api/#lsh-search-strategy

I can add a bit more color here. The short answer is that the hashes are, from Elasticsearch/Lucene's perspective, exactly the same thing as words in a text document. So when you do a filter followed by an approximate LSH query, it's no different than doing a filter followed by a standard term query.

Take Angular LSH as an example. There are L hash tables and each hash table uses k bits to generate a single hash. Each of these bits is determined by a random dims-dimensional vector. So your "model" is just L x k randomly generated vectors. Each vector that you index lands on one of two sides of each of these randomly generated vectors: either the 0 side or the 1 side. If you're in two dimensions, you can use terms like "left" and "right", but it generalizes to n dimensions. If your vector is on the 0 side of the random vector, you append a 0 to your hash; otherwise you append a 1.
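
A rough sketch of that step in plain Python, under some assumptions: `make_model` and `hash_bits` are hypothetical names, the random vectors are drawn from a Gaussian, and "the 1 side" is taken to mean a positive dot product. Elastiknn's real implementation may differ on all of these.

```python
import random

def make_model(L, k, dims, seed=0):
    """The 'model': L tables, each holding k random vectors of length dims."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(k)]
        for _ in range(L)
    ]

def hash_bits(table, vec):
    """One bit per random vector: 1 if the dot product is positive, else 0.
    (Which side counts as 1 is an assumption made for this sketch.)"""
    return [
        1 if sum(r * x for r, x in zip(rand_vec, vec)) > 0 else 0
        for rand_vec in table
    ]

model = make_model(L=3, k=4, dims=5)
vec = [0.3, -1.2, 0.8, 0.1, 2.0]

# One list of k bits per hash table.
bits_per_table = [hash_bits(table, vec) for table in model]

# Only the direction of the vector matters, not its length.
scaled_bits = [hash_bits(table, [2 * x for x in vec]) for table in model]
```

Scaling the vector by a positive constant leaves every bit unchanged, which is why this family of hashes suits angular (cosine) similarity.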

So say k=4. For each of the L tables, you would pick 4 random vectors. For each random vector, you check its position against your vector. You end up with a list like [0, 1, 0, 0]. Then you prepend a prefix to represent which of the L hash tables this hash belongs to. Say it's the 9th hash table. Your final hash is [9, 0, 1, 0, 0]. This is basically the value you store in Lucene. There's a bit of optimization so it's not a literal list, but it's basically a list. From Lucene's perspective, it's no different from storing "cat" or "hello". It's just a string of bytes.
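
As an illustration only, the prefixing step could look like the sketch below. `encode_term` is a hypothetical helper and the one-byte-per-element layout is an assumption; as noted above, the real encoding is optimized and is not a literal list.

```python
def encode_term(table_index, bits):
    """Prefix the table index to the hash bits and serialize to raw bytes.
    Assumes table_index < 256 so that it fits in a single byte."""
    return bytes([table_index] + list(bits))

# The example from above: 9th hash table, bits [0, 1, 0, 0].
term = encode_term(9, [0, 1, 0, 0])

# The same bits from a different table produce a different term.
same_bits_other_table = encode_term(3, [0, 1, 0, 0])
```

The prefix is what keeps identical bit patterns from different tables from colliding: a match only counts when both the table and the bits agree.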

Then at query time, you pick the same L x k random vectors (by using a fixed random seed), and you follow the same process to compute hashes. So now you have a bunch of words, and you can run a regular old Lucene query to find any other docs which have those same words.
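
Putting the pieces together, here is a small end-to-end sketch. The helper names, the seed, and the byte encoding are all assumptions for illustration, not Elastiknn's actual code. It shows that rebuilding the model from the same seed gives the same random vectors, so index-time and query-time hashes are comparable as plain terms.

```python
import random

def make_model(L, k, dims, seed):
    """Regenerate the same L x k random vectors from a fixed seed."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(k)]
        for _ in range(L)
    ]

def hash_terms(model, vec):
    """One term per table: the table index followed by that table's k bits."""
    terms = set()
    for table_index, table in enumerate(model):
        bits = [
            1 if sum(r * x for r, x in zip(rand_vec, vec)) > 0 else 0
            for rand_vec in table
        ]
        terms.add(bytes([table_index] + bits))
    return terms

SEED, L, K, DIMS = 42, 8, 4, 3

# Index time: hash every stored vector once.
index_model = make_model(L, K, DIMS, SEED)
docs = {"a": [1.0, 0.2, 0.0], "b": [-1.0, -0.2, 0.0], "c": [0.0, 1.0, 0.5]}
indexed_terms = {doc_id: hash_terms(index_model, v) for doc_id, v in docs.items()}

# Query time: rebuild the model from the same seed and hash only the query.
query_model = make_model(L, K, DIMS, SEED)
query = [2 * x for x in docs["a"]]  # same direction as doc "a"
query_terms = hash_terms(query_model, query)

# A "regular old" term match: count shared terms per doc.
matches = {doc_id: len(query_terms & terms) for doc_id, terms in indexed_terms.items()}
```

Doc "a" points in the same direction as the query, so it shares a term in every one of the L tables, while "b" points the opposite way and scores lower.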


alexklibisz (Maintainer) · 2021-05-01T20:29:42Z

Closing this to keep things tidy. LMK if you have any other questions.

