Replies: 2 comments
-
I think your understanding of LSH is correct. Basically, greater L improves recall, possibly at the cost of precision; greater k improves precision, possibly at the cost of recall. The hard part is finding the right values for your data.

Your understanding of the LSH terms is also roughly correct. Elastiknn stores each hash as a byte array at the same field name (the name of the vector). The byte array is prefixed with the index of L, followed by the actual hash value. For L2LSH, the relevant call is here:

Indeed, it might be nice to understand some statistics about the hashes. I don't know of any off-the-shelf tool for this. I wrote some code to parse and analyze a Lucene index a couple of years ago, but never merged it. IIRC, this is the code: https://github.com/alexklibisz/elastiknn/blob/elastiknn-278-lucene-benchmarks/elastiknn-ann-benchmarks/src/main/scala/com/elastiknn/annb/LuceneIndexMetrics.scala

You're welcome to use that as a starting point.
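If it helps, here is a minimal sketch of the idea (not the tool linked above): open one shard's Lucene index directly and build a docFreq histogram over the hash terms. The index path and field name are placeholders, and you should read from a snapshot or a copy rather than an index Elasticsearch currently has open.

```scala
import java.nio.file.Paths
import org.apache.lucene.index.{DirectoryReader, MultiTerms}
import org.apache.lucene.store.FSDirectory

object HashTermStats {
  def main(args: Array[String]): Unit = {
    val indexPath = args(0) // path to one shard's Lucene index directory
    val field     = args(1) // the vector field name, e.g. "my_vec"
    val reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))
    try {
      val terms = Option(MultiTerms.getTerms(reader, field))
        .getOrElse(sys.error(s"No terms found for field $field"))
      val it = terms.iterator()
      val freqs = Iterator
        .continually(it.next())
        .takeWhile(_ != null)
        .map(_ => it.docFreq()) // number of docs sharing this hash term
        .toVector
        .sorted
      println(s"distinct hash terms: ${freqs.size}")
      if (freqs.nonEmpty)
        println(s"docFreq min/median/max: ${freqs.head}/${freqs(freqs.size / 2)}/${freqs.last}")
    } finally reader.close()
  }
}
```

A heavily skewed docFreq distribution (a few hash terms matching a large share of the docs) would be a strong hint that the hash parameters are not discriminating well.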
-
Also, I recommend adding some profiling, or at least some timing logs, to understand where you're spending most of the query time. That's probably a more efficient use of time than investing significant effort in building a new index analysis tool. There was another recent discussion about adding some performance logging: #712. Maybe you can collaborate to propose some debug logs for tracking performance at various sections of the query.
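For illustration, even something this simple is often enough to see where the time goes; the wrapped section names below are hypothetical, not actual Elastiknn internals:

```scala
// Minimal timing helper: wraps a block of code and logs its wall-clock duration.
object Timed {
  def apply[A](label: String)(block: => A): A = {
    val t0 = System.nanoTime()
    try block
    finally println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
  }
}

// Hypothetical usage around the two main phases of an LSH query:
// val candidates = Timed("match-hash-terms") { collectCandidates(query) }
// val results    = Timed("exact-rescore")    { rescoreTopK(candidates) }
```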
-
Hi,
We manage an index with 150M documents, each of them having a couple of embeddings (384 dimensions).
We indexed the data first with L=99 and k=1 (the default values).
But we thought that k=1 implies that a single cutting plane is defined, so roughly half of the documents are selected by each cut. And then there is an OR over the 99 cuts, so almost every document would be selected as a candidate.
Our system is becoming slow, and we are investigating why it is so slow: queries take some seconds to run, and we get timeouts if we run several in parallel. So we are now reindexing the data with new parameters.
After some tests with a smaller set, we decided to use k=10 (so we do an AND of 10 cuts and should select about 10% of the data), and then we use L=116.
With these changes we expected the system to be able to focus a bit more, but execution times do not seem to improve.
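As a sanity check on our reasoning, here is a rough sketch assuming each cut independently selects about half of the documents (which real L2LSH buckets need not do):

```scala
// Expected fraction of the corpus that becomes a candidate:
// one band ANDs k cuts (~ p^k), and the query ORs over L bands.
def candidateFraction(p: Double, k: Int, l: Int): Double =
  1 - math.pow(1 - math.pow(p, k), l)

println(candidateFraction(0.5, 1, 99))   // ~1.0   -> nearly every doc is a candidate
println(candidateFraction(0.5, 10, 116)) // ~0.107 -> roughly 10% of the corpus
```

If that assumption is anywhere near right, each band alone selects ~0.1%, but the OR over 116 bands still admits about 10% of our 150M documents, i.e. roughly 15M candidates per query, which might explain why latency did not improve much.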
We are going to set up the infrastructure to monitor the resources, but we also wanted to check how LSH is behaving. To do so, we wanted to access the fields where you store the LSH hashes, not the embeddings.
I dug into the data a bit and observed that the values are stored as a byte array, where the first byte is the index of the LSH OR combination (related to L), followed by two bytes related to k (8+2 bits).
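For reference, this is how we are decoding the stored terms; it is a sketch based on the layout just described, and the interpretation of the bytes after the first one is our assumption from inspecting the data:

```scala
import org.apache.lucene.util.BytesRef

// Split a stored hash term into (band index, hash value bytes), assuming:
// byte 0 = index of the OR combination (0 until L),
// remaining bytes = the hash value produced by the k cuts.
def decodeHashTerm(term: BytesRef): (Int, Array[Byte]) = {
  val bytes = java.util.Arrays.copyOfRange(term.bytes, term.offset, term.offset + term.length)
  val bandIndex = bytes(0) & 0xff
  val hashValue = bytes.drop(1)
  (bandIndex, hashValue)
}
```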
But, as I already said, we would like to access the data to get statistics on the LSH hashes, for example how many documents are stored under each hash value, to check whether they are evenly distributed or whether there are big concentrations of documents under some hashes while others remain empty. To do this, I wanted to use the ES-termsstat plugin, but it crashes because it cannot derive the real field name where the hashes are stored from the embedding field name.
So, how can I access the Lucene index to get these statistics (value/frequency) for the LSH field?
Which functions can give me the hash field name from the embedding field name?
Thanks