Replies: 2 comments
-
I think your understanding of LSH is correct. Basically, greater L improves recall, possibly at the cost of precision; greater k improves precision, possibly at the cost of recall. The hard part is finding the right values for your data.

Your understanding of the LSH terms is also roughly correct. Elastiknn stores each hash as a byte array at the same field name (the name of the vector). The byte array is prefixed with the index of L, followed by the actual hash value. For L2LSH, the relevant call is here:

Indeed, it might be nice to understand some statistics about the hashes. I don't know of any off-the-shelf tool for this. I wrote some code to parse and analyze a Lucene index a couple of years ago, but never merged it. IIRC, this is the code: https://github.com/alexklibisz/elastiknn/blob/elastiknn-278-lucene-benchmarks/elastiknn-ann-benchmarks/src/main/scala/com/elastiknn/annb/LuceneIndexMetrics.scala

You're welcome to use that as a starting point.
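If it helps, here is a minimal sketch of the idea (not the tool linked above): open one shard's Lucene index directly and build a docFreq histogram over the hash terms. The index path and field name are placeholders, and you should read from a snapshot or a copy rather than an index Elasticsearch currently has open.

```scala
import java.nio.file.Paths
import org.apache.lucene.index.{DirectoryReader, MultiTerms}
import org.apache.lucene.store.FSDirectory

object HashTermStats {
  def main(args: Array[String]): Unit = {
    val indexPath = args(0) // path to one shard's Lucene index directory
    val field     = args(1) // the vector field name, e.g. "my_vec"
    val reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))
    try {
      val terms = Option(MultiTerms.getTerms(reader, field))
        .getOrElse(sys.error(s"No terms found for field $field"))
      val it = terms.iterator()
      val freqs = Iterator
        .continually(it.next())
        .takeWhile(_ != null)
        .map(_ => it.docFreq()) // number of docs sharing this hash term
        .toVector
        .sorted
      println(s"distinct hash terms: ${freqs.size}")
      if (freqs.nonEmpty)
        println(s"docFreq min/median/max: ${freqs.head}/${freqs(freqs.size / 2)}/${freqs.last}")
    } finally reader.close()
  }
}
```

A heavily skewed docFreq distribution (a few hash terms matching a large share of the docs) would be a strong hint that the hash parameters are not discriminating well.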
-
Also, I recommend adding some profiling, or at least some timing logs, to understand where you're spending most of the query time. That's probably a more efficient use of time than investing significant effort in building a new index analysis tool. There was another recent discussion about adding some performance logging: #712. Maybe you can collaborate to propose some debug logs for tracking performance at various sections of the query.
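For illustration, even something this simple is often enough to see where the time goes; the wrapped section names below are hypothetical, not actual Elastiknn internals:

```scala
// Minimal timing helper: wraps a block of code and logs its wall-clock duration.
object Timed {
  def apply[A](label: String)(block: => A): A = {
    val t0 = System.nanoTime()
    try block
    finally println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
  }
}

// Hypothetical usage around the two main phases of an LSH query:
// val candidates = Timed("match-hash-terms") { collectCandidates(query) }
// val results    = Timed("exact-rescore")    { rescoreTopK(candidates) }
```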
-
Hi,
We manage an index with 150M documents, each of them having a couple of embeddings (384 dimensions).
We indexed the data first with L=99 and k=1 (the default values).
But we thought that k=1 implies that a single cutting plane is defined, so roughly half of the documents are selected by each cut. And then there is an OR over the 99 cuts, so almost every document would be selected as a candidate.
Our system is becoming slow, and we are investigating why it is so slow: queries take some seconds to run, and we get timeouts if we run several in parallel. So we are now reindexing the data with new parameters.
After some tests with a smaller set, we decided to use k=10 (so we do an AND of 10 cuts and should select about 10% of the data), and then we use L=116.
With these changes we expected the system to be able to focus a bit more, but execution times do not seem to improve.
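As a sanity check on our reasoning, here is a rough sketch assuming each cut independently selects about half of the documents (which real L2LSH buckets need not do):

```scala
// Expected fraction of the corpus that becomes a candidate:
// one band ANDs k cuts (~ p^k), and the query ORs over L bands.
def candidateFraction(p: Double, k: Int, l: Int): Double =
  1 - math.pow(1 - math.pow(p, k), l)

println(candidateFraction(0.5, 1, 99))   // ~1.0   -> nearly every doc is a candidate
println(candidateFraction(0.5, 10, 116)) // ~0.107 -> roughly 10% of the corpus
```

If that assumption is anywhere near right, each band alone selects ~0.1%, but the OR over 116 bands still admits about 10% of our 150M documents, i.e. roughly 15M candidates per query, which might explain why latency did not improve much.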
We are going to set up the infrastructure to monitor the resources, but we also wanted to check how LSH is behaving. To do so, we wanted to access the fields where you store the LSH hashes, not the embeddings.
I dug into the data a bit and observed that the values are stored as a byte array, where the first byte is the index of the LSH OR combination (related to L), followed by two bytes related to k (8+2 bits).
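For reference, this is how we are decoding the stored terms; it is a sketch based on the layout just described, and the interpretation of the bytes after the first one is our assumption from inspecting the data:

```scala
import org.apache.lucene.util.BytesRef

// Split a stored hash term into (band index, hash value bytes), assuming:
// byte 0 = index of the OR combination (0 until L),
// remaining bytes = the hash value produced by the k cuts.
def decodeHashTerm(term: BytesRef): (Int, Array[Byte]) = {
  val bytes = java.util.Arrays.copyOfRange(term.bytes, term.offset, term.offset + term.length)
  val bandIndex = bytes(0) & 0xff
  val hashValue = bytes.drop(1)
  (bandIndex, hashValue)
}
```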
But, as I already said, we would like to access the data to get statistics on the LSH hashes, for example how many documents are stored under each hash value, to check whether they are evenly distributed or whether there are big concentrations of documents under some hashes while others remain empty. To do this, I wanted to use the ES-termsstat plugin, but it crashes because it cannot derive the real field name where the hashes are stored from the embedding field name.
So, how can I access the Lucene index to get these statistics (value/frequency) for the LSH field?
Which functions can give me the hash field name from the embedding field name?
Thanks