Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces #13872
Conversation
I'll merge to main soon and let tests noodle on this for a few days before backporting to 11.x. It seems benign, but it's easy to make an accidental slip in the code hurricane.
The API does look cleaner, but I am concerned about heap and performance during graph building. `addAndEnsureDiversity` will create many copies. I would expect this PR to create many more `float[dim]` arrays than we did before. Have you done any benchmarking or profiling on this?
```java
return new Bytes() {
  IndexInput input = slice.clone();
  ByteBuffer byteBuffer = ByteBuffer.allocate(byteSize);
  ;
```
Suggested change: remove the stray `;`.
```java
Floats rawVectors = rawVectorValues.vectors();
return new Floats() {
  @Override
  public float[] get(int ord) throws IOException {
    return rawVectors.get(ord);
  }
};
```
Suggested change:

```diff
-Floats rawVectors = rawVectorValues.vectors();
-return new Floats() {
-  @Override
-  public float[] get(int ord) throws IOException {
-    return rawVectors.get(ord);
-  }
-};
+return rawVectorValues.vectors();
```
```java
ByteBuffer byteBuffer = ByteBuffer.allocate(dimension);
byte[] binaryValue = byteBuffer.array();
IndexInput input = slice.clone();
float[] scoreCorrectionConstant = new float[1];
```
All these should be private & final. There are other instances where you do something similar; let's make final the things that can be final, and private the things that can be private.
personally I don't care about making these final - the compiler already ensures that they are or it wouldn't let you use them in a closure like this. As for private, I don't think you can make local variables private, but maybe I am missing something.
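For illustration (this example is mine, not from the PR): a local variable captured by an anonymous class must be effectively final, so the compiler already enforces what an explicit `final` would.

```java
import java.util.function.Supplier;

public class CaptureDemo {
  static Supplier<Integer> makeSupplier() {
    int captured = 42; // effectively final: never reassigned
    // captured = 43;  // uncommenting this would make the capture below a compile error
    return new Supplier<Integer>() {
      @Override
      public Integer get() {
        return captured; // legal only because 'captured' is effectively final
      }
    };
  }

  public static void main(String[] args) {
    System.out.println(makeSupplier().get()); // prints 42
  }
}
```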
```java
byte[] scratch1 = new byte[vectorByteSize];
byte[] scratch2 = new byte[vectorByteSize];
```
Now we allocate scratch even if we don't need it; maybe this isn't that big of a deal? The same goes for all the other MemorySegment scorers: we don't really need the scratch unless a memory segment isn't available.
Yeah, this just seemed cleaner than trying to make that conditional, and my assumption is these scorers are not created that often? Once per search? Although I guess when indexing that could be a lot (once per doc). The challenge here is that `getSegment()` is a member of the Supplier while the Scorers are the ones that should be supplying the scratch data, so we can't easily create scratch lazily. I guess we could create some new abstraction in here to handle that, but it seems kind of messy.

Is there some way to know "up front" whether a MemorySegment is going to be produced? If we knew that, we could allocate scratch space or not based on that knowledge. I have to say I'm a little lost in this java21 MemorySegment code -- maybe @ChrisHegarty will weigh in and explain what the conditions are that lead to segmentSliceOrNull returning null?
We don't know during construction whether or not access to the vector data in the backing segment will always be available. The main reason is that a vector may span multiple memory segments (one MSIndexInput can be made up of several memory segments).

This change is not right. The scratch buffers were created per supplier, since we know from the threading model that that is safe. Creating scratch buffers per scorer will be too expensive.
I have another idea: maybe we just delegate the null cases to the other, on-heap scorer. That might be simpler. We do something similar in the native scorer we have in Elasticsearch. I can see how this looks in the branch, if you like?
I'm not sure I understand your idea, Chris, but if you want to have a go at it, by all means please do, and maybe I'll understand then :)
```java
public RandomVectorScorer scorer(int ord) throws IOException {
  ByteVectorValues.Bytes vectors1 = vectorValues.vectors();
  ByteVectorValues.Bytes vectors2 = vectorValues.vectors();
  return new RandomVectorScorer.AbstractRandomVectorScorer(vectorValues) {
```
I would expect this to create way more garbage during HNSW graph building. The RandomVectorScorerSupplier is passed around to the diversity checking, which means that for each scorer created (and there will likely be many of them for every node we add), we allocate new scratch space. Before, we had a single set of scratch space created just once in the RandomVectorScorerSupplier.

I worry this will have a measurable performance impact and hurt heap usage.
yeah this seems like a bad consequence. Maybe we could switch from a supplier/scorer to a mutable scorer that can be "set" to a new vector as needed?
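A rough sketch of that "mutable scorer" idea (my illustration; `VectorSource`, `MutableDotScorer`, and `setTarget` are hypothetical names, not Lucene API): one scorer per supplier whose target can be re-pointed, so scratch is allocated once and reused across nodes.

```java
import java.io.IOException;

// Hypothetical stand-in for the Floats/Bytes dictionary discussed in this PR.
interface VectorSource {
  float[] get(int ord) throws IOException;
  int dimension();
}

// One instance per supplier; setTarget() re-points it instead of building a
// new scorer (and new scratch arrays) for every node during graph build.
final class MutableDotScorer {
  private final VectorSource vectors;
  private final float[] target; // single scratch buffer, reused across targets

  MutableDotScorer(VectorSource vectors) {
    this.vectors = vectors;
    this.target = new float[vectors.dimension()];
  }

  void setTarget(int targetOrd) throws IOException {
    // copy, because vectors.get() may return a shared/reused array
    System.arraycopy(vectors.get(targetOrd), 0, target, 0, target.length);
  }

  float score(int candidateOrd) throws IOException {
    float[] candidate = vectors.get(candidateOrd);
    float dot = 0;
    for (int i = 0; i < target.length; i++) {
      dot += target[i] * candidate[i];
    }
    return dot;
  }
}
```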
Thanks for the insightful feedback - yeah I had been intending to do perf testing, and then got distracted by fascinating talks and kind of forgot about these concerns! Going through the code adding all these allocations I was kind of thinking most of them would be infrequent, but I agree if we are creating scorers per node that isn't going to be acceptable, so we need to find a way of sharing just enough but not too much. Anyway there's no rush to get this in; I'll take some time to dig in.

hm, there is some functional problem with the change that yields terrible recall for quantized vectors. I'll dig and fix, and see if I can beef up the unit test coverage as well.

This likely means somewhere the scratch space isn't being appropriately handled :/
I find it much cleaner this way too.

Unrelated: we should add support for absolute reads of float[] arrays and implement these dictionaries on top of a RandomAccessInput instead of an IndexInput. (For a follow-up PR.)
```
@@ -892,3 +892,7 @@ segments are rewritten either via `IndexWriter.forceMerge` or

### Vector values APIs switched to primarily random-access

`{Byte/Float}VectorValues` no longer inherit from `DocIdSetIterator`. Rather they extend a common class, `KnnVectorValues`, that provides a random access API (previously provided by `RandomAccessVectorValues`, now removed), and an `iterator()` method for retrieving `DocIndexIterator`: an iterator which is a DISI that also provides an `index()` method. Therefore, any iteration over vector values must now be performed using the values' `iterator()`. Random access works as before, but does not require casting to `RandomAccessVectorValues`.

## Migration from Lucene 10.0 to Lucene 10.1
```
Should it be at the top? This file has most recent versions at the top.
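As an aside, the migration entry quoted above can be illustrated with a short sketch (my own, based on the entry's description of the Lucene 10 API, since this PR itself was not merged):

```java
import java.io.IOException;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.KnnVectorValues;
import org.apache.lucene.search.DocIdSetIterator;

public class IterateVectors {
  // Iterate all vectors via iterator(), then fetch each one by ordinal.
  static void visitAll(FloatVectorValues vectorValues) throws IOException {
    KnnVectorValues.DocIndexIterator it = vectorValues.iterator();
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      float[] vector = vectorValues.vectorValue(it.index()); // random access by ordinal
      // ... use doc and vector
    }
  }
}
```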
With the most recent commit I saw these luceneutil/knnPerfTest.py results:

1. baseline
2. this change with defaults (no command-line flags)
3. this change with the vector API enabled
4. this change with the vector API and enable-native-access

So I think there is some slowdown in the quantized indexing. I think we need to find a solution for the over-allocations due to having moved this logic from ScorerSupplier to Scorer. The best idea I have is to make Scorers mutable and supply them with new target vectors as needed. WDYT?
Can you clarify which allocation is the problematic one, and where it's done on the indexing path?
I think a "merging scorer" would be good. The only place the "scorer supplier" is used is during graph building. My initial concern with a "mutable scorer" is that it would also make the single scorer mutable, which seems weird to me. But I am happy to revisit this, especially since it's blocking a nice refactor. Given that all this random scorer stuff is internal API, we can do whatever is best with what we have.
Yes, OK I now see quite a bit of this is a "preexisting condition" and maybe not exacerbated by this change. We are still creating more scratch arrays than we did before though, I think, because previously we would

Maybe we could add a
Let's defer a double-addressing scorer to a follow-on change, since this is already getting big! I pushed a change to fix the off-heap scorer. The scratch buffers are now deferred to only when needed - only when the slice comes back null, which is rare.
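A minimal sketch of that deferral (my illustration; `directViewOrNull` and `readBytes` are stand-ins for the real MemorySegment plumbing, not Lucene API):

```java
import java.io.IOException;

abstract class LazyScratchReader {
  private final int vectorByteSize;
  private byte[] scratch; // allocated lazily on the rare slow path, then reused

  LazyScratchReader(int vectorByteSize) {
    this.vectorByteSize = vectorByteSize;
  }

  // Stand-in for segmentSliceOrNull: a contiguous view of the vector bytes,
  // or null when the vector spans a segment boundary.
  abstract byte[] directViewOrNull(long offset) throws IOException;

  abstract void readBytes(long offset, byte[] dst) throws IOException;

  byte[] vectorBytes(long offset) throws IOException {
    byte[] direct = directViewOrNull(offset);
    if (direct != null) {
      return direct; // common fast path: no copy, no allocation
    }
    if (scratch == null) {
      scratch = new byte[vectorByteSize]; // first slow-path hit only
    }
    readBytes(offset, scratch);
    return scratch;
  }
}
```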
FYI - I created the following issue to track the possibility of adding a scorer interface that scores one ordinal against another, without the need to create an instance of a scorer: #13966
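Presumably such an interface would have roughly this shape (hypothetical sketch; see #13966 for the actual proposal):

```java
import java.io.IOException;

// Score two vector ordinals directly, with no per-target scorer instance.
interface OrdinalScorer {
  float score(int ordA, int ordB) throws IOException;
}
```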
I re-tested this change, including @ChrisHegarty 's latest, and it still exhibits some bug that causes terrible recall for quantized vectors. I think it is only occurring when concurrent merging is enabled, so I'll dig around in there. Separately, this seems to point to more gaps in our unit test coverage. I'll open a separate issue to enhance RandomCodec so it can explore different KnnVectorsFormat settings (like concurrent merges).
In the process of adding a randomized KnnVectorsFormat to RandomCodec I learned that 4-bit quantization is not supported for odd-length vectors!? This makes the randomized testing quite difficult, since we don't have a good way of ensuring that the 4-bit codec generated randomly by the test framework never comes in contact with the odd-dimensional vectors randomly generated by the test case. I wonder how hard it would be to actually support this?
It requires some logic and changes to optimized scoring. The purpose for preventing it was:
True - maybe we ought to ban it outright? Otherwise this is kind of weird. One thing we could do is pad on the input side. I believe all our similarities would be invariant to adding an extra zero?
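A quick sanity check of that invariance claim (my illustration, not Lucene code): appending a zero component contributes nothing to a dot product, and cosine and Euclidean distance are built from dot products and norms, so they are unchanged too.

```java
public class PadCheck {
  static float dot(float[] a, float[] b) {
    float s = 0;
    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
    return s;
  }

  static float[] padWithZero(float[] v) {
    float[] out = new float[v.length + 1]; // trailing element defaults to 0
    System.arraycopy(v, 0, out, 0, v.length);
    return out;
  }

  public static void main(String[] args) {
    float[] a = {0.1f, -0.3f, 0.7f};
    float[] b = {0.5f, 0.2f, -0.4f};
    // The extra 0*0 term adds nothing, so the scores agree exactly.
    System.out.println(dot(a, b) == dot(padWithZero(a), padWithZero(b))); // true
  }
}
```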
Finally got back to this and fixed the aliasing that was happening. I ran some perf tests and don't see significant variance. Still, it's clear we must be doing a lot more allocations here during indexing; I plan to look at profiler output to see that. But maybe it's captured by escape analysis and we're able to re-use? I was thinking a possible solution to that could be to maintain a pool of Scorers in each ScorerSupplier and reclaim them with a close()? But maybe that can come as a follow-up.

this patch

mainline, Cohere, mip, 1M docs
hmm, that commit is kind of messed up. Maybe I missed a rebase on main somewhere? I will try to clean up here, but it might require a force-push or a new PR, egad
(Force-pushed from c35bc10 to 4284360.)
heap comparison

Here's the output from luceneutil's JFR heap usage summarizer. Clearly a huge amount more allocations for this change.

- float32 mainline
- float32 knn-dictionary
- int7 mainline
- int7 knn-dictionary
- int4 mainline
- int4 knn-dictionary
thanks @msokolov, I'll take another look at why the off-heap scorer is allocating so much.
@ChrisHegarty I think it's expected, since we would previously do the allocation once per KnnVectorValues, but now we are doing it once per RandomVectorScorer. I'm working on adding Closeable to some of these interfaces so that we can pool and reuse these objects. Maybe there's another way?
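One shape the pooling idea could take (hypothetical sketch; `ScorerPool` is not a Lucene class): the supplier keeps a small pool, and a scorer's `close()` returns it.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// A supplier-owned pool: acquire() reuses a previously released instance
// when one exists, otherwise creates a new one.
final class ScorerPool<T> {
  private final ArrayDeque<T> pool = new ArrayDeque<>();
  private final Supplier<T> factory;

  ScorerPool(Supplier<T> factory) {
    this.factory = factory;
  }

  synchronized T acquire() {
    T t = pool.poll();
    return t != null ? t : factory.get();
  }

  // Called from the pooled object's close() to make it reusable.
  synchronized void release(T t) {
    pool.push(t);
  }
}
```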
The threading and reuse model in the current API is subtle, but works well, as it mostly minimises the potential for garbage. Looking at this again, it's almost like anywhere there was a

Looking again in a bit more detail, I dislike that it is not possible to get access to the vector values without always creating at least one slice. Previously this was not necessary. The separation of concerns that
I am now completely confused by this change. And I think that my last commit broke the threading model - even though all tests pass! I'm not saying that we should not do some refactoring here, but I think that the original premise that this change is based upon is not quite right. We have the following use cases:

My understanding is that conflating access and cloning/copying into a single method means that we cannot achieve both of the above use cases in an optimal way.
OK; re allocating vectors in supplier, yes that can't work because of aliasing across threads. I also found we do not adequately test this case. There are many things in here that are not tested; I think we need to add randomization in RandomCodec as we have for other codecs (postings, docvalues) to get better coverage, and we should also explicitly be testing many of the codec options we have added (multithreaded merges, different forms of quantization). Maybe we have coverage for some of these things, but it looks to me as if our format-specific tests just choose default options.
I am coming to a similar conclusion. Although I do think this API change can be made to work if we cache the scorers in the suppliers and return them when done (I was thinking of doing this by making them closeable), this does add a lot of extra bookkeeping and I'm not sure it's going to end up being better in the end. As a side note, I see that IndexInput is Closeable -- but I don't think we ever close() the ones we allocate? Is this a problem?
Not a problem per se. The creator of the scorers needs to ensure that the primary IndexInput is closed; clones do not need to be closed. My understanding is that all these primary index inputs are closed.

I'm not against this, but I remain to be convinced that it is worth it.

++ it's separate, but I completely agree that we need better tests.
The last commit demonstrates re-using the resource-intensive bits (RandomVectorScorer and Floats) for the floating-point case. It brings heap allocations back to being similar to what we see on main.
I recommend turning off whitespace diffs when viewing this; a bunch of it is indentation because of adding try-with-resources blocks
```
@@ -118,6 +123,9 @@ public Floats vectors() {
  public float[] get(int ord) throws IOException {
    return vectors.get(ord);
  }

  @Override
```
oops this is redundant
```java
@Override
public void close() throws IOException {
  rawVectors.close();
```
these changes to the quantized reader/writer might be needed, but not just for the Floats case
Ok, this is great - we correctly understand where the allocation was coming from and have a way to avoid it. I'm still not sure that this refactor is better than the model that we have today, and the need to make the scorer closeable kinda hints towards that. (I'm not against the fact that the scorer is closeable, just that it adds little value to the API and is not enforced - we don't want to enforce its closed status.)
Yeah, I think I will stop working on this. Where I've ended up is with suspicions about the ScorerSupplier/Scorer setup. Because we allocate Scorers a lot, we rely on that being a cheap operation, but ideally we should not have to create an object (actually a factory for creating objects, and then the object) in order to compare two vectors. The VectorValues knows where its vectors are; maybe it ought to do the comparison and worry about whether it needs to copy or can avoid it. Although when we repetitively compare against a known vector, that is one case where it makes sense to me that we'd want to create an object representing that cached value-comparison.
addresses #13831

The basic idea is to move the scratch arrays and cloned IndexInputs (generally, any stateful data) into things returned by KnnVectorValues, so that class itself no longer needs to be cloned in order to get unique sources of vectors (or scorers). ByteVectorValues and FloatVectorValues got a new `vectors()` method (this returns the "dictionary") that supports random access. Also, RandomVectorScorer gets cloned inputs and scratch data when created, rather than relying on getting these from its enclosing values instance.

Naming notes:

The issue calls for a "dictionary" interface, but I found the name a bit confusing, so I undertook the following renaming: the "dictionary" interface is represented by `FloatVectorValues.Floats` and `ByteVectorValues.Bytes` (hearkening back to the RandomAccessVectorValues classes), and these new objects are returned by the new `*VectorValues.vectors()` methods. Where these methods are called, I've changed the names of the variables storing these things to `vectors`. Instances of KnnVectorValues are now mostly stored in variables called `vectorValues`; these were called various things before, including `vectors`, `values`, or `vectorValues`. I left some called `values` since I didn't want to touch any more files.

I've renamed the method `vectorValue(int ord)` to `get(int ord)` since there were entirely too many vectors, values, and vectorValues running around.

I also ensured that `KnnVectorValues.iterator()` always returns a unique instance. Previously we had been caching in a few places and returning a shared instance, which seems like a bug, although I don't think it caused any problems given our usage.

All in all it's a lot of fussy non-functional changes, but I do think the clarity makes it worth doing now after ~5 years of evolution of these APIs.
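For concreteness, a rough sketch of how the dictionary API described above would read (hedged: this PR was not merged, so `vectors()`, `Floats`, and `get()` are taken from the description rather than a released API):

```java
import java.io.IOException;
import org.apache.lucene.index.FloatVectorValues;

public class DictionaryUsage {
  // Each caller obtains its own Floats "dictionary", which carries the
  // cloned IndexInput and any scratch state, so the FloatVectorValues
  // itself never needs to be cloned.
  static float[] firstVector(FloatVectorValues vectorValues) throws IOException {
    FloatVectorValues.Floats vectors = vectorValues.vectors(); // per-caller state
    return vectors.get(0); // random access by ordinal (renamed from vectorValue)
  }
}
```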