-
Notifications
You must be signed in to change notification settings - Fork 140
Bring back the fused graph index #561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… format version to 6 because of new ordering of fused features.
… FusedADC to FusedPQ for clarity. Improve function signature of OnDiskGraphIndex.View.getPackedNeighbors
… additional copy of neighbors array between OnDiskGraphIndex.View and FusedADCPQDecoder.
# Conflicts: # jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java # jvector-base/src/main/java/io/github/jbellis/jvector/graph/ImmutableGraphIndex.java # jvector-base/src/main/java/io/github/jbellis/jvector/graph/OnHeapGraphIndex.java # jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/AbstractGraphIndexWriter.java # jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java # jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskSequentialGraphIndexWriter.java # jvector-examples/src/main/java/io/github/jbellis/jvector/example/Grid.java # jvector-tests/src/test/java/io/github/jbellis/jvector/TestUtil.java # jvector-tests/src/test/java/io/github/jbellis/jvector/quantization/TestADCGraphIndex.java
…s in testRecallOnGraphWithRandomVectors
…ad of Quick(er) ADC.
…ial state. NativeVectorUtilSupport is stripped down almost entirely.
# Conflicts: # jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndex.java
|
Before you submit for review:
If you did not complete any of these, then please explain below. |
…verything works when using the native backend on machines with AVX512.
…12 so that everything works when using the native backend on machines with AVX512.
…a its own binary selector now.
michaeljmarshall
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am posting a partial review with some relatively minor suggestions. I'll revisit later today or tomorrow.
| * Iterates over the neighbors of a given node if they have not been visited yet. | ||
| * For each non-visited neighbor, it computes its similarity and processes it using the given processor. | ||
| */ | ||
| void processNeighbors(int level, int node, ScoreFunction scoreFunction, Function<Integer, Boolean> visited, NeighborProcessor neighborProcessor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we call the visited method on every node, it seems worth avoiding the autoboxing of int and boolean. Let's add an interface similar to NeighborhoodProcessor:
@FunctionalInterface
public interface IntMarker {
boolean mark(int value);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| // we don't use Map features but EnumMap is the best way to make sure we don't | ||
| // accidentally introduce an ordering bug in the future | ||
| final EnumMap<FeatureId, Feature> featureMap; | ||
| final Map<FeatureId, Feature> featureMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the comment out of date or are we at risk by introducing another map type below (LinkedHashMap)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is out of date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| // There should be only one fused feature per node. This is checked in the class constructor. | ||
| for (var feature : inlineFeatures) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a single fusedFeature variable would allow us to skip the iteration here (though I imagine the iteration is extremely fast). The primary benefit is in the class's design explicitly forcing one fused feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linked with this #561 (comment)
| FUSED_ADC(FusedADC::load), | ||
| FUSED_PQ(FusedPQ::load), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a small, probably zero, chance that this feature was used in a previous version of CC/HCD. Because we are repurposing this enum position, I propose that we fail the load if the graphs version is less than 6 so we can fail predictably and the message will be to recreate the graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that deprecate all previous versions? I think I'm missing something in your explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, the fused adc feature was available before on disk format 6. I am proposing that when we load the features from disk, if we have fused pq and version < 6, we throw an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| this.featureMap = features; | ||
| this.inlineFeatures = features.values().stream().filter(f -> !(f instanceof SeparatedFeature)).collect(Collectors.toList()); | ||
|
|
||
| if (version <= 5) { | ||
| // Versions <= 5 use the old feature ordering, simply provided by the FeatureId | ||
| this.featureMap = features; | ||
| this.inlineFeatures = features.values().stream().filter(f -> !(f instanceof SeparatedFeature)).collect(Collectors.toList()); | ||
| } else { | ||
| // Version 6 uses the new feature ordering to place fused features last in the list | ||
| var sortedFeatures = features.values().stream().sorted().collect(Collectors.toList()); | ||
| this.featureMap = new LinkedHashMap<>(); | ||
| for (var feature : sortedFeatures) { | ||
| this.featureMap.put(feature.id(), feature); | ||
| } | ||
| this.inlineFeatures = sortedFeatures.stream().filter(f -> !(f instanceof SeparatedFeature)).sorted().collect(Collectors.toList()); | ||
| } | ||
|
|
||
| if (this.inlineFeatures.stream().filter(Feature::isFused).count() > 1) { | ||
| throw new IllegalArgumentException("At most one fused feature is allowed"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could reduce the complexity here by removing the fusedFeature from the inlineFeatures list? There are a couple places we might still need the group of fused and inline features, but there are also places where we access the inline features only to get the fused feature.
What you have is already correct. My question really revolves around whether we need to modify the on disk format for the header, how we represent/access the features in this class, and if we can have just one way to order the features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question. Let me think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is only one place in the code where we need that unique fusedFeature (when in AbstractGraphIndexWriter.writeSparseLevels). I refactored the function so that the iteration happens in an outer loop and we create a local fusedFeature variable. I also added a comment saying that this local variable should be turned into a class member if needed elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually, I like the idea of introducing an ordering of the features. Of course, this is coupled with how things are handled in OnDiskGraphIndex.View.getPackedNeighbors, for example. I see that this coupling is somewhat unfortunate and it looks like the immediate solution is to not have that intrinsic ordering and just hardcode it in AbstractGraphIndexWriter. I am somewhat torn between a design that has more potential but may not be fully realized and one that is more constrained.
jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndex.java
Show resolved
Hide resolved
| /** For layers > 0, store adjacency fully in memory. */ | ||
| // For layers > 0, store adjacency fully in memory. | ||
| private final AtomicReference<List<Int2ObjectHashMap<int[]>>> inMemoryNeighbors; | ||
| // When using fused features, store the features fully in memory for layers > 0 | ||
| private final AtomicReference<Int2ObjectHashMap<FusedFeature.InlineSource>> inMemoryFeatures; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that these objects are not counted in the graph's ramBytesUsed() computations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is linked with #561 (comment). In order to give a correct estimation of the heap usage, we need to either load these structures at initialization or get the information from the header. I think it just males sense to to the former and solve this once and for all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| EnumSet.of(FeatureId.NVQ_VECTORS), | ||
| // EnumSet.of(FeatureId.NVQ_VECTORS, FeatureId.FUSED_ADC), | ||
| // EnumSet.of(FeatureId.NVQ_VECTORS, FeatureId.FUSED_PQ), | ||
| EnumSet.of(FeatureId.INLINE_VECTORS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to uncomment this now that the feature is implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should in reality deprecate Bench altogether. BenchYAML has become a much more handy point of entry. But I agree, I will uncomment this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
…o that ramBytesUsed can be computed.
|
|
||
| // In V6, fused features for the in-memory hierarchy are written in a block after the top layers of the graph. | ||
| // Since everything in level 1 is also contained in the high-levels, we only need to write the fused features for level 1. | ||
| if (version == 6) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this strictly version == 6 or should it be version >= 6 like in another places in the code ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a good observation. The rule I followed is that subsequent versions might change this behavior anyway, so there's no benefit in expressing expectations over future versions in the code.
| if (common.version >= 3) { | ||
| out.writeInt(FeatureId.serialize(EnumSet.copyOf(features.keySet()))); | ||
| } | ||
| if (common.version >= 6) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about creating a class with constants for all the version ?
so that we can document there which features are available in which version, like we do for SAI:
https://github.com/datastax/cassandra/blob/82064a0255715c23dec03c5a94be122ac1ddc09e/src/java/org/apache/cassandra/index/sai/disk/format/Version.java#L57-L77
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I fully get the benefit of such a change in JVector. Looks like we are dealing with a less combinatorial problem here. Can elaborate on the potential benefits of this approach? I'm open to consider it.
This PR does extensive work to bring back the Fused Graph Index (FGI). In a non-fused graph, the PQ codebook of each vector in the index is stored in memory. The memory complexity is the linear in the number of vectors. FGI reduces significantly the amount of heap memory used during search by offloading the PQ codebooks to storage. These PQ codebooks are packed and stored in-line with the graph, to avoid runtime overheads resulting from this offload.
The memory complexity has two cases now:
These savings come with a very moderate slowdown (reduction in throughput and increase in latency) of about 15%. See the results below for an example.
In this version (and in past versions), FGI only works with PQ through the FUSED_PQ feature. This feature used to be called FUSED_ADC, but to highlight the link with PQ, it has been renamed.
The routine for expanding a node (gathering its out-neighbors and computing their similarities to the query), has been pushed down to the GraphIndex views. This enables having slightly different algorithms depending on the graph layout that may be a little bit more efficient than if abstracted away in the GraphSearcher.
This PR refactors the use of SIMD instructions by FUSED PQ:
These SIMD changes have opened the possibility of deprecating the native vector util backend. Not effecting this deprecation in this PR because there might be another considerations to keep it around.
Edits:
Experimental results:
Dataset: ada002-100k
Configuration:
M : 32
usePruning : true
neighborOverflow : 1.2
addHierarchy : true
efConstruction : 100
Results with topK=10
With a non-fused graph:
With a fused graph:
With the fused graph, the number of queries per second (QPS) is slowed down by less than 15% with an average of 14% and the latency by less than 17% with an average of 15%.
Results with topK=100
With a non-fused graph:
With a fused graph:
With the fused graph, the number of queries per second (QPS) is slowed down by 19% and 16% (overquery=1 and 2, respectively) with an average and the latency by 13% and 8% (overquery=1 and 2, respectively).
Experimental results on larger datasets
In the plots below, QPS, latency, and recall are stable (there's run-to-run variability that is intrinsic to the benchmark). Index construction time increased a bit by the process of fusing the graph on disk, which involves multiple random memory accesses for each node, and writing more to disk.