
Conversation

@marianotepper
Contributor

@marianotepper marianotepper commented Nov 3, 2025

This PR does extensive work to bring back the Fused Graph Index (FGI). In a non-fused graph, the PQ codebook of each vector in the index is stored in memory, so the memory complexity is linear in the number of vectors. FGI significantly reduces the amount of heap memory used during search by offloading the PQ codebooks to storage. These PQ codebooks are packed and stored inline with the graph to avoid the runtime overhead that would otherwise result from this offload.

The memory complexity now has two cases:

  • When using a non-hierarchical graph, the fused graph reduces the linear memory complexity to a small constant (the number of vectors in the graph does not change this constant).
  • When using a hierarchical graph, the upper layers of the hierarchy are kept in memory and the bottom layer is in storage. The PQ codebooks of the vectors in the upper layers are kept in memory. The bottom layer behaves exactly like a non-hierarchical graph. Since the upper graph layers are sampled using a logarithmic distribution, we end up with logarithmic memory complexity.
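To make the hierarchical case concrete, here is a small, self-contained Java sketch (not JVector code; the constant 32 and the HNSW-style level formula are assumptions for illustration) estimating what fraction of nodes lives above the base layer and therefore keeps its PQ codes on-heap:

```java
import java.util.Random;

// Toy illustration: with geometric level assignment, only a small fraction of
// nodes appears above the base layer, so the fused hierarchical graph keeps
// very few PQ codes on-heap relative to the total number of vectors.
public class HierarchySamplingDemo {
    public static void main(String[] args) {
        int n = 1_000_000;
        double mL = 1.0 / Math.log(32);  // typical level multiplier for M = 32
        Random rng = new Random(42);
        long upperLayerNodes = 0;
        for (int i = 0; i < n; i++) {
            int level = (int) Math.floor(-Math.log(rng.nextDouble()) * mL);
            if (level > 0) upperLayerNodes++;  // kept in memory by the fused graph
        }
        System.out.printf("nodes above base layer: %d (%.2f%% of %d)%n",
                upperLayerNodes, 100.0 * upperLayerNodes / n, n);
    }
}
```

With these assumed parameters, roughly 1/32 of the nodes land above the base layer; the exact numbers here are for intuition only.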

These savings come with a very moderate slowdown (reduction in throughput and increase in latency) of about 15%. See the results below for an example.

In this version (and in past versions), FGI only works with PQ through the FUSED_PQ feature. This feature used to be called FUSED_ADC; it has been renamed to highlight the link with PQ.

The routine for expanding a node (gathering its out-neighbors and computing their similarities to the query) has been pushed down to the GraphIndex views. This enables slightly different algorithms depending on the graph layout, which can be a bit more efficient than a single implementation abstracted away in the GraphSearcher.
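As a rough illustration of this push-down (all names below are simplified stand-ins, not JVector's actual interfaces), a view-level expansion routine might look like:

```java
import java.util.function.Function;

// Hedged sketch: the expansion routine lives in the view, so each graph layout
// can score neighbors directly from its own representation instead of going
// through a generic searcher-side loop.
public class ViewExpansionSketch {
    interface ScoreFunction { float similarityTo(int node); }
    interface NeighborProcessor { void process(int neighbor, float score); }

    // A minimal in-memory "view" over an adjacency array.
    static class SimpleView {
        final int[][] adjacency;
        SimpleView(int[][] adjacency) { this.adjacency = adjacency; }

        void processNeighbors(int node, ScoreFunction sf,
                              Function<Integer, Boolean> visited,
                              NeighborProcessor processor) {
            for (int neighbor : adjacency[node]) {
                if (visited.apply(neighbor)) continue;  // skip already-scored nodes
                processor.process(neighbor, sf.similarityTo(neighbor));
            }
        }
    }

    public static void main(String[] args) {
        int[][] adj = { {1, 2}, {0, 2}, {0, 1} };
        SimpleView view = new SimpleView(adj);
        boolean[] seen = new boolean[3];
        seen[2] = true;  // pretend node 2 was already visited
        view.processNeighbors(0, n -> 1.0f / (1 + n), n -> seen[n],
                (n, s) -> System.out.printf("neighbor %d score %.3f%n", n, s));
    }
}
```

An on-disk view could implement the same method by reading packed neighbors and fused PQ codes sequentially, which is the point of pushing the routine down.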

This PR refactors the use of SIMD instructions by FUSED PQ:

  • The old algorithm used a transposed layout similar to Quick(er)-ADC. However, that design performed the SIMD parallelization not within each codebook but across different codebooks (hence the transpose). This parallelization was virtually impossible to combine with skipping the computation for previously computed similarities, and it implied additional computational overhead.
  • An analysis of the number of similarity computations that can be skipped showed that about 50% are skipped. Thus, the new algorithm simplifies the layout by storing the vectors in a non-transposed fashion, with no across-codebook parallelization. This makes it compatible with the visited checks. Additionally, it is more efficient than the non-fused approach because we have improved the locality of the PQ codebooks gathered when expanding a given graph node (in the non-fused graph, this required random accesses to multiple rows of a very tall and skinny matrix, which exhibits poor locality).
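The layout change can be sketched with a scalar version of asymmetric-distance scoring (illustrative only: the real implementation uses SIMD over packed on-disk codes, and all names here are made up):

```java
// Illustrative sketch of the non-transposed fused-PQ scoring path. Each
// neighbor's PQ codes are stored contiguously (row per neighbor), so scoring
// is one sequential scan per subspace, and a visited neighbor can simply be
// skipped -- which the across-codebook (transposed) layout could not do.
public class FusedPqScoringSketch {
    public static void main(String[] args) {
        int subspaces = 4, clusters = 256;
        // Precomputed query-to-centroid partial similarities: one lookup table
        // per subspace (standard ADC); the values here are arbitrary.
        float[][] lut = new float[subspaces][clusters];
        for (int m = 0; m < subspaces; m++)
            for (int c = 0; c < clusters; c++)
                lut[m][c] = (float) (m + 1) / (c + 1);

        // PQ codes for 3 neighbors, laid out non-transposed (row per neighbor).
        byte[][] codes = { {0, 1, 2, 3}, {3, 2, 1, 0}, {1, 1, 1, 1} };
        boolean[] visited = { false, true, false };  // neighbor 1 already scored

        for (int n = 0; n < codes.length; n++) {
            if (visited[n]) continue;  // cheap skip, enabled by this layout
            float sim = 0;
            for (int m = 0; m < subspaces; m++)
                sim += lut[m][codes[n][m] & 0xFF];
            System.out.printf("neighbor %d similarity %.3f%n", n, sim);
        }
    }
}
```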

These SIMD changes open the possibility of deprecating the native vector-util backend. We are not effecting this deprecation in this PR because there may be other considerations for keeping it around.

Edits:

  • To enable the FUSED_PQ feature, we introduced the new version 6 file format for our graph indices.

Experimental results:

Dataset: ada002-100k
Configuration:
M : 32
usePruning : true
neighborOverflow : 1.2
addHierarchy : true
efConstruction : 100

Results with topK=10

With a non-fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@10   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         16118.8           285.2        1.8         0.456                0.049               0.770                307.4          12.7                       0.67        
2.00         14100.7           19.5         0.1         0.507                0.064               0.802                421.6          22.9                       0.85        
5.00         11307.2           260.8        2.3         0.640                0.111               1.093                702.5          52.9                       0.94        
10.00        8335.0            88.8         1.1         0.849                0.184               1.587                1108.2         102.1                      0.97        

With a fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@10   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         13768.9           972.6        7.1         0.533                0.045               0.909                307.4          12.7                       0.67        
2.00         12384.5           582.9        4.7         0.586                0.057               0.966                421.6          22.9                       0.85        
5.00         9609.4            151.8        1.6         0.735                0.098               1.254                702.5          52.9                       0.94        
10.00        7135.4            162.5        2.3         0.955                0.163               1.603                1108.2         102.1                      0.97        

With the fused graph, the number of queries per second (QPS) drops by less than 15% (14% on average) and latency increases by less than 17% (15% on average).

Results with topK=100

With a non-fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@100   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         8314.4            307.9        3.7         0.862                0.187               1.570                1108.2         102.1                      0.78         
2.00         5415.1            17.0         0.3         1.294                0.302               2.293                1871.1         198.0                      0.93         

With a fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@100   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         6744.6            303.1        4.5         0.975                0.172               1.800                1108.2         102.1                      0.78         
2.00         4550.6            113.0        2.5         1.401                0.263               2.445                1871.1         198.0                      0.93         

With the fused graph, the number of queries per second (QPS) drops by 19% and 16% (overquery = 1 and 2, respectively) and latency increases by 13% and 8% (overquery = 1 and 2, respectively).

Experimental results on larger datasets

In the plots below, QPS, latency, and recall are stable (there is run-to-run variability that is intrinsic to the benchmark). Index construction time increased somewhat due to the process of fusing the graph on disk, which involves multiple random memory accesses per node and writes more data to disk.

[Plots: Index Construction Time, Mean Latency, QPS, Recall@10]

… format version to 6 because of new ordering of fused features.
… FusedADC to FusedPQ for clarity. Improve function signature of OnDiskGraphIndex.View.getPackedNeighbors
… additional copy of neighbors array between OnDiskGraphIndex.View and FusedADCPQDecoder.
# Conflicts:
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/ImmutableGraphIndex.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/OnHeapGraphIndex.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/AbstractGraphIndexWriter.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskSequentialGraphIndexWriter.java
#	jvector-examples/src/main/java/io/github/jbellis/jvector/example/Grid.java
#	jvector-tests/src/test/java/io/github/jbellis/jvector/TestUtil.java
#	jvector-tests/src/test/java/io/github/jbellis/jvector/quantization/TestADCGraphIndex.java
@github-actions
Contributor

github-actions bot commented Nov 3, 2025

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@marianotepper marianotepper self-assigned this Nov 3, 2025
@marianotepper marianotepper marked this pull request as ready for review November 5, 2025 13:38
@marianotepper marianotepper requested a review from jkni November 5, 2025 14:02
Member

@michaeljmarshall michaeljmarshall left a comment

I am posting a partial review with some relatively minor suggestions. I'll revisit later today or tomorrow.

/**
 * Iterates over the neighbors of a given node if they have not been visited yet.
 * For each non-visited neighbor, it computes its similarity and processes it using the given processor.
 */
void processNeighbors(int level, int node, ScoreFunction scoreFunction, Function<Integer, Boolean> visited, NeighborProcessor neighborProcessor);
Member

Since we call the visited method on every node, it seems worth avoiding the autoboxing of int and boolean. Let's add an interface similar to NeighborhoodProcessor:

@FunctionalInterface
public interface IntMarker {
    boolean mark(int value);
}

Contributor Author

Done!

Comment on lines 41 to 52
// we don't use Map features but EnumMap is the best way to make sure we don't
// accidentally introduce an ordering bug in the future
final EnumMap<FeatureId, Feature> featureMap;
final Map<FeatureId, Feature> featureMap;
Member

Is the comment out of date or are we at risk by introducing another map type below (LinkedHashMap)?

Contributor Author

It is out of date.

Contributor Author

Done!

Comment on lines 259 to 260
// There should be only one fused feature per node. This is checked in the class constructor.
for (var feature : inlineFeatures) {
Member

Having a single fusedFeature variable would allow us to skip the iteration here (though I imagine the iteration is extremely fast). The primary benefit is in the class's design explicitly forcing one fused feature.

Contributor Author

Linked with this #561 (comment)

Comment on lines -35 to +33
FUSED_ADC(FusedADC::load),
FUSED_PQ(FusedPQ::load),
Member

There is a small, probably zero, chance that this feature was used in a previous version of CC/HCD. Because we are repurposing this enum position, I propose that we fail the load if the graph's version is less than 6, so we can fail predictably and the message will be to recreate the graph.

Contributor Author

Wouldn't that deprecate all previous versions? I think I'm missing something in your explanation.

Member

IIUC, the fused ADC feature was available before on-disk format 6. I am proposing that when we load the features from disk, if we have fused PQ and version < 6, we throw an exception.
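A minimal sketch of the proposed guard (class, enum, and method names below are hypothetical, not JVector's actual code):

```java
// Hedged sketch: reject FUSED_PQ on pre-6 on-disk formats, since that enum
// slot previously meant FUSED_ADC, and fail with an actionable message.
public class FeatureVersionGuard {
    enum FeatureId { INLINE_VECTORS, FUSED_PQ }

    static void checkFeature(FeatureId id, int version) {
        if (id == FeatureId.FUSED_PQ && version < 6) {
            throw new IllegalStateException(
                    "FUSED_PQ requires on-disk format version >= 6; " +
                    "found version " + version + ". Please rebuild the graph index.");
        }
    }

    public static void main(String[] args) {
        checkFeature(FeatureId.FUSED_PQ, 6);  // accepted
        try {
            checkFeature(FeatureId.FUSED_PQ, 5);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```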

Contributor Author

Done!

Comment on lines 63 to 89
        this.featureMap = features;
        this.inlineFeatures = features.values().stream().filter(f -> !(f instanceof SeparatedFeature)).collect(Collectors.toList());

        if (version <= 5) {
            // Versions <= 5 use the old feature ordering, simply provided by the FeatureId
            this.featureMap = features;
            this.inlineFeatures = features.values().stream().filter(f -> !(f instanceof SeparatedFeature)).collect(Collectors.toList());
        } else {
            // Version 6 uses the new feature ordering to place fused features last in the list
            var sortedFeatures = features.values().stream().sorted().collect(Collectors.toList());
            this.featureMap = new LinkedHashMap<>();
            for (var feature : sortedFeatures) {
                this.featureMap.put(feature.id(), feature);
            }
            this.inlineFeatures = sortedFeatures.stream().filter(f -> !(f instanceof SeparatedFeature)).sorted().collect(Collectors.toList());
        }

        if (this.inlineFeatures.stream().filter(Feature::isFused).count() > 1) {
            throw new IllegalArgumentException("At most one fused feature is allowed");
        }
Member

Could we reduce the complexity here by removing the fusedFeature from the inlineFeatures list? There are a couple of places where we might still need the group of fused and inline features together, but there are also places where we access the inline features only to get the fused feature.

What you have is already correct. My question really revolves around whether we need to modify the on-disk format for the header, how we represent/access the features in this class, and whether we can have just one way to order the features.

Contributor Author

This is a good question. Let me think about it.

Contributor Author

@marianotepper marianotepper Nov 6, 2025

There is only one place in the code where we need that unique fusedFeature (when in AbstractGraphIndexWriter.writeSparseLevels). I refactored the function so that the iteration happens in an outer loop and we create a local fusedFeature variable. I also added a comment saying that this local variable should be turned into a class member if needed elsewhere.

Contributor Author

@marianotepper marianotepper Nov 6, 2025

Conceptually, I like the idea of introducing an ordering of the features. Of course, this is coupled with how things are handled in OnDiskGraphIndex.View.getPackedNeighbors, for example. I see that this coupling is somewhat unfortunate and it looks like the immediate solution is to not have that intrinsic ordering and just hardcode it in AbstractGraphIndexWriter. I am somewhat torn between a design that has more potential but may not be fully realized and one that is more constrained.

Comment on lines -83 to +90
/** For layers > 0, store adjacency fully in memory. */
// For layers > 0, store adjacency fully in memory.
private final AtomicReference<List<Int2ObjectHashMap<int[]>>> inMemoryNeighbors;
// When using fused features, store the features fully in memory for layers > 0
private final AtomicReference<Int2ObjectHashMap<FusedFeature.InlineSource>> inMemoryFeatures;
Member

It appears that these objects are not counted in the graph's ramBytesUsed() computations.
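For illustration, a hedged sketch of the kind of accounting that would need to be added (the overhead constants below are rough placeholders, not the values a real estimator such as Lucene's RamUsageEstimator would report):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: the in-memory upper-layer adjacency (and, analogously, the
// in-memory fused features) should contribute to ramBytesUsed() alongside the
// rest of the graph's structures.
public class RamBytesSketch {
    static final long REF_BYTES = 8, INT_BYTES = 4, MAP_ENTRY_OVERHEAD = 48;

    // Rough per-layer estimate: map-entry overhead plus the neighbor arrays.
    static long adjacencyBytes(Map<Integer, int[]> layer) {
        long bytes = 0;
        for (int[] neighbors : layer.values())
            bytes += MAP_ENTRY_OVERHEAD + REF_BYTES + (long) neighbors.length * INT_BYTES;
        return bytes;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> layer1 = new HashMap<>();
        layer1.put(0, new int[]{1, 2, 3});
        layer1.put(7, new int[]{0});
        System.out.println("upper-layer adjacency bytes ~= " + adjacencyBytes(layer1));
    }
}
```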

Contributor Author

Good catch!

Contributor Author

This comment is linked with #561 (comment). In order to give a correct estimation of the heap usage, we need to either load these structures at initialization or get the information from the header. I think it just makes sense to do the former and solve this once and for all.

Contributor Author

Done!

EnumSet.of(FeatureId.NVQ_VECTORS),
// EnumSet.of(FeatureId.NVQ_VECTORS, FeatureId.FUSED_ADC),
// EnumSet.of(FeatureId.NVQ_VECTORS, FeatureId.FUSED_PQ),
EnumSet.of(FeatureId.INLINE_VECTORS)
Member

Do we want to uncomment this now that the feature is implemented?

Contributor Author

In reality, we should deprecate Bench altogether; BenchYAML has become a much handier point of entry. But I agree, I will uncomment this.

Contributor Author

Done!


// In V6, fused features for the in-memory hierarchy are written in a block after the top layers of the graph.
// Since everything in the upper levels is also contained in level 1, we only need to write the fused features for level 1.
if (version == 6) {

Is this strictly version == 6, or should it be version >= 6 as in other places in the code?

Contributor Author

@marianotepper marianotepper Nov 6, 2025

It is a good observation. The rule I followed is that subsequent versions might change this behavior anyway, so there's no benefit in expressing expectations over future versions in the code.

if (common.version >= 3) {
out.writeInt(FeatureId.serialize(EnumSet.copyOf(features.keySet())));
}
if (common.version >= 6) {

What about creating a class with constants for all the versions, so that we can document which features are available in which version, as we do for SAI:
https://github.com/datastax/cassandra/blob/82064a0255715c23dec03c5a94be122ac1ddc09e/src/java/org/apache/cassandra/index/sai/disk/format/Version.java#L57-L77
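For concreteness, a hedged sketch of what such a constants class could look like in JVector (names and version semantics are illustrative, based only on what this PR describes: feature flags in the header since v3, fused PQ and the new feature ordering since v6):

```java
// Hedged sketch: one place to document which on-disk format version introduced
// which capability, instead of scattered numeric comparisons.
public class GraphFormatVersion {
    public static final int V3_FEATURE_FLAGS = 3;  // feature-id bitset in header
    public static final int V6_FUSED_PQ = 6;       // fused PQ + new feature ordering

    public static boolean supportsFeatureFlags(int version) {
        return version >= V3_FEATURE_FLAGS;
    }

    public static boolean supportsFusedPq(int version) {
        return version >= V6_FUSED_PQ;
    }

    public static void main(String[] args) {
        System.out.println("v5 fused PQ: " + supportsFusedPq(5));
        System.out.println("v6 fused PQ: " + supportsFusedPq(6));
    }
}
```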

Contributor Author

I'm not sure I fully get the benefit of such a change in JVector. It looks like we are dealing with a less combinatorial problem here. Can you elaborate on the potential benefits of this approach? I'm open to considering it.
