
Conversation

@marianotepper
Contributor

@marianotepper marianotepper commented Nov 3, 2025

This PR does extensive work to bring back the Fused Graph Index (FGI). In a non-fused graph, the PQ codebook of each vector in the index is stored in memory, so the memory complexity is linear in the number of vectors. FGI significantly reduces the amount of heap memory used during search by offloading the PQ codebooks to storage. These PQ codebooks are packed and stored inline with the graph to avoid the runtime overhead that would otherwise result from this offload.

The memory complexity now has two cases:

  • When using a non-hierarchical graph, the fused graph reduces the linear memory complexity to a small constant (the number of vectors in the graph does not change this constant).
  • When using a hierarchical graph, the upper layers of the hierarchy are kept in memory and the bottom layer is in storage. The PQ codebooks of the vectors in the upper layers are kept in memory. The bottom layer behaves exactly like a non-hierarchical graph. Since the upper graph layers are sampled using a logarithmic distribution, we end up with logarithmic memory complexity.
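To make the hierarchical case concrete, here is a small, self-contained Java sketch (not JVector code; the constant 32 and the HNSW-style level formula are assumptions for illustration) estimating what fraction of nodes lives above the base layer and therefore keeps its PQ codes on-heap:

```java
import java.util.Random;

// Toy illustration: with geometric level assignment, only a small fraction of
// nodes appears above the base layer, so the fused hierarchical graph keeps
// very few PQ codes on-heap relative to the total number of vectors.
public class HierarchySamplingDemo {
    public static void main(String[] args) {
        int n = 1_000_000;
        double mL = 1.0 / Math.log(32);  // typical level multiplier for M = 32
        Random rng = new Random(42);
        long upperLayerNodes = 0;
        for (int i = 0; i < n; i++) {
            int level = (int) Math.floor(-Math.log(rng.nextDouble()) * mL);
            if (level > 0) upperLayerNodes++;  // kept in memory by the fused graph
        }
        System.out.printf("nodes above base layer: %d (%.2f%% of %d)%n",
                upperLayerNodes, 100.0 * upperLayerNodes / n, n);
    }
}
```

With these assumed parameters, roughly 1/32 of the nodes land above the base layer; the exact numbers here are for intuition only.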

These savings come with a very moderate slowdown (reduction in throughput and increase in latency) of about 15%. See the results below for an example.

In this version (and in past versions), FGI only works with PQ through the FUSED_PQ feature. This feature used to be called FUSED_ADC; it has been renamed to highlight the link with PQ.

The routine for expanding a node (gathering its out-neighbors and computing their similarities to the query) has been pushed down to the GraphIndex views. This enables slightly different algorithms depending on the graph layout, which can be a bit more efficient than a single implementation abstracted away in the GraphSearcher.
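As a rough illustration of this push-down (all names below are simplified stand-ins, not JVector's actual interfaces), a view-level expansion routine might look like:

```java
import java.util.function.Function;

// Hedged sketch: the expansion routine lives in the view, so each graph layout
// can score neighbors directly from its own representation instead of going
// through a generic searcher-side loop.
public class ViewExpansionSketch {
    interface ScoreFunction { float similarityTo(int node); }
    interface NeighborProcessor { void process(int neighbor, float score); }

    // A minimal in-memory "view" over an adjacency array.
    static class SimpleView {
        final int[][] adjacency;
        SimpleView(int[][] adjacency) { this.adjacency = adjacency; }

        void processNeighbors(int node, ScoreFunction sf,
                              Function<Integer, Boolean> visited,
                              NeighborProcessor processor) {
            for (int neighbor : adjacency[node]) {
                if (visited.apply(neighbor)) continue;  // skip already-scored nodes
                processor.process(neighbor, sf.similarityTo(neighbor));
            }
        }
    }

    public static void main(String[] args) {
        int[][] adj = { {1, 2}, {0, 2}, {0, 1} };
        SimpleView view = new SimpleView(adj);
        boolean[] seen = new boolean[3];
        seen[2] = true;  // pretend node 2 was already visited
        view.processNeighbors(0, n -> 1.0f / (1 + n), n -> seen[n],
                (n, s) -> System.out.printf("neighbor %d score %.3f%n", n, s));
    }
}
```

An on-disk view could implement the same method by reading packed neighbors and fused PQ codes sequentially, which is the point of pushing the routine down.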

This PR refactors the use of SIMD instructions by FUSED PQ:

  • The old algorithm used a transposed layout similar to Quick(er)-ADC. However, that design performed the SIMD parallelization not within each codebook but across different codebooks (hence the transpose). This parallelization was virtually impossible to combine with skipping the computation for previously computed similarities, and it implied additional computational overhead.
  • An analysis of the number of similarity computations that can be skipped showed that about 50% are skipped. Thus, the new algorithm simplifies the layout by storing the vectors in a non-transposed fashion, with no across-codebook parallelization. This makes it compatible with the visited checks. Additionally, it is more efficient than the non-fused approach because we have improved the locality of the PQ codebooks gathered when expanding a given graph node (in the non-fused graph, this required random accesses to multiple rows of a very tall and skinny matrix, which exhibits poor locality).
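The layout change can be sketched with a scalar version of asymmetric-distance scoring (illustrative only: the real implementation uses SIMD over packed on-disk codes, and all names here are made up):

```java
// Illustrative sketch of the non-transposed fused-PQ scoring path. Each
// neighbor's PQ codes are stored contiguously (row per neighbor), so scoring
// is one sequential scan per subspace, and a visited neighbor can simply be
// skipped -- which the across-codebook (transposed) layout could not do.
public class FusedPqScoringSketch {
    public static void main(String[] args) {
        int subspaces = 4, clusters = 256;
        // Precomputed query-to-centroid partial similarities: one lookup table
        // per subspace (standard ADC); the values here are arbitrary.
        float[][] lut = new float[subspaces][clusters];
        for (int m = 0; m < subspaces; m++)
            for (int c = 0; c < clusters; c++)
                lut[m][c] = (float) (m + 1) / (c + 1);

        // PQ codes for 3 neighbors, laid out non-transposed (row per neighbor).
        byte[][] codes = { {0, 1, 2, 3}, {3, 2, 1, 0}, {1, 1, 1, 1} };
        boolean[] visited = { false, true, false };  // neighbor 1 already scored

        for (int n = 0; n < codes.length; n++) {
            if (visited[n]) continue;  // cheap skip, enabled by this layout
            float sim = 0;
            for (int m = 0; m < subspaces; m++)
                sim += lut[m][codes[n][m] & 0xFF];
            System.out.printf("neighbor %d similarity %.3f%n", n, sim);
        }
    }
}
```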

These SIMD changes open the possibility of deprecating the native vector-util backend. We are not effecting this deprecation in this PR because there may be other considerations for keeping it around.

Edits:

  • To enable the FUSED_PQ feature, we introduced the new version 6 file format for our graph indices.

Experimental results:

Dataset: ada002-100k
Configuration:
M : 32
usePruning : true
neighborOverflow : 1.2
addHierarchy : true
efConstruction : 100

Results with topK=10

With a non-fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@10   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         16118.8           285.2        1.8         0.456                0.049               0.770                307.4          12.7                       0.67        
2.00         14100.7           19.5         0.1         0.507                0.064               0.802                421.6          22.9                       0.85        
5.00         11307.2           260.8        2.3         0.640                0.111               1.093                702.5          52.9                       0.94        
10.00        8335.0            88.8         1.1         0.849                0.184               1.587                1108.2         102.1                      0.97        

With a fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@10   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         13768.9           972.6        7.1         0.533                0.045               0.909                307.4          12.7                       0.67        
2.00         12384.5           582.9        4.7         0.586                0.057               0.966                421.6          22.9                       0.85        
5.00         9609.4            151.8        1.6         0.735                0.098               1.254                702.5          52.9                       0.94        
10.00        7135.4            162.5        2.3         0.955                0.163               1.603                1108.2         102.1                      0.97        

With the fused graph, the number of queries per second (QPS) drops by less than 15% (14% on average) and latency increases by less than 17% (15% on average).

Results with topK=100

With a non-fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@100   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         8314.4            307.9        3.7         0.862                0.187               1.570                1108.2         102.1                      0.78         
2.00         5415.1            17.0         0.3         1.294                0.302               2.293                1871.1         198.0                      0.93         

With a fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@100   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         6744.6            303.1        4.5         0.975                0.172               1.800                1108.2         102.1                      0.78         
2.00         4550.6            113.0        2.5         1.401                0.263               2.445                1871.1         198.0                      0.93         

With the fused graph, the number of queries per second (QPS) drops by 19% and 16% (overquery = 1 and 2, respectively) and latency increases by 13% and 8% (overquery = 1 and 2, respectively).

Experimental results on larger datasets

In the plots below, QPS, latency, and recall are stable (there is run-to-run variability that is intrinsic to the benchmark). Index construction time increased somewhat due to the process of fusing the graph on disk, which involves multiple random memory accesses per node and writes more data to disk.

[Plots: Index Construction Time, Mean Latency, QPS, Recall@10]

… format version to 6 because of new ordering of fused features.
… FusedADC to FusedPQ for clarity. Improve function signature of OnDiskGraphIndex.View.getPackedNeighbors
… additional copy of neighbors array between OnDiskGraphIndex.View and FusedADCPQDecoder.
# Conflicts:
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/ImmutableGraphIndex.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/OnHeapGraphIndex.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/AbstractGraphIndexWriter.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskSequentialGraphIndexWriter.java
#	jvector-examples/src/main/java/io/github/jbellis/jvector/example/Grid.java
#	jvector-tests/src/test/java/io/github/jbellis/jvector/TestUtil.java
#	jvector-tests/src/test/java/io/github/jbellis/jvector/quantization/TestADCGraphIndex.java
@github-actions
Contributor

github-actions bot commented Nov 3, 2025

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@marianotepper marianotepper self-assigned this Nov 3, 2025
@marianotepper marianotepper marked this pull request as ready for review November 5, 2025 13:38
@marianotepper marianotepper requested a review from jkni November 5, 2025 14:02
Member

@michaeljmarshall michaeljmarshall left a comment

I am posting a partial review with some relatively minor suggestions. I'll revisit later today or tomorrow.

/**
 * Iterates over the neighbors of a given node if they have not been visited yet.
 * For each non-visited neighbor, it computes its similarity and processes it using the given processor.
 */
void processNeighbors(int level, int node, ScoreFunction scoreFunction, Function<Integer, Boolean> visited, NeighborProcessor neighborProcessor);
Member

Since we call the visited method on every node, it seems worth avoiding the autoboxing of int and boolean. Let's add an interface similar to NeighborhoodProcessor:

@FunctionalInterface
public interface IntMarker {
    boolean mark(int value);
}

Contributor Author

Done!

Comment on lines 41 to 52
// we don't use Map features but EnumMap is the best way to make sure we don't
// accidentally introduce an ordering bug in the future
final EnumMap<FeatureId, Feature> featureMap;
final Map<FeatureId, Feature> featureMap;
Member

Is the comment out of date or are we at risk by introducing another map type below (LinkedHashMap)?

Contributor Author

It is out of date.

Contributor Author

Done!

Comment on lines 259 to 260
// There should be only one fused feature per node. This is checked in the class constructor.
for (var feature : inlineFeatures) {
Member

Having a single fusedFeature variable would allow us to skip the iteration here (though I imagine the iteration is extremely fast). The primary benefit is in the class's design explicitly forcing one fused feature.

Contributor Author

Linked with this #561 (comment)

Comment on lines -35 to +33
FUSED_ADC(FusedADC::load),
FUSED_PQ(FusedPQ::load),
Member

There is a small, probably zero, chance that this feature was used in a previous version of CC/HCD. Because we are repurposing this enum position, I propose that we fail the load if the graph's version is less than 6, so we can fail predictably and the message will be to recreate the graph.

Contributor Author

Wouldn't that deprecate all previous versions? I think I'm missing something in your explanation.

Member

IIUC, the fused ADC feature was available before on-disk format 6. I am proposing that when we load the features from disk, if we have fused PQ and version < 6, we throw an exception.
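A minimal sketch of the proposed guard (class, enum, and method names below are hypothetical, not JVector's actual code):

```java
// Hedged sketch: reject FUSED_PQ on pre-6 on-disk formats, since that enum
// slot previously meant FUSED_ADC, and fail with an actionable message.
public class FeatureVersionGuard {
    enum FeatureId { INLINE_VECTORS, FUSED_PQ }

    static void checkFeature(FeatureId id, int version) {
        if (id == FeatureId.FUSED_PQ && version < 6) {
            throw new IllegalStateException(
                    "FUSED_PQ requires on-disk format version >= 6; " +
                    "found version " + version + ". Please rebuild the graph index.");
        }
    }

    public static void main(String[] args) {
        checkFeature(FeatureId.FUSED_PQ, 6);  // accepted
        try {
            checkFeature(FeatureId.FUSED_PQ, 5);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```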

Contributor Author

Done!

Comment on lines 63 to 89
        this.featureMap = features;
        this.inlineFeatures = features.values().stream().filter(f -> !(f instanceof SeparatedFeature)).collect(Collectors.toList());

        if (version <= 5) {
            // Versions <= 5 use the old feature ordering, simply provided by the FeatureId
            this.featureMap = features;
            this.inlineFeatures = features.values().stream().filter(f -> !(f instanceof SeparatedFeature)).collect(Collectors.toList());
        } else {
            // Version 6 uses the new feature ordering to place fused features last in the list
            var sortedFeatures = features.values().stream().sorted().collect(Collectors.toList());
            this.featureMap = new LinkedHashMap<>();
            for (var feature : sortedFeatures) {
                this.featureMap.put(feature.id(), feature);
            }
            this.inlineFeatures = sortedFeatures.stream().filter(f -> !(f instanceof SeparatedFeature)).sorted().collect(Collectors.toList());
        }

        if (this.inlineFeatures.stream().filter(Feature::isFused).count() > 1) {
            throw new IllegalArgumentException("At most one fused feature is allowed");
        }
Member

Could we reduce the complexity here by removing the fusedFeature from the inlineFeatures list? There are a couple of places where we might still need the group of fused and inline features together, but there are also places where we access the inline features only to get the fused feature.

What you have is already correct. My question really revolves around whether we need to modify the on-disk format for the header, how we represent/access the features in this class, and whether we can have just one way to order the features.

Contributor Author

This is a good question. Let me think about it.

Contributor Author

@marianotepper marianotepper Nov 6, 2025

There is only one place in the code where we need that unique fusedFeature (when in AbstractGraphIndexWriter.writeSparseLevels). I refactored the function so that the iteration happens in an outer loop and we create a local fusedFeature variable. I also added a comment saying that this local variable should be turned into a class member if needed elsewhere.

Contributor Author

@marianotepper marianotepper Nov 6, 2025

Conceptually, I like the idea of introducing an ordering of the features. Of course, this is coupled with how things are handled in OnDiskGraphIndex.View.getPackedNeighbors, for example. I see that this coupling is somewhat unfortunate and it looks like the immediate solution is to not have that intrinsic ordering and just hardcode it in AbstractGraphIndexWriter. I am somewhat torn between a design that has more potential but may not be fully realized and one that is more constrained.

Comment on lines -83 to +90
/** For layers > 0, store adjacency fully in memory. */
// For layers > 0, store adjacency fully in memory.
private final AtomicReference<List<Int2ObjectHashMap<int[]>>> inMemoryNeighbors;
// When using fused features, store the features fully in memory for layers > 0
private final AtomicReference<Int2ObjectHashMap<FusedFeature.InlineSource>> inMemoryFeatures;
Member

It appears that these objects are not counted in the graph's ramBytesUsed() computations.
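For illustration, a hedged sketch of the kind of accounting that would need to be added (the overhead constants below are rough placeholders, not the values a real estimator such as Lucene's RamUsageEstimator would report):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: the in-memory upper-layer adjacency (and, analogously, the
// in-memory fused features) should contribute to ramBytesUsed() alongside the
// rest of the graph's structures.
public class RamBytesSketch {
    static final long REF_BYTES = 8, INT_BYTES = 4, MAP_ENTRY_OVERHEAD = 48;

    // Rough per-layer estimate: map-entry overhead plus the neighbor arrays.
    static long adjacencyBytes(Map<Integer, int[]> layer) {
        long bytes = 0;
        for (int[] neighbors : layer.values())
            bytes += MAP_ENTRY_OVERHEAD + REF_BYTES + (long) neighbors.length * INT_BYTES;
        return bytes;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> layer1 = new HashMap<>();
        layer1.put(0, new int[]{1, 2, 3});
        layer1.put(7, new int[]{0});
        System.out.println("upper-layer adjacency bytes ~= " + adjacencyBytes(layer1));
    }
}
```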

Contributor Author

Good catch!

Contributor Author

This comment is linked with #561 (comment). In order to give a correct estimation of the heap usage, we need to either load these structures at initialization or get the information from the header. I think it just makes sense to do the former and solve this once and for all.

Contributor Author

Done!

EnumSet.of(FeatureId.NVQ_VECTORS),
// EnumSet.of(FeatureId.NVQ_VECTORS, FeatureId.FUSED_ADC),
// EnumSet.of(FeatureId.NVQ_VECTORS, FeatureId.FUSED_PQ),
EnumSet.of(FeatureId.INLINE_VECTORS)
Member

Do we want to uncomment this now that the feature is implemented?

Contributor Author

In reality, we should deprecate Bench altogether; BenchYAML has become a much handier point of entry. But I agree, I will uncomment this.

Contributor Author

Done!


// In V6, fused features for the in-memory hierarchy are written in a block after the top layers of the graph.
// Since everything in the upper levels is also contained in level 1, we only need to write the fused features for level 1.
if (version == 6) {

Is this strictly version == 6, or should it be version >= 6 as in other places in the code?

Contributor Author

@marianotepper marianotepper Nov 6, 2025

It is a good observation. The rule I followed is that subsequent versions might change this behavior anyway, so there's no benefit in expressing expectations over future versions in the code.

if (common.version >= 3) {
out.writeInt(FeatureId.serialize(EnumSet.copyOf(features.keySet())));
}
if (common.version >= 6) {

What about creating a class with constants for all the versions, so that we can document which features are available in which version, as we do for SAI:
https://github.com/datastax/cassandra/blob/82064a0255715c23dec03c5a94be122ac1ddc09e/src/java/org/apache/cassandra/index/sai/disk/format/Version.java#L57-L77
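For concreteness, a hedged sketch of what such a constants class could look like in JVector (names and version semantics are illustrative, based only on what this PR describes: feature flags in the header since v3, fused PQ and the new feature ordering since v6):

```java
// Hedged sketch: one place to document which on-disk format version introduced
// which capability, instead of scattered numeric comparisons.
public class GraphFormatVersion {
    public static final int V3_FEATURE_FLAGS = 3;  // feature-id bitset in header
    public static final int V6_FUSED_PQ = 6;       // fused PQ + new feature ordering

    public static boolean supportsFeatureFlags(int version) {
        return version >= V3_FEATURE_FLAGS;
    }

    public static boolean supportsFusedPq(int version) {
        return version >= V6_FUSED_PQ;
    }

    public static void main(String[] args) {
        System.out.println("v5 fused PQ: " + supportsFusedPq(5));
        System.out.println("v6 fused PQ: " + supportsFusedPq(6));
    }
}
```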

Contributor Author

I'm not sure I fully get the benefit of such a change in JVector. It looks like we are dealing with a less combinatorial problem here. Can you elaborate on the potential benefits of this approach? I'm open to considering it.
