Add HNSW support to the localhost API interface #2691

vincent-4 · 2025-01-23T18:42:35Z

Addresses #2688

Keep BM25 as base case, treat HNSW as extension to maintain backward compatibility

Example usage:
curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator'

That said, would a different API querying method be better? We specify the encoder and query generator as request parameters, and without these additional params (user-specified encoder and query generator) it won't work. Existing usage is unchanged, though; e.g.,
curl -X GET "http://localhost:8081/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start" is fine.
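
For callers that build the request in code rather than with curl, here is a minimal sketch of assembling the same search URL. The class and method names (`SearchUrlSketch`, `searchUri`) are made up for illustration and are not part of Anserini; only the path layout and parameter names come from the curl example above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Anserini.
final class SearchUrlSketch {
  // URLEncoder emits '+' for spaces; swap in %20 to match the curl example.
  // A literal '+' in the input is already encoded as %2B, so the swap is safe.
  static String encode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
  }

  // Builds <host>/api/v1.0/indexes/<index>/search?<params>, keeping parameter order.
  static URI searchUri(String host, String index, Map<String, String> params) {
    String query = params.entrySet().stream()
        .map(e -> e.getKey() + "=" + encode(e.getValue()))
        .collect(Collectors.joining("&"));
    return URI.create(host + "/api/v1.0/indexes/" + index + "/search?" + query);
  }
}
```

Passing a `LinkedHashMap` keeps `query`, `hits`, `efSearch`, and so on in the order they were added, which makes the resulting URL easy to compare against the curl form.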

refactor method

lintool · 2025-01-23T18:44:57Z

src/main/java/io/anserini/server/ControllerV1_0.java

-      if (encoder == null || queryGenerator == null) {
-        throw new IllegalArgumentException("HNSW indexes require both 'encoder' and 'queryGenerator' parameters");
-      }
+    if (index.contains(".hnsw") && (encoder == null || queryGenerator == null)) {


Why don't we add a new field in the Enum in IndexInfo so we don't have to do this janky check?

Indexes need to be paired with generators and encoders, right? so put that in IndexInfo?

Why don't we add a new field in the Enum in IndexInfo so we don't have to do this janky check?

For each enum? Or something different

Sure, why not? Add another field in the enum denoting indexing type, the encoder, and the query generator?
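
To make the suggestion concrete, here is a rough sketch of what tagging each enum entry could look like, so the controller asks the enum instead of checking `index.contains(".hnsw")`. Everything here is a stand-in: `IndexInfoSketch`, `IndexTypeSketch`, the constant names, and the field layout are assumptions for illustration, and the real `IndexInfo` carries more metadata than shown.

```java
// Hypothetical index kinds; the actual constant names were still under discussion.
enum IndexTypeSketch { INVERTED, DENSE_FLAT, DENSE_HNSW }

// Trimmed-down stand-in for IndexInfo with the three proposed extra fields.
enum IndexInfoSketch {
  MSMARCO_V1_PASSAGE("msmarco-v1-passage", IndexTypeSketch.INVERTED, null, null),
  MSMARCO_V1_PASSAGE_BGE_HNSW("msmarco-v1-passage.bge-base-en-v1.5.hnsw",
      IndexTypeSketch.DENSE_HNSW, "BgeBaseEn15", "VectorQueryGenerator");

  final String indexName;
  final IndexTypeSketch type;
  final String encoder;         // null when no query encoder applies
  final String queryGenerator;  // null when no special generator applies

  IndexInfoSketch(String indexName, IndexTypeSketch type, String encoder, String queryGenerator) {
    this.indexName = indexName;
    this.type = type;
    this.encoder = encoder;
    this.queryGenerator = queryGenerator;
  }

  // Looks up an entry by its index name; unknown names are rejected.
  static IndexInfoSketch fromName(String name) {
    for (IndexInfoSketch info : values()) {
      if (info.indexName.equals(name)) {
        return info;
      }
    }
    throw new IllegalArgumentException("Unknown index: " + name);
  }

  // Replaces the string check on the index name.
  boolean isHnsw() {
    return type == IndexTypeSketch.DENSE_HNSW;
  }
}
```

With something like this, the encoder and query generator come from the entry itself, so the request would not need to carry them.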

lintool · 2025-01-23T18:51:38Z

Another thought: maybe have defaults hard-coded in IndexInfo, and then use api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/setting?param1=value1&param2=value2 to set?

Something like that?

vincent-4 · 2025-01-23T21:04:03Z

Another thought: maybe have defaults hard-coded in IndexInfo, and then use api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/setting?param1=value1&param2=value2 to set?

Something like that?

I guess we could have it per-request or make it index-level (persistent). Does the usual use case fit the first or the second approach better? We could also support both, but that would be more complex.

lintool · 2025-01-23T21:15:38Z

per-request or make it index-level (persistent)

Should be persistent.

…r override
- Remove redundant document content handling from HNSW searches (decided not to return doc contents, since we would have to ‘load twice’ in a sense).
- Simplify HNSW search to only return docids and scores (above)
- Add GET/POST /indexes/{index}/settings endpoints for managing search parameters
- Add parameter override storage with a fallback chain (request → override → default)
- Need to finish tagging IndexInfo
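
The fallback chain in that commit message (request → override → default) could be sketched roughly as follows. This is not the code in the PR: `SettingsSketch`, `setOverride`, and `resolve` are hypothetical names, and the values used with it are only examples.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-index settings holder, not the PR's actual implementation.
final class SettingsSketch {
  private final Map<String, String> defaults;                               // hard-coded defaults
  private final Map<String, String> overrides = new ConcurrentHashMap<>();  // persistent overrides

  SettingsSketch(Map<String, String> defaults) {
    this.defaults = Map.copyOf(defaults);
  }

  // Stores an index-level override, as a settings endpoint might.
  void setOverride(String name, String value) {
    overrides.put(name, value);
  }

  // Resolution order: request value, then stored override, then default.
  Optional<String> resolve(String name, String requestValue) {
    if (requestValue != null) {
      return Optional.of(requestValue);
    }
    String override = overrides.get(name);
    if (override != null) {
      return Optional.of(override);
    }
    return Optional.ofNullable(defaults.get(name));
  }
}
```

An empty `Optional` means the parameter has no value anywhere in the chain, which the caller can turn into a validation error.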

vincent-4 · 2025-01-28T02:25:51Z

@lintool I took down getting the actual doc contents, since we would have to 'load twice' in a sense– want it back?

Also, updated IndexInfo mappings, but there's quite a few left to do (appreciate input on if I'm screwing up this approach).

lintool · 2025-01-28T02:29:35Z

src/main/java/io/anserini/index/IndexInfo.java

@@ -25,7 +25,14 @@ public enum IndexInfo {
      "BM25",
      new String[] {
          "https://github.com/castorini/anserini-data/raw/master/CACM/lucene-index.cacm.20221005.252b5e.tar.gz" },
+<<<<<<< Updated upstream


lintool · 2025-01-28T02:30:34Z

src/main/java/io/anserini/index/IndexInfo.java

      "cfe14d543c6a27f4d742fb2d0099b8e0"),
+=======
+      "cfe14d543c6a27f4d742fb2d0099b8e0",
+      IndexType.bm25,


IndexType.INVERTED -> inverted index is what allows BM25, and tfidf, etc.

IIRC, all caps is the Java convention?

lintool · 2025-01-28T02:30:55Z

src/main/java/io/anserini/index/IndexInfo.java

@@ -46,7 +56,10 @@ public enum IndexInfo {
      "SPLADE++ EnsembleDistil",
      new String[] {
          "https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v1-passage.splade-pp-ed.20230524.a59610.tar.gz" },
-      "2c008fc36131e27966a72292932358e6"),
+      "2c008fc36131e27966a72292932358e6",
+      IndexType.flat,


IndexType.DENSE_FLAT?

lintool · 2025-01-28T02:31:08Z

src/main/java/io/anserini/index/IndexInfo.java

      "df4c60fa1f3804fa409499824d12d035"),
+=======
+      "df4c60fa1f3804fa409499824d12d035",
+      IndexType.hnsw,


IndexType.DENSE_HNSW?

lintool · 2025-01-28T02:32:15Z

@lintool I took down getting the actual doc contents, since we would have to 'load twice' in a sense– want it back?

Sorry, don't understand what this means?

lintool · 2025-01-28T02:35:03Z

re: fetching documents - only the inverted indexes store documents.

If I try to fetch doc from an HNSW index, then it should just "re-route" to the inverted index. We can add "pairings" in IndexInfo - e.g., which inverted index goes with which HNSW index.
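
A minimal sketch of that re-routing idea, assuming a simple lookup table from HNSW index name to its paired inverted index. The table, the class name `DocRoutingSketch`, and the specific pairing shown are illustrative assumptions; in the PR this information would live in IndexInfo.

```java
import java.util.Map;

// Hypothetical routing helper; the real pairing would be a field in IndexInfo.
final class DocRoutingSketch {
  // Assumed pairing: HNSW index name -> inverted index that stores the documents.
  private static final Map<String, String> INVERTED_FOR = Map.of(
      "msmarco-v1-passage.bge-base-en-v1.5.hnsw", "msmarco-v1-passage");

  // Returns the index to read documents from; indexes without a pairing serve themselves.
  static String indexForDocuments(String requestedIndex) {
    return INVERTED_FOR.getOrDefault(requestedIndex, requestedIndex);
  }
}
```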

vincent-4 · 2025-01-28T02:52:52Z

@lintool I took down getting the actual doc contents, since we would have to 'load twice' in a sense– want it back?

Sorry, don't understand what this means?

When we make a request, we get this:

curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator'
{"query":{"text":"How does the process of digestion and metabolism of carbohydrates start","qid":""},"candidates":[{"docid":"4812922","score":0.9180973,"doc":null},{"docid":"5721972","score":0.9178153,"doc":null},{"docid":"4517519","score":0.91594195,"doc":null},{"docid":"5918340","score":0.9146468,"doc":null},{"docid":"2969775","score":0.91335136,"doc":null},{"docid":"7494976","score":0.91238195,"doc":null},{"docid":"5164081","score":0.9123745,"doc":null},{"docid":"3145583","score":0.91142595,"doc":null},{"docid":"8699089","score":0.9110645,"doc":null},{"docid":"4953055","score":0.9097354,"doc":null}]}

I omit returning "doc" (which normally has contents with non-HNSW). But I get your latest re:fetching documents comment. Thanks!

lintool · 2025-01-28T02:57:42Z

curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator'

I'm thinking: encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator - these aren't needed? As in, they can't be anything else... if you have a bge index, you must use the bge encoder, or you'll get gibberish?

vincent-4 · 2025-01-28T03:00:33Z

curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator'
I'm thinking: encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator - these aren't needed? As in, they can't be anything else... if you have a bge index, you must use the bge encoder, or you'll get gibberish?

Yeah, that was because:
Prior to adding the tagging, using it with no params is fine. I just had them specified because they weren't defined in the enums yet. IndexInfo won't compile just yet because of the WIP state, but not having to specify them is the end goal.

vincent-4 · 2025-01-28T03:08:14Z

Q:

Is there a straightforward way I can check the naming patterns/mappings? I’m only certain about the most commonly used ones (for instance, bge, since I've used it before), but I can't verify the rest.

lintool · 2025-01-28T11:34:04Z

Q:

Is there a straightforward way I can check the naming patterns/mappings? I’m only certain about the most commonly used ones (for instance, bge, since I've used it before), but I can't verify the rest.

Look in IndexInfo, which contains the ground truth metadata. We've tried to name consistently, but it's evolved over the years. The intention was for the UI to fetch and display the metadata from IndexInfo - to present in a more readable form (and allow user actions like download, etc.).

vincent-4 · 2025-01-28T15:40:13Z

Right, I'm adding ~4 fields to each entry in IndexInfo:

Blocking Q: Is 'beir-v1.0.0-scifact' a 'correct' format for the invertedIndex field name? Or something different, since it isn't nice to fix once done. Another idea is to write it as beir-v1.0.0-scifact.flat and similar for later entries

Also attached the new list of fields for convenience:

vincent-4 · 2025-01-29T04:57:57Z

Updated IndexInfo (not really worth reviewing)– will update later w/ better search that uses the new enum fields

- Simplify IndexInfo by removing redundant getter methods
- Remove debug logging from search implementation
- Clean up parameter handling in HNSW search
- Remove redundant index entries

indexinfo indexinfo 2 indexinfo 3

- Add try-with-resources for proper resource cleanup
- Fix generic type specification in HnswDenseSearcher
- Remove unnecessary document retrieval restriction for HNSW

- Extract initialization logic into separate method
- Add parameter validation methods
- Improve resource handling with try-with-resources
- Remove unnecessary imports and cleanup code structure

- Reorder methods for better logical grouping
- Fix indentation and formatting issues
- Move field declarations to top of class

BREAKING CHANGES:
- Remove ability to search without specifying an index
- Change error handling from RuntimeException to IllegalArgumentException for consistency with Anserini patterns

Other changes:
- Add validation for encoder and queryGenerator settings
- Remove default constants for query generator and encoder
- Update tests to reflect new error handling and required index parameter

vincent-4 · 2025-01-29T08:33:01Z

On the latest test failure: Error: PrebuiltIndexTest.testNumPrebuiltIndexes:63 expected:<166> but was:<169>
I reverted the changes though? Weird.

Otherwise this is ready for review, will add some more tests

lintool

Thoughts?

lintool · 2025-01-29T12:25:03Z

src/main/java/io/anserini/index/IndexInfo.java

-      "2c008fc36131e27966a72292932358e6"),
+      "2c008fc36131e27966a72292932358e6",
+      IndexType.SPLADE_PP_ED,
+      "SpladePlusPlusEnsembleDistil",


If we changed to SpladePlusPlusEnsembleDistilEncoder.class we can reference an actual class as opposed to string, then empty changes can be nulls. I think I like that better?

(And same for everything below?)
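
A small sketch of the class-reference idea, using stand-in classes so it compiles without Anserini on the classpath. `BgeBaseEn15EncoderStub`, `VectorQueryGeneratorStub`, and `EncoderBindingSketch` are placeholders, not the real Anserini classes or the final IndexInfo layout.

```java
// Stand-ins for the real encoder and query generator classes.
class BgeBaseEn15EncoderStub {}
class VectorQueryGeneratorStub {}

// Hypothetical enum holding Class references instead of strings.
enum EncoderBindingSketch {
  BM25_INDEX(null, null),  // nothing to encode, so both fields are null
  BGE_HNSW_INDEX(BgeBaseEn15EncoderStub.class, VectorQueryGeneratorStub.class);

  final Class<?> encoder;
  final Class<?> queryGenerator;

  EncoderBindingSketch(Class<?> encoder, Class<?> queryGenerator) {
    this.encoder = encoder;
    this.queryGenerator = queryGenerator;
  }

  // Instantiates the encoder via its no-arg constructor, or returns null if none is set.
  Object newEncoder() {
    if (encoder == null) {
      return null;
    }
    try {
      return encoder.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot instantiate " + encoder.getName(), e);
    }
  }
}
```

One upside of a class reference over a string is that a misspelled or removed encoder fails at compile time rather than at request time.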

lintool · 2025-01-29T12:27:17Z

src/main/java/io/anserini/server/ControllerV1_0.java


 import org.springframework.web.bind.annotation.PathVariable;
 import org.springframework.web.bind.annotation.RequestParam;
 import org.springframework.web.bind.annotation.RequestMapping;
 import org.springframework.web.bind.annotation.RequestMethod;
 import org.springframework.web.bind.annotation.RestController;
+import org.springframework.http.HttpStatus;
+import org.springframework.web.bind.annotation.ResponseStatus;
+import org.springframework.web.bind.annotation.ExceptionHandler;

 @RestController
 @RequestMapping(path = "/api/v1.0")


Should we increment the API version to, say, v1.1?

(okay old routes are removed, but let's make it explicit?)

vincent-4 · 2025-01-29T14:57:42Z

src/main/java/io/anserini/index/IndexInfo.java

-      "c7294ca988ae1b812d427362ffca1ee2"),
+      "c7294ca988ae1b812d427362ffca1ee2",
+      IndexType.DENSE_HNSW,
+      "CohereEmbedEnglishV30",


Btw, what should be its encoder? I take it literally from the name right above as:

"Lucene quantized (int8) HNSW index of the MS MARCO V1 passage corpus encoded by Cohere embed-english-v3.0.",

But we don't have anything like it? I looked in io/anserini/encoder

Oh, this doesn't have an encoder... you have to call the Cohere API... so should be null for now and probably not searchable.

- Update API version to v1.1 to reflect breaking changes
- Standardize error handling style (single line for simple checks)
- Clean up code formatting and indentation
- Use LinkedHashMap for index info to hold the new fields, as Map.of reached its maximum number of entries
- Change IndexInfo names to use .class suffix; null values specifically in Encoder and QueryGenerator fields. Style: Inverted Index field still has empty strings
- Simplify conditional assignments with single line statements
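
For context on the LinkedHashMap change: the fixed-arity `Map.of` overloads stop at ten key-value pairs and reject null values, so a response with more fields (some of them null) needs a different map. A minimal sketch, with illustrative field names that are not necessarily the keys the API returns:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical builder for the per-index info payload.
final class IndexInfoMapSketch {
  // Eleven entries: one more than the fixed-arity Map.of overloads accept.
  static Map<String, Object> describe(String name, String indexType, String encoder) {
    Map<String, Object> info = new LinkedHashMap<>();  // keeps insertion order in the output
    info.put("indexName", name);
    info.put("description", "");
    info.put("filename", "");
    info.put("readme", "");
    info.put("corpus", "");
    info.put("model", "");
    info.put("urls", "");
    info.put("md5", "");
    info.put("indexType", indexType);
    info.put("encoder", encoder);       // may be null; Map.of would throw NullPointerException
    info.put("queryGenerator", null);   // null is allowed in a LinkedHashMap
    return info;
  }
}
```

`Map.ofEntries` would also lift the ten-pair limit, but it still rejects nulls and does not guarantee iteration order.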

- Remove redundant null check for index parameter since @PathVariable(required = true) handles it
- Fix document endpoint to use cached SearchService instead of creating new instances
- Add index existence check before document retrieval
- Add HNSW index type validation for document retrieval
- Standardize error handling format across endpoints

…n-existent Cohere encoder

codecov · 2025-01-29T16:34:43Z

Codecov Report

Attention: Patch coverage is 40.27778% with 86 lines in your changes missing coverage. Please review.

Project coverage is 66.71%. Comparing base (df900e8) to head (3922dbe).

Files with missing lines	Patch %	Lines
...rc/main/java/io/anserini/server/SearchService.java	37.86%	55 Missing and 9 partials ⚠️
...c/main/java/io/anserini/server/ControllerV1_0.java	46.34%	22 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #2691      +/-   ##
============================================
- Coverage     67.04%   66.71%   -0.33%     
- Complexity     1185     1189       +4     
============================================
  Files           183      183              
  Lines         11342    11455     +113     
  Branches       1372     1395      +23     
============================================
+ Hits           7604     7642      +38     
- Misses         3232     3300      +68     
- Partials        506      513       +7


…ervice tests

lintool

lgtm

vincent-4 added 4 commits January 23, 2025 11:49

use primitive

f77f1bc

refactor method

checks

1ed836e

don't require parms

37597ff

lintool reviewed Jan 23, 2025

vincent-4 added 2 commits January 27, 2025 21:20

use indexInfo settings

4cbf066

lintool reviewed Jan 28, 2025

indexinfo

75ba6e0

vincent-4 added 5 commits January 29, 2025 02:59

fix and simplify: Improve resource handling and type safety

7279255

refactors: Improve code organization and error handling

b545008

re-order: Improve code readability and organization

c71251f

vincent-4 changed the title ~~hnsw endpoint added~~ [ignore this commit... ffixed indexinfo deletions] Jan 29, 2025

un-remove the 3 removed indexinfo entries -- accident

6bfb91b

vincent-4 force-pushed the hnsw-api-fix branch from 46e62b0 to 6bfb91b Compare January 29, 2025 08:21

vincent-4 marked this pull request as ready for review January 29, 2025 08:33

vincent-4 changed the title ~~[ignore this commit... ffixed indexinfo deletions]~~ Add HNSW support to the localhost API interface Jan 29, 2025

lintool requested changes Jan 29, 2025

vincent-4 commented Jan 29, 2025

vincent-4 added 4 commits January 29, 2025 10:19

refactor(api): improve encoder/generator class handling and remove no…

273eb06

…n-existent Cohere encoder

Undo change to testNumPrebuiltIndexes - 169 as it was before

3922dbe

refactor(server): rename ControllerV1_0 to Controller and add SearchS…

f0aaf5a

…ervice tests

lintool approved these changes Jan 29, 2025

lintool merged commit 25b8f95 into castorini:master Jan 29, 2025
1 check passed
