Add HNSW support to the localhost API interface #2691

Merged
merged 18 commits into from
Jan 29, 2025

Conversation

vincent-4
Contributor

Addresses #2688

  • Keep BM25 as base case, treat HNSW as extension to maintain backward compatibility

Example usage:
curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator'

Would a different API querying method be better, though? We specify the encoder and query generator in the request; without these additional user-specified parameters, it won't work. Existing usage is unchanged, though; e.g.,
curl -X GET "http://localhost:8081/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start" is fine.
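For callers who build these requests programmatically, here is a minimal sketch of assembling the same URL in Java. The class and method names are hypothetical and not part of Anserini; only the path layout and the `query`, `hits`, and `efSearch` parameters are taken from the curl examples above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Anserini: builds the search URL shown in the curl examples.
final class SearchUrlBuilder {
  private SearchUrlBuilder() {}

  // Percent-encodes a query-string value. URLEncoder emits '+' for spaces,
  // so '+' is rewritten to "%20" to match the curl examples (a literal '+' is already "%2B").
  static String encode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
  }

  // hits and efSearch are optional; pass null to leave them out of the request.
  static URI searchUri(String host, String index, String query, Integer hits, Integer efSearch) {
    StringBuilder sb = new StringBuilder(host)
        .append("/api/v1.0/indexes/").append(index)
        .append("/search?query=").append(encode(query));
    if (hits != null) sb.append("&hits=").append(hits);
    if (efSearch != null) sb.append("&efSearch=").append(efSearch);
    return URI.create(sb.toString());
  }
}
```

Passing `null` for both optional parameters yields the plain `.../search?query=...` form, which is the existing usage that stays unchanged.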

refactor method
if (encoder == null || queryGenerator == null) {
throw new IllegalArgumentException("HNSW indexes require both 'encoder' and 'queryGenerator' parameters");
}
if (index.contains(".hnsw") && (encoder == null || queryGenerator == null)) {
Member

Why don't we add a new field in the Enum in IndexInfo so we don't have to do this janky check?

Member

Indexes need to be paired with generators and encoders, right? so put that in IndexInfo?

Contributor Author

Why don't we add a new field in the Enum in IndexInfo so we don't have to do this janky check?

For each enum? Or something different

Member

Sure, why not? Add another field in the enum denoting indexing type, the encoder, and the query generator?
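A minimal sketch of that suggestion, assuming a trimmed-down stand-in for IndexInfo. The enum names, the `cacm` entry, and the accessor names are hypothetical; only the HNSW index name, `BgeBaseEn15`, and `VectorQueryGenerator` come from the example request in this thread.

```java
import java.util.Arrays;

// Hypothetical index categories; the real naming is up to the maintainers.
enum IndexKind { INVERTED, HNSW }

// Illustrative stand-in for IndexInfo, not the actual Anserini enum.
enum IndexInfoSketch {
  CACM("cacm", IndexKind.INVERTED, null, null),
  MSMARCO_V1_PASSAGE_BGE_HNSW("msmarco-v1-passage.bge-base-en-v1.5.hnsw",
      IndexKind.HNSW, "BgeBaseEn15", "VectorQueryGenerator");

  final String indexName;
  final IndexKind kind;
  final String encoder;        // null when the index needs no query encoder
  final String queryGenerator; // null when no specific generator is recorded

  IndexInfoSketch(String indexName, IndexKind kind, String encoder, String queryGenerator) {
    this.indexName = indexName;
    this.kind = kind;
    this.encoder = encoder;
    this.queryGenerator = queryGenerator;
  }

  // Looks an entry up by its public index name, rejecting unknown names.
  static IndexInfoSketch fromName(String name) {
    return Arrays.stream(values())
        .filter(info -> info.indexName.equals(name))
        .findFirst()
        .orElseThrow(() -> new IllegalArgumentException("Unknown index: " + name));
  }

  // Replaces a string check such as index.contains(".hnsw").
  boolean isHnsw() {
    return kind == IndexKind.HNSW;
  }
}
```

With metadata like this, the controller can ask `IndexInfoSketch.fromName(index).isHnsw()` instead of inspecting the index name.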

@lintool
Member

lintool commented Jan 23, 2025

Another thought: maybe have defaults hard-coded in IndexInfo, and then use api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/setting?param1=value1&param2=value2 to set?

Something like that?

@vincent-4
Contributor Author

Another thought: maybe have defaults hard-coded in IndexInfo, and then use api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/setting?param1=value1&param2=value2 to set?

Something like that?

I guess we could have it per-request or make it index-level (persistent). Which approach fits the usual use case better? We could also support both, but that would be more complex.

@lintool
Member

lintool commented Jan 23, 2025

per-request or make it index-level (persistent)

Should be persistent.

…r override

- Remove redundant document content handling from HNSW searches (decided not to return doc contents, since we would have to ‘load twice’ in a sense).

- Simplify HNSW search to only return docids and scores (above)

- Add GET/POST /indexes/{index}/settings endpoints for managing search parameters

- Add parameter override storage with a fallback chain (request → override → default)

- Need to finish tagging IndexInfo
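The fallback chain listed above (request → override → default) could look roughly like the sketch below. It is not the code from this PR: the class and method names are hypothetical, and values are kept as strings for simplicity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical resolver illustrating the request -> override -> default chain.
final class SettingsResolver {
  private final Map<String, String> defaults;
  // Per-index overrides, e.g. as stored by a settings endpoint.
  private final Map<String, Map<String, String>> overrides = new ConcurrentHashMap<>();

  SettingsResolver(Map<String, String> defaults) {
    this.defaults = Map.copyOf(defaults);
  }

  // Stores a persistent override for one index.
  void setOverride(String index, String key, String value) {
    overrides.computeIfAbsent(index, k -> new ConcurrentHashMap<>()).put(key, value);
  }

  // Resolves a parameter: explicit request value first, then the stored
  // override for the index, then the hard-coded default.
  String resolve(String index, String key, String requestValue) {
    if (requestValue != null) return requestValue;
    String override = overrides.getOrDefault(index, Map.of()).get(key);
    if (override != null) return override;
    String fallback = defaults.get(key);
    if (fallback == null) {
      throw new IllegalArgumentException("No value for parameter '" + key + "'");
    }
    return fallback;
  }
}
```

For example, after `setOverride("some-index", "efSearch", "128")`, a request without `efSearch` resolves to `128` for that index, while other indexes still fall back to the default.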
@vincent-4
Contributor Author

@lintool I took down getting the actual doc contents, since we would have to 'load twice' in a sense– want it back?

Also, updated IndexInfo mappings, but there are quite a few left to do (I'd appreciate input on whether I'm screwing up this approach).

@@ -25,7 +25,14 @@ public enum IndexInfo {
"BM25",
new String[] {
"https://github.com/castorini/anserini-data/raw/master/CACM/lucene-index.cacm.20221005.252b5e.tar.gz" },
<<<<<<< Updated upstream
Member

Resolve?

"cfe14d543c6a27f4d742fb2d0099b8e0"),
=======
"cfe14d543c6a27f4d742fb2d0099b8e0",
IndexType.bm25,
Member

IndexType.INVERTED -> inverted index is what allows BM25, and tfidf, etc.

IIRC, all caps is the Java convention?

@@ -46,7 +56,10 @@ public enum IndexInfo {
"SPLADE++ EnsembleDistil",
new String[] {
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v1-passage.splade-pp-ed.20230524.a59610.tar.gz" },
"2c008fc36131e27966a72292932358e6"),
"2c008fc36131e27966a72292932358e6",
IndexType.flat,
Member

IndexType.DENSE_FLAT?

"df4c60fa1f3804fa409499824d12d035"),
=======
"df4c60fa1f3804fa409499824d12d035",
IndexType.hnsw,
Member

IndexType.DENSE_HNSW?

@lintool
Member

lintool commented Jan 28, 2025

@lintool I took down getting the actual doc contents, since we would have to 'load twice' in a sense– want it back?

Sorry, don't understand what this means?

@lintool
Member

lintool commented Jan 28, 2025

re: fetching documents - only the inverted indexes store documents.

If I try to fetch doc from an HNSW index, then it should just "re-route" to the inverted index. We can add "pairings" in IndexInfo - e.g., which inverted index goes with which HNSW index.
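One way to picture the "pairings" idea is a lookup from an HNSW index to the inverted index that stores its documents. This is only a sketch: the class name is hypothetical, and the pairing with `msmarco-v1-passage` is an assumed example, not confirmed metadata from IndexInfo.

```java
import java.util.Map;

// Hypothetical router: decides which index to read document contents from.
final class DocumentRouter {
  // Assumed pairing table: HNSW index name -> inverted index holding the documents.
  private static final Map<String, String> PAIRED_INVERTED_INDEX = Map.of(
      "msmarco-v1-passage.bge-base-en-v1.5.hnsw", "msmarco-v1-passage");

  private DocumentRouter() {}

  // Returns the paired inverted index if one is registered; otherwise the
  // requested index itself (inverted indexes already store their documents).
  static String documentIndexFor(String requestedIndex) {
    return PAIRED_INVERTED_INDEX.getOrDefault(requestedIndex, requestedIndex);
  }
}
```

A document request against the HNSW index would then be served from `documentIndexFor(index)`, leaving the HNSW search itself to return only docids and scores.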

@vincent-4
Contributor Author

@lintool I took down getting the actual doc contents, since we would have to 'load twice' in a sense– want it back?

Sorry, don't understand what this means?

When we make a request, we get this:

curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator'
{"query":{"text":"How does the process of digestion and metabolism of carbohydrates start","qid":""},"candidates":[{"docid":"4812922","score":0.9180973,"doc":null},{"docid":"5721972","score":0.9178153,"doc":null},{"docid":"4517519","score":0.91594195,"doc":null},{"docid":"5918340","score":0.9146468,"doc":null},{"docid":"2969775","score":0.91335136,"doc":null},{"docid":"7494976","score":0.91238195,"doc":null},{"docid":"5164081","score":0.9123745,"doc":null},{"docid":"3145583","score":0.91142595,"doc":null},{"docid":"8699089","score":0.9110645,"doc":null},{"docid":"4953055","score":0.9097354,"doc":null}]}

I omit returning "doc" (which normally has contents with non-HNSW). But I get your latest re:fetching documents comment. Thanks!

@lintool
Member

lintool commented Jan 28, 2025

curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator'

I'm thinking: encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator - these aren't needed? As in, they can't be anything else... if you have a bge index, you must use the bge encoder, or you'll get gibberish?

@vincent-4
Contributor Author

curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator'

I'm thinking: encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator - these aren't needed? As in, they can't be anything else... if you have a bge index, you must use the bge encoder, or you'll get gibberish?

Yeah, that was because:
Prior to adding the tagging, using it with no params is fine. I just had them specified because they weren't defined in the enums yet. IndexInfo won't compile just yet because of the WIP state, but not having to specify them is the end goal.

@vincent-4
Contributor Author

vincent-4 commented Jan 28, 2025

Q:

  • Is there a straightforward way I can check the naming patterns/mappings? I’m only certain about the most commonly used ones (for instance, bge, since I've used it before), but I can't verify the others

@lintool
Member

lintool commented Jan 28, 2025

Q:

  • Is there a straightforward way I can check the naming patterns/mappings? I’m only certain about the most commonly used ones (for instance, bge, since I've used it before), but I can't verify the others

Look in IndexInfo, which contains the ground truth metadata. We've tried to name consistently, but it's evolved over the years. The intention was for the UI to fetch and display the metadata from IndexInfo - to present in a more readable form (and allow user actions like download, etc.).

@vincent-4
Contributor Author

vincent-4 commented Jan 28, 2025

[Screenshot: CleanShot 2025-01-28 at 10 26 45@2x]

Right, I'm adding ~4 fields to each entry in IndexInfo:

Blocking Q: Is 'beir-v1.0.0-scifact' a 'correct' format for the invertedIndex field name? Or something different, since it isn't nice to fix once done. Another idea is to write it as beir-v1.0.0-scifact.flat and similar for later entries

Also attached the new list of fields for convenience:
[Screenshot: CleanShot 2025-01-28 at 10 46 20@2x]

@vincent-4
Contributor Author

Updated IndexInfo (not really worth reviewing)– will update later w/ better search that uses the new enum fields

- Simplify IndexInfo by removing redundant getter methods
- Remove debug logging from search implementation
- Clean up parameter handling in HNSW search
- Remove redundant index entries

indexinfo

indexinfo 2

indexinfo 3
- Add try-with-resources for proper resource cleanup
- Fix generic type specification in HnswDenseSearcher
- Remove unnecessary document retrieval restriction for HNSW
- Extract initialization logic into separate method
- Add parameter validation methods
- Improve resource handling with try-with-resources
- Remove unnecessary imports and cleanup code structure
- Reorder methods for better logical grouping
- Fix indentation and formatting issues
- Move field declarations to top of class
BREAKING CHANGES:
- Remove ability to search without specifying an index
- Change error handling from RuntimeException to IllegalArgumentException
  for consistency with Anserini patterns

Other changes:
- Add validation for encoder and queryGenerator settings
- Remove default constants for query generator and encoder
- Update tests to reflect new error handling and required index parameter
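As an illustration of the error-handling change described above, a parameter check that reports bad input with IllegalArgumentException might look like this. The class and method names are hypothetical and the messages are invented for the sketch; it is not the validation code from this PR.

```java
// Hypothetical validator for numeric request parameters such as hits or efSearch.
final class ParamValidator {
  private ParamValidator() {}

  // Parses a raw parameter value and rejects anything that is not a positive integer.
  static int positiveInt(String name, String raw) {
    final int value;
    try {
      value = Integer.parseInt(raw); // also throws NumberFormatException for null
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(
          "Parameter '" + name + "' must be an integer, got: " + raw, e);
    }
    if (value <= 0) {
      throw new IllegalArgumentException(
          "Parameter '" + name + "' must be positive, got: " + value);
    }
    return value;
  }
}
```

Since IllegalArgumentException is itself a RuntimeException, callers that caught the broader type keep working; the narrower type mainly makes it easier to map bad input to a client-error response.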
@vincent-4 vincent-4 changed the title hnsw endpoint added [ignore this commit... ffixed indexinfo deletions] Jan 29, 2025
@vincent-4
Contributor Author

On the latest test failure: Error: PrebuiltIndexTest.testNumPrebuiltIndexes:63 expected:<166> but was:<169>
I reverted the changes though? Weird.

Otherwise this is ready for review, will add some more tests

@vincent-4 vincent-4 marked this pull request as ready for review January 29, 2025 08:33
@vincent-4 vincent-4 changed the title [ignore this commit... ffixed indexinfo deletions] Add HNSW support to the localhost API interface Jan 29, 2025
Member

@lintool lintool left a comment

Thoughts?

"2c008fc36131e27966a72292932358e6"),
"2c008fc36131e27966a72292932358e6",
IndexType.SPLADE_PP_ED,
"SpladePlusPlusEnsembleDistil",
Member

If we change this to SpladePlusPlusEnsembleDistilEncoder.class, we can reference an actual class as opposed to a string, and empty values can then be nulls. I think I like that better?

(And same for everything below?)

Contributor Author

Yep


import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.http.HttpStatus;
import org.springframework.web.bind.annotation.ResponseStatus;
import org.springframework.web.bind.annotation.ExceptionHandler;

@RestController
@RequestMapping(path = "/api/v1.0")
Member

Should we increment the API version to, say, v1.1?

(okay old routes are removed, but let's make it explicit?)

"c7294ca988ae1b812d427362ffca1ee2"),
"c7294ca988ae1b812d427362ffca1ee2",
IndexType.DENSE_HNSW,
"CohereEmbedEnglishV30",
Contributor Author

Btw, what should be its encoder? I take it literally from the name right above as:

  "Lucene quantized (int8) HNSW index of the MS MARCO V1 passage corpus encoded by Cohere embed-english-v3.0.",

But we don't have anything like it? I looked in io/anserini/encoder

Member

Oh, this doesn't have an encoder... you have to call the Cohere API... so should be null for now and probably not searchable.
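A small sketch of how a class-typed encoder field with null for "no local encoder" could drive searchability. The interface, the stub class, and the enum below are stand-ins invented for illustration; Anserini's real encoder classes are not referenced.

```java
import java.util.Optional;

// Stand-in types for illustration only.
interface QueryEncoderStub {
  String name();
}

final class BgeEncoderStub implements QueryEncoderStub {
  public String name() {
    return "bge";
  }
}

// Hypothetical binding of an index entry to its encoder class.
enum EncoderBinding {
  BGE_HNSW(BgeEncoderStub.class),
  COHERE_HNSW(null); // no local encoder: queries would need an external API call

  private final Class<? extends QueryEncoderStub> encoderClass;

  EncoderBinding(Class<? extends QueryEncoderStub> encoderClass) {
    this.encoderClass = encoderClass;
  }

  // An entry without an encoder class cannot encode free-text queries locally.
  boolean isSearchable() {
    return encoderClass != null;
  }

  // Wraps the nullable field so callers do not have to null-check it themselves.
  Optional<Class<? extends QueryEncoderStub>> encoderClass() {
    return Optional.ofNullable(encoderClass);
  }
}
```

Compared with string names, a class literal is checked by the compiler, so a misspelled or removed encoder fails the build instead of failing at request time.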

- Update API version to v1.1 to reflect breaking changes
- Standardize error handling style (single line for simple checks)
- Clean up code formatting and indentation
- Use LinkedHashMap for index info to hold the new fields, as Map.of reached
its maximum number of entries
- Change IndexInfo names to use the .class suffix, with null values
in the Encoder and QueryGenerator fields. Style: the Inverted Index field
still has empty strings
- Simplify conditional assignments with single line statements
- Remove redundant null check for index parameter since
@PathVariable(required = true) handles it
- Fix document endpoint to use cached SearchService instead of creating
new instances
- Add index existence check before document retrieval
- Add HNSW index type validation for document retrieval
- Standardize error handling format across endpoints
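On the LinkedHashMap item in the list above: `Map.of` only has overloads for up to ten key-value pairs, rejects null keys and values, and does not specify an iteration order, whereas `LinkedHashMap` has none of those limits. A minimal sketch follows; the class name and the key names are hypothetical, not the actual response fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical view builder for the metadata returned about one index.
final class IndexInfoView {
  private IndexInfoView() {}

  static Map<String, Object> describe(String name, String type, String encoder, String queryGenerator) {
    // LinkedHashMap keeps insertion order and can grow past ten entries.
    Map<String, Object> view = new LinkedHashMap<>();
    view.put("indexName", name);
    view.put("indexType", type);
    view.put("encoder", encoder); // may be null; Map.of would throw NullPointerException here
    view.put("queryGenerator", queryGenerator);
    return view;
  }
}
```

Keeping insertion order also means the fields appear in a stable, readable order when the map is serialized.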

codecov bot commented Jan 29, 2025

Codecov Report

Attention: Patch coverage is 40.27778% with 86 lines in your changes missing coverage. Please review.

Project coverage is 66.71%. Comparing base (df900e8) to head (3922dbe).

Files with missing lines Patch % Lines
...rc/main/java/io/anserini/server/SearchService.java 37.86% 55 Missing and 9 partials ⚠️
...c/main/java/io/anserini/server/ControllerV1_0.java 46.34% 22 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2691      +/-   ##
============================================
- Coverage     67.04%   66.71%   -0.33%     
- Complexity     1185     1189       +4     
============================================
  Files           183      183              
  Lines         11342    11455     +113     
  Branches       1372     1395      +23     
============================================
+ Hits           7604     7642      +38     
- Misses         3232     3300      +68     
- Partials        506      513       +7     


Member

@lintool lintool left a comment

lgtm

@lintool lintool merged commit 25b8f95 into castorini:master Jan 29, 2025
1 check passed