-
Notifications
You must be signed in to change notification settings - Fork 471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HNSW support to the localhost API interface #2691
Conversation
Example usage: curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator'
refactor method
if (encoder == null || queryGenerator == null) { | ||
throw new IllegalArgumentException("HNSW indexes require both 'encoder' and 'queryGenerator' parameters"); | ||
} | ||
if (index.contains(".hnsw") && (encoder == null || queryGenerator == null)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we add a new field in the Enum
in IndexInfo
so we don't have to do this janky check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indexes need to be paired with generators and encoders, right? so put that in IndexInfo
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we add a new field in the
Enum
inIndexInfo
so we don't have to do this janky check?
For each enum? Or something different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, why not? Add another field in the enum denoting indexing type, the encoder, and the query generator?
Another thought: maybe have defaults hard-coded in Something like that? |
I guess we could have it per-request or make it index-level (persistent). Does the usual use case make it easier to work with the first or second approach? Or could add both possibilities but more complex |
Should be persistent. |
…r override - Remove redundant document content handling from HNSW searches (decided not to return doc contents, since we would have to ‘load twice’ in a sense). - Simplify HNSW search to only return docids and scores (above) - Add GET/POST /indexes/{index}/settings endpoints for managing search parameters - Add parameter override storage with a fallback chain (request → override → default) - Need to finish tagging IndexInfo
@lintool I took down getting the actual doc contents, since we would have to 'load twice' in a sense– want it back? Also, updated IndexInfo mappings, but there's quite a few left to do (appreciate input on if I'm screwing up this approach). |
@@ -25,7 +25,14 @@ public enum IndexInfo { | |||
"BM25", | |||
new String[] { | |||
"https://github.com/castorini/anserini-data/raw/master/CACM/lucene-index.cacm.20221005.252b5e.tar.gz" }, | |||
<<<<<<< Updated upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolve?
"cfe14d543c6a27f4d742fb2d0099b8e0"), | ||
======= | ||
"cfe14d543c6a27f4d742fb2d0099b8e0", | ||
IndexType.bm25, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IndexType.INVERTED
-> inverted index is what allows BM25, and tfidf, etc.
IIRC, all caps is the Java convention?
@@ -46,7 +56,10 @@ public enum IndexInfo { | |||
"SPLADE++ EnsembleDistil", | |||
new String[] { | |||
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v1-passage.splade-pp-ed.20230524.a59610.tar.gz" }, | |||
"2c008fc36131e27966a72292932358e6"), | |||
"2c008fc36131e27966a72292932358e6", | |||
IndexType.flat, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IndexType.DENSE_FLAT
?
"df4c60fa1f3804fa409499824d12d035"), | ||
======= | ||
"df4c60fa1f3804fa409499824d12d035", | ||
IndexType.hnsw, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IndexType.DENSE_HNSW
?
Sorry, don't understand what this means? |
re: fetching documents - only the inverted indexes store documents. If I try to fetch doc from an HNSW index, then it should just "re-route" to the inverted index. We can add "pairings" in |
When we make a request, we get this: curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator' I omit returning "doc" (which normally has contents with non-HNSW). But I get your latest re:fetching documents comment. Thanks! |
I'm thinking: |
Yeah, that was because: |
Q:
|
Look in |
Right, I'm adding ~4 fields to each entry in IndexInfo: Blocking Q: Is 'beir-v1.0.0-scifact' a 'correct' format for the invertedIndex field name? Or something different, since it isn't nice to fix once done. Another idea is to write it as beir-v1.0.0-scifact.flat and similar for later entries |
Updated IndexInfo (not really worth reviewing)– will update later w/ better search that uses the new enum fields |
- Simplify IndexInfo by removing redundant getter methods - Remove debug logging from search implementation - Clean up parameter handling in HNSW search - Remove redundant index entries indexinfo indexinfo 2 indexinfo 3
- Add try-with-resources for proper resource cleanup - Fix generic type specification in HnswDenseSearcher - Remove unnecessary document retrieval restriction for HNSW
- Extract initialization logic into separate method - Add parameter validation methods - Improve resource handling with try-with-resources - Remove unnecessary imports and cleanup code structure
- Reorder methods for better logical grouping - Fix indentation and formatting issues - Move field declarations to top of class
BREAKING CHANGES: - Remove ability to search without specifying an index - Change error handling from RuntimeException to IllegalArgumentException for consistency with Anserini patterns Other changes: - Add validation for encoder and queryGenerator settings - Remove default constants for query generator and encoder - Update tests to reflect new error handling and required index parameter
46e62b0
to
6bfb91b
Compare
On the latest test failure: Error: PrebuiltIndexTest.testNumPrebuiltIndexes:63 expected:<166> but was:<169> Otherwise this is ready for review, will add some more tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts?
"2c008fc36131e27966a72292932358e6"), | ||
"2c008fc36131e27966a72292932358e6", | ||
IndexType.SPLADE_PP_ED, | ||
"SpladePlusPlusEnsembleDistil", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we changed to SpladePlusPlusEnsembleDistilEncoder.class
we can reference an actual class as opposed to string, then empty changes can be null
s. I think I like that better?
(And same for everything below?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep
|
||
import org.springframework.web.bind.annotation.PathVariable; | ||
import org.springframework.web.bind.annotation.RequestParam; | ||
import org.springframework.web.bind.annotation.RequestMapping; | ||
import org.springframework.web.bind.annotation.RequestMethod; | ||
import org.springframework.web.bind.annotation.RestController; | ||
import org.springframework.http.HttpStatus; | ||
import org.springframework.web.bind.annotation.ResponseStatus; | ||
import org.springframework.web.bind.annotation.ExceptionHandler; | ||
|
||
@RestController | ||
@RequestMapping(path = "/api/v1.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we increment the API version to, say, v1.1?
(okay old routes are removed, but let's make it explicit?)
"c7294ca988ae1b812d427362ffca1ee2"), | ||
"c7294ca988ae1b812d427362ffca1ee2", | ||
IndexType.DENSE_HNSW, | ||
"CohereEmbedEnglishV30", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, what should be its encoder? I take it literally from the name right above as:
"Lucene quantized (int8) HNSW index of the MS MARCO V1 passage corpus encoded by Cohere embed-english-v3.0.",
But we don't have anything like it? I looked in io/anserini/encoder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this doesn't have an encoder... you have to call the Cohere API... so should be null
for now and probably not searchable.
- Update API version to v1.1 to reflect breaking changes - Standardize error handling style (single line for simple checks) - Clean up code formatting and indentation - Use LinkedHashMap for index info to have new fields, as Map of reached maximum - Change IndexInfo names to use .class suffix. null values specifically in Encoder and QueryGenerator fields. Style: Inverted Index field still has empty strings - Simplify conditional assignments with single line statements
- Remove redundant null check for index parameter since @PathVariable(required = true) handles it - Fix document endpoint to use cached SearchService instead of creating new instances - Add index existence check before document retrieval - Add HNSW index type validation for document retrieval - Standardize error handling format across endpoints
…n-existent Cohere encoder
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2691 +/- ##
============================================
- Coverage 67.04% 66.71% -0.33%
- Complexity 1185 1189 +4
============================================
Files 183 183
Lines 11342 11455 +113
Branches 1372 1395 +23
============================================
+ Hits 7604 7642 +38
- Misses 3232 3300 +68
- Partials 506 513 +7 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Addresses #2688
Example usage:
curl -X GET 'http://localhost:8080/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start&hits=10&efSearch=128&encoder=BgeBaseEn15&queryGenerator=VectorQueryGenerator'
Although would a different API querying method be better? We specify the encoder and generator but without the additional params, (user specified encoder and querygen) it won't work. But the existing usage is not changed, eg
curl -X GET "http://localhost:8081/api/v1.0/indexes/msmarco-v1-passage.bge-base-en-v1.5.hnsw/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start"
is fine.