Add download from remote feature #2301

Merged: 9 commits, Dec 19, 2023

Conversation

ArthurChen189
Member

@ArthurChen189 ArthurChen189 commented Dec 11, 2023

Sample command:

target/appassembler/bin/SearchCollection \
  -index msmarco-v1-passage \
  -topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
  -topicReader TsvInt \
  -output runs/run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25

@ArthurChen189 ArthurChen189 marked this pull request as draft December 11, 2023 20:44

codecov bot commented Dec 13, 2023

Codecov Report

Attention: 37 lines in your changes are missing coverage. Please review.

Comparison is base (2c14a49) 64.29% compared to head (6bbc664) 64.37%.

Files                                                    Patch %   Lines
...in/java/io/anserini/util/PrebuiltIndexHandler.java    67.85%    14 Missing and 13 partials ⚠️
...main/java/io/anserini/search/SearchCollection.java    50.00%    6 Missing ⚠️
src/main/java/io/anserini/index/IndexInfo.java           89.18%    4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2301      +/-   ##
============================================
+ Coverage     64.29%   64.37%   +0.07%     
- Complexity     1333     1352      +19     
============================================
  Files           203      205       +2     
  Lines         11300    11432     +132     
  Branches       1426     1439      +13     
============================================
+ Hits           7265     7359      +94     
- Misses         3558     3583      +25     
- Partials        477      490      +13     


@ArthurChen189 ArthurChen189 marked this pull request as ready for review December 13, 2023 20:50
@lintool
Member

lintool commented Dec 13, 2023

hi @ArthurChen189 can you give me sample commands to verify?

@lintool
Member

lintool commented Dec 13, 2023

Linking to #2301

@ArthurChen189
Member Author

Sorry, forgot to provide a sample command:

target/appassembler/bin/SearchCollection \
  -index msmarco-v1-passage \
  -topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
  -topicReader TsvInt \
  -output runs/run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25

Member

@lintool lintool left a comment

A round of comments.

src/main/java/io/anserini/index/IndexInfo.java (outdated, resolved)
src/main/java/io/anserini/util/PrebuiltIndexHandler.java (outdated, resolved)
@lintool
Member

lintool commented Dec 13, 2023

WARNING?

[INFO] Running io.anserini.index.PrebuiltIndexHandlerTest
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jline.terminal.impl.exec.ExecTerminalProvider (file:/Users/jimmylin/.m2/repository/org/jline/jline/3.23.0/jline-3.23.0.jar) to constructor java.lang.ProcessBuilder$RedirectPipeImpl()
WARNING: Please consider reporting this to the maintainers of org.jline.terminal.impl.exec.ExecTerminalProvider
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
msmarco-v1-passage 100% │██│ 2170758/2170758 (0:01:38 / 0:00:00) Downloading...
File downloaded successfully (MD5 check passed)!
msmarco-v1-passage 100% │██│ 2170758/2170758 (0:01:38 / 0:00:00) Downloading...
Decompressing index...
Index decompressed successfully!

Also, maybe something more lightweight for a unit test, like https://github.com/castorini/anserini-data/tree/master/CACM - lucene9-index.cacm.tar.gz?

This is what I use for some test cases that need a "real" index to download.

Equivalent test in Pyserini: https://github.com/castorini/pyserini/blob/master/tests/test_index_download.py

@lintool
Member

lintool commented Dec 13, 2023

Very nice, works! 🎉

% target/appassembler/bin/SearchCollection \
-index msmarco-v1-passage \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
-topicReader TsvInt \
-output runs/run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25

msmarco-v1-passage 100% │█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│ 2170758/2170758 (0:02:25 / 0:00:00) Downloading...
msmarco-v1-passage 100% │█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│ 2170758/2170758 (0:02:25 / 0:00:00) Downloading...
Decompressing index...
Index decompressed successfully!
2023-12-13 18:49:02,174 INFO  [main] search.SearchCollection (SearchCollection.java:981) - ============ Initializing Searcher ============
2023-12-13 18:49:02,176 INFO  [main] search.SearchCollection (SearchCollection.java:982) - Index: /Users/jimmylin/.cache/pyserini/indexes/lucene-index.msmarco-v1-passage.20221004.252b5e
2023-12-13 18:49:02,302 INFO  [main] search.SearchCollection (SearchCollection.java:986) - Fields: []
2023-12-13 18:49:02,302 INFO  [main] search.SearchCollection (SearchCollection.java:697) - Using DefaultEnglishAnalyzer
2023-12-13 18:49:02,303 INFO  [main] search.SearchCollection (SearchCollection.java:698) - Stemmer: porter
2023-12-13 18:49:02,303 INFO  [main] search.SearchCollection (SearchCollection.java:699) - Keep stopwords? false
2023-12-13 18:49:02,303 INFO  [main] search.SearchCollection (SearchCollection.java:700) - Stopwords file: null
2023-12-13 18:49:02,342 INFO  [main] search.SearchCollection (SearchCollection.java:1258) - runtag: Anserini
2023-12-13 18:49:02,344 INFO  [main] search.SearchCollection (SearchCollection.java:1264) - ============ Launching Search Threads ============
2023-12-13 18:49:04,533 INFO  [pool-4-thread-5] search.SearchCollection$SearcherThread (SearchCollection.java:881) - ranker: bm25(k1=0.9,b=0.4), reranker: default: 100 queries processed
...

Why does it show the progress bar twice, though?

@jasper-xian
Member

I tried the command myself on orca and ran into an error:

Index file already exists! Skip downloading.
Decompressing index...
Index decompressed successfully!
Index path 'msmarco-v1-passage' does not exist or is not a directory.

In my pyserini cache, I see the tarfile but no unzipped file, and the command runs pretty quickly, so it seems to me that there was an issue in unzipping? I printed the paths to the tarfile and they seem to match up, as far as I'm aware.

Also, would it be possible to use an environment variable for PYSERINI_HOME, similar to how Python handles it? I'm not sure if this is possible in Java, but it would be great, as I have my cache set outside of my home directory on orca to save space.
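
On the environment-variable question: Java can read environment variables through System.getenv, so an override of this kind looks feasible. Below is a minimal sketch, not code from this PR. It assumes the variable is named PYSERINI_HOME, as written above, and that it points directly at the directory that holds the indexes; both are assumptions. CacheDirResolver is a hypothetical helper. The fallback mirrors the ~/.cache/pyserini/indexes location that appears in the search log earlier in this thread.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

// Hypothetical helper, not part of Anserini: picks the directory for pre-built indexes.
final class CacheDirResolver {
  // Assumed variable name, taken from the comment above; the real name may differ.
  static final String ENV_VAR = "PYSERINI_HOME";

  // The environment and home directory are passed in so the logic can be tested
  // without touching the real process environment.
  static Path resolve(Map<String, String> env, String userHome) {
    String override = env.get(ENV_VAR);
    if (override != null && !override.trim().isEmpty()) {
      // Assumption: the override is used as-is, with no subdirectory appended.
      return Paths.get(override.trim());
    }
    // Default location, matching the path printed in the search log.
    return Paths.get(userHome, ".cache", "pyserini", "indexes");
  }

  public static void main(String[] args) {
    System.out.println(resolve(System.getenv(), System.getProperty("user.home")));
  }
}
```

With a shape like this, anyone who keeps the cache outside the home directory would only need to export the variable before running SearchCollection; everyone else keeps the current default.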

@ArthurChen189
Member Author

In my pyserini cache, I see the tarfile but no unzipped file, and the command runs pretty quickly, so it seems to me that there was an issue in unzipping? I printed the paths to the tarfile and they seem to match up, as far as I'm aware.

Thanks for letting me know. I think that was caused by an interrupted download, which left an incomplete tarball that never went through the checksum verification. Could you delete it and run the command again? Thanks!
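
One way to guard against this failure mode would be to verify the checksum of an existing tarball before skipping the download, rather than only checking that the file exists. Below is a minimal sketch under the assumption that an expected MD5 is known for each index. TarballCheck and its method names are hypothetical and do not reflect how PrebuiltIndexHandler is actually written.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Anserini: decides whether a cached tarball can be reused.
final class TarballCheck {
  // Streams the file through MessageDigest so large tarballs are not loaded into memory.
  static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    try (InputStream in = Files.newInputStream(file)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
      hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString();
  }

  // A tarball is reusable only if it exists and its checksum matches;
  // a partial file from an interrupted download fails the second condition.
  static boolean isReusable(Path tarball, String expectedMd5)
      throws IOException, NoSuchAlgorithmException {
    return Files.isRegularFile(tarball) && md5Hex(tarball).equalsIgnoreCase(expectedMd5);
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("usage: TarballCheck <tarball> <expected-md5>");
      return;
    }
    System.out.println(isReusable(Paths.get(args[0]), args[1]) ? "reuse" : "re-download");
  }
}
```

If a check like this returned false, the caller could delete the stale file and download again instead of going straight to decompression.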

@ArthurChen189
Member Author

I got some unexpected errors after merging, so I'm changing this to draft.

@ArthurChen189 ArthurChen189 marked this pull request as draft December 19, 2023 01:30
@ArthurChen189 ArthurChen189 marked this pull request as ready for review December 19, 2023 17:56
@lintool
Member

lintool commented Dec 19, 2023

Tried the following and it worked!

target/appassembler/bin/SearchCollection \
  -index msmarco-v1-passage \
  -topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
  -topicReader TsvInt \
  -output runs/run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25

@ArthurChen189 let's go ahead and merge... we can circle back if there are additional issues.

In your commit msg, be sure to mention "pre-built indexes" somewhere...

@ArthurChen189 ArthurChen189 merged commit 45591ab into castorini:master Dec 19, 2023
3 checks passed
@ArthurChen189 ArthurChen189 deleted the add-download branch December 19, 2023 18:53
@jasper-xian
Member

Tested the same command on my M1 MacBook Pro, and it worked for me as well!

target/appassembler/bin/SearchCollection \
  -index msmarco-v1-passage \
  -topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
  -topicReader TsvInt \
  -output runs/run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25

3 participants