This repo includes:
- Indexer of documents stored in WARC format
- A query processor
The Indexer outputs an index in a specific format that the query processor understands. The query processor takes the index and a set of queries as parameters, and outputs a document rank of the documents for each query.
Execute the indexer as follows:
python3 indexer.py -m <MEMORY> -c <CORPUS> -i <INDEX>For example, to execute the indexer with a memory limit of 1024 MB, using the
corpus files contained in the data/corpus directory, and output the index to
index.out, run:
python3 indexer.py -m 1024 -c data/corpus -i index.outExecute the query processor as follows. The parameter <INDEX> is the file
output by the Indexer.
python3 processor.py -i <INDEX> -q <QUERIES> -r <RANKER>For example, to execute the query processor using the index index.out, the
queries in the file queries-sample.txt and the ranker BM25, run:
python3 processor.py -i index.out -q queries-sample.txt -r BM25The available rankers are BM25 AND TFIDF.