Ankit Agrawal 2581532
Akshay Joshi 2581346
Abstract: In this final project, the task is to develop and evaluate a two-stage information retrieval model that given a query returns the n most relevant documents and then ranks the sentences within the documents. For the first part, you should implement a baseline document retriever with tf-idf features. To get full credits, in the second part you should improve over the baseline of the document retriever with an advanced approach of your choice. The third part extends the model to return the ranked sentences. The answer to the query should be found in one of the top-ranked sentences. In addition to the source code, you should submit a 4-6 page report that describes the problem and why it is interesting/challenging in your own words, the preprocessing steps, the models you have developed, your evaluation results and an analysis of the results.
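As an illustration of the baseline stage described in the abstract, here is a minimal sketch of tf-idf document retrieval with cosine similarity. It is not the code in TF-IDF.py: the function names (`tokenize`, `build_tfidf`, `cosine`, `retrieve`), the whitespace tokenizer and the smoothed idf formula are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    return text.lower().split()


def build_tfidf(docs):
    """Return (idf, vectors): idf per term and one sparse tf-idf dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Smoothed idf (an assumption; other variants are equally valid).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: f * idf[t] for t, f in tf.items()})
    return idf, vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, docs, n=2):
    """Return the indices of the n documents most similar to the query."""
    idf, vectors = build_tfidf(docs)
    q_tf = Counter(tokenize(query))
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: f * idf[t] for t, f in q_tf.items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    # Sort by descending score, breaking ties by document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:n]]
```

Calling `retrieve(query, docs, n)` on a list of document strings returns the indices of the top n documents. The same scoring could in principle be applied to the sentences of those documents for the sentence-ranking stage.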
Instructions:
- Install dependencies from requirements.txt
- Run Extract.py to parse the XML for documents and query and generate the preprocessed documents. (Execution time: 3-4 mins)
- Run TF-IDF.py to get the results of all three tasks, i.e. TF-IDF (baseline), BM25Plus, and MRR for sentences. (Execution time: 6-8 mins)
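
For readers unfamiliar with the MRR (mean reciprocal rank) figure reported for sentences, the following is a minimal sketch of how such a score can be computed. The function names and the input layout (one ranked list and one set of relevant items per query) are assumptions for illustration and may differ from what TF-IDF.py does.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1/rank of the first relevant item, or 0.0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, relevant_sets):
    """Average the reciprocal rank over all queries.

    rankings: list of ranked result lists, one per query (hypothetical layout).
    relevant_sets: list of sets of relevant items, aligned with rankings.
    """
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, relevant)
        for ranked, relevant in zip(rankings, relevant_sets)
    )
    return total / len(rankings)
```

A query whose first relevant sentence is ranked second contributes 0.5, one ranked first contributes 1.0, and a query with no relevant sentence in its ranking contributes 0.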
GitHub link: https://github.com/akshayjoshii/Statistical-NLP-Information-Retrieval-Project