Information Retrieval and Web Search course project at Concordia University, taught by Dr. Sabine Bergler.
This assignment has 3 stages: P1, P2, and P3.
Python >= 3.8 is the programming language for this project because of its strong support for natural language processing tasks through the NLTK package.
Project 1 (P1): Text Preprocessing and Proofreading
- Utilize NLTK for text preprocessing, which involves tasks like tokenization and stemming.
- Proofread and ensure the quality of the processed text data.
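The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the assignment's actual pipeline: the regex tokenizer and Porter stemmer are assumptions, and the course may prescribe different tokenization and normalization choices.

```python
import re
from nltk.stem import PorterStemmer

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, then Porter-stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Stocks rallied; traders were buying shares."))
```

Spot-checking the output against the raw text ("stocks" should become "stock", "traders" should become "trader") is one simple way to proofread the processed data.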
Project 2 (P2): Indexing and Query Processing
- Implement a naive indexer for indexing documents.
- Develop a mechanism for processing single-term queries.
- Apply lossy dictionary compression techniques to create a compressed indexer.
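A naive indexer with single-term lookup might look like the sketch below. All names here are illustrative assumptions; the actual assignment reads the Reuters files and applies the P1 preprocessing rather than a plain `split()`.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted postings list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def query(index, term):
    """Single-term query: return the term's postings list (empty if absent)."""
    return index.get(term.lower(), [])

docs = {1: "gold price rises", 2: "oil price falls", 3: "gold mine"}
index = build_index(docs)
print(query(index, "gold"))
```

Lossy dictionary compression then shrinks the `index` keys, for example by dropping numbers or stopwords, trading some query coverage for a smaller dictionary.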
Project 3 (P3): Performance Analysis and Search Engine Implementation
- Measure and compare the execution time required to construct both the naive indexer and the SPIMI (Single-Pass In-Memory Indexing) indexer.
- Utilize the SPIMI indexer to implement two search engines:
- A Ranked BM25 search engine, which ranks search results based on relevance using the BM25 algorithm.
- An Unranked Boolean search engine, which performs basic Boolean (AND, OR, NOT) queries.
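The ranked engine scores documents with Okapi BM25. Below is a minimal scoring sketch under assumed defaults (k1 = 1.5, b = 0.75, and the idf variant with +1 inside the log); the documents are pre-tokenized term lists, and the full engine would read term frequencies from the SPIMI index instead of recomputing them.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every document in `docs` (doc_id -> token list) against the query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs.values()) / N          # average doc length
    tfs = {doc_id: Counter(d) for doc_id, d in docs.items()}
    scores = {}
    for doc_id, d in docs.items():
        score = 0.0
        for term in query_terms:
            df = sum(1 for tf in tfs.values() if term in tf)  # document frequency
            if df == 0:
                continue
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            tf = tfs[doc_id][term]
            # Term frequency saturation (k1) and length normalization (b).
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores[doc_id] = score
    return scores

docs = {1: ["gold", "price", "rises"], 2: ["oil", "price"], 3: ["gold", "gold", "mine"]}
scores = bm25_scores(["gold"], docs)
print(sorted(scores, key=scores.get, reverse=True))
```

The unranked Boolean engine, by contrast, only intersects, unions, or complements postings lists, so every matching document is returned without a relevance score.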
- Reuters Corpus "Reuters-21578"
(Visit Original Website)
In this project, pypy3 is used as the Python 3 executable in place of the native interpreter because of its superior runtime performance. Since these projects iterate repeatedly over a large volume of files, switching to pypy3 yields a substantial speedup.
$ brew install pypy3
$ pypy3 -m pip install virtualenv
$ pypy3 -m virtualenv pypy3-env
$ cd pypy3-env
$ . bin/activate