
Text preprocessing, indexer construction, and search engine implementation for information retrieval. Performance is analyzed by measuring the construction time of the indexers.


chihiroanihr/COMP479_F2022


COMP479_Fall2022

Information Retrieval and Web Search course project at Concordia University, taught by Dr. Sabine Bergler.

Overview

This assignment has 3 stages: P1, P2, and P3.

Built with Python

This project is written in Python (>=3.8), chosen because it is well suited to natural language processing tasks through the NLTK package.

Project 1 (P1): Text Preprocessing and Proofreading

Key Tasks

  • Utilize NLTK for text preprocessing, which involves tasks like tokenization and stemming.
  • Proofread and ensure the quality of the processed text data.
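
As a rough illustration of the tokenization and stemming step, here is a minimal sketch. It is not the project's actual code: the `preprocess` function and the regular-expression tokenizer are assumptions made for this example, and the tokenizer stands in for whichever NLTK tokenizer the project uses. NLTK's `PorterStemmer` is used when NLTK is installed; otherwise tokens are left unstemmed.

```python
import re

try:
    # PorterStemmer needs no extra NLTK data downloads.
    from nltk.stem import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    # Fallback (assumption for this sketch): leave tokens unstemmed.
    def stem(token):
        return token

# Hypothetical tokenizer: runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def preprocess(text):
    """Tokenize the text, lowercase each token, then stem it."""
    tokens = TOKEN_RE.findall(text)
    return [stem(token.lower()) for token in tokens]


print(preprocess("Indexing the documents."))
```

Lowercasing before stemming keeps the output consistent whether or not the stemmer itself folds case.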

Resources

Project 2 (P2): Indexing and Query Processing

Key Tasks

  • Implement a naive indexer for indexing documents.
  • Develop a mechanism for processing single-term queries.
  • Apply lossy dictionary compression techniques to create a compressed indexer.
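
The three tasks above can be sketched on a toy collection as follows. This is an illustrative sketch, not the project's implementation: the function names, the sample documents, the tiny stop word list, and the specific compression steps (number removal, case folding, stop word removal) are all assumptions chosen for the example.

```python
from collections import defaultdict

# Illustrative stop word list (assumption); a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def naive_index(docs):
    """Collect (term, doc_id) pairs, sort and deduplicate them, then group into postings."""
    pairs = []
    for doc_id, text in docs.items():
        for token in text.split():
            pairs.append((token, doc_id))
    index = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        index[term].append(doc_id)
    return dict(index)


def compress(index):
    """Lossy compression: drop numbers, fold case, drop stop words, merge postings."""
    merged = defaultdict(set)
    for term, postings in index.items():
        if term.isdigit():
            continue
        folded = term.lower()
        if folded in STOP_WORDS:
            continue
        merged[folded].update(postings)
    return {term: sorted(postings) for term, postings in merged.items()}


def query(index, term):
    """Single-term query: return the postings list, or an empty list."""
    return index.get(term, [])


docs = {
    1: "The price of Oil rose",
    2: "oil exports fell 10 percent",
    3: "Prices of gold",
}
index = naive_index(docs)
small = compress(index)
print(query(index, "oil"))     # [2]
print(query(small, "oil"))     # [1, 2]
print(len(index), len(small))  # 12 8
```

In this toy example the compressed dictionary is smaller, and case folding also changes query results: "Oil" and "oil" are merged into one entry.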

Resources

Project 3 (P3): Performance Analysis and Search Engine Implementation

Key Tasks

  • Compile and measure the execution time required for constructing both the naive indexer and the SPIMI (Single Pass In-Memory Indexing) indexer.
  • Utilize the SPIMI indexer to implement two search engines:
    • A Ranked BM25 search engine, which ranks search results based on relevance using the BM25 algorithm.
    • An Unranked Boolean search engine, which performs basic Boolean (AND, OR, NOT) queries.
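
To make the two search engines concrete, here is a minimal sketch of Boolean retrieval and BM25 ranking. It is not the project's code: a small in-memory index stands in for the SPIMI index, and the function names, sample documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Return (term -> {doc_id: term frequency}, doc_id -> document length)."""
    index = defaultdict(dict)
    doc_len = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, doc_len


def boolean_and(index, terms):
    """Documents containing every term."""
    sets = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


def boolean_or(index, terms):
    """Documents containing at least one term."""
    result = set()
    for t in terms:
        result |= set(index.get(t, {}))
    return sorted(result)


def boolean_not(index, term, all_docs):
    """Documents that do not contain the term."""
    return sorted(set(all_docs) - set(index.get(term, {})))


def bm25(index, doc_len, query_terms, k1=1.2, b=0.75):
    """Rank documents by BM25 score, highest first."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            # Length normalization: longer documents are penalized.
            norm = k1 * ((1 - b) + b * doc_len[doc_id] / avg_len) + tf
            scores[doc_id] += idf * (k1 + 1) * tf / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    1: "oil prices rose",
    2: "oil exports fell as oil demand dropped",
    3: "gold prices fell",
}
index, doc_len = build_index(docs)
print(boolean_and(index, ["oil", "prices"]))  # [1]
print(boolean_or(index, ["oil", "gold"]))     # [1, 2, 3]
print(boolean_not(index, "oil", docs))        # [3]
print(bm25(index, doc_len, ["oil", "exports"]))
```

The Boolean functions return unordered matches (sorted by document ID only for readability), while `bm25` orders documents by score, which is the distinction between the unranked and ranked engines.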

Resources

Dataset Used

Setup

In this project, pypy3 is used as the Python 3 executable.

PyPy3 replaces the standard Python 3 interpreter because of its faster runtime performance. These projects process a large number of large files through iterative operations, so running them under pypy3 shortens execution time.

Install pypy3

$ brew install pypy3

Install virtualenv

$ pypy3 -m pip install virtualenv

Create a PyPy virtualenv in the directory ~/pypy3-venv

$ pypy3 -m virtualenv ~/pypy3-venv

Start working in the virtual environment

$ cd ~/pypy3-venv/
$ . bin/activate