Design and implementation of a search engine using specific information retrieval methods
This project designs an Information Retrieval (IR) system tailored to news articles. Using ranking models such as BM25 and DESM, the system aims to retrieve news articles with high relevance and accuracy, so that users receive timely and reliable information.
We target two levels of complexity in search results:
- General News: The query will be applied across all available articles.
- Sports News: The query will be applied to a telescoped dataset restricted to sports-related articles.
- Source: BBC News articles scraped using Python, categorized under “/news” and “/sports”.
- Period: Articles are scraped over a 10-day window so that the corpus supports a diverse range of queries and relevance levels.
- Challenges: The scraped corpus comes with no queries or relevance judgments, so we create them through self-labelling and other methods.
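The split between the general and the telescoped sports dataset can be sketched from the URL path alone. This is a minimal illustration, not the project's scraper: the `categorize` and `telescope` helpers, the article dictionaries, and the example URLs are all hypothetical, and it assumes the category is the first path segment ("/news" or "/sports") as described above.

```python
from typing import Dict, List, Optional
from urllib.parse import urlparse

# Assumption: the category is the first path segment of the article URL.
CATEGORIES = ("news", "sports")


def categorize(url: str) -> Optional[str]:
    """Return 'news' or 'sports' for a URL, or None if it matches neither."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[0] in CATEGORIES:
        return segments[0]
    return None


def telescope(articles: List[Dict[str, str]], category: str) -> List[Dict[str, str]]:
    """Keep only the articles whose URL falls under the given category."""
    return [a for a in articles if categorize(a["url"]) == category]


# Hypothetical records standing in for scraped articles.
articles = [
    {"url": "https://www.bbc.com/news/example-1", "title": "Example headline"},
    {"url": "https://www.bbc.com/sports/example-2", "title": "Example match report"},
    {"url": "https://www.bbc.com/weather/example-3", "title": "Example forecast"},
]

general = [a for a in articles if categorize(a["url"]) is not None]
sports_only = telescope(articles, "sports")
```

Here `general` would serve the General News level and `sports_only` the Sports News level; anything outside the two categories is dropped.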
Our architecture comprises three main components:
- Document Handler: Processes and indexes documents.
- Query Handler: Manages query preprocessing.
- Retrieval Handler: Executes document retrieval based on query-document relevance.
- Front-End: FastAPI for capturing and responding to user queries.
- Back-End: Python, PostgreSQL, and MongoDB, with libraries including spaCy, Beautiful Soup, psycopg2, pymongo, and rank_bm25.
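How the three handlers fit together can be sketched with the standard library alone. The class names mirror the component names above, but every method, the stopword list, and the term-overlap scoring are assumptions made for illustration; the real system uses spaCy for preprocessing, databases for storage, and BM25/DESM for ranking.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Assumption: a tiny stopword list stands in for real linguistic preprocessing.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "for"}


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize on alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


class DocumentHandler:
    """Processes documents and builds an in-memory inverted index."""

    def __init__(self) -> None:
        self.docs: Dict[str, List[str]] = {}
        self.index: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        tokens = preprocess(text)
        self.docs[doc_id] = tokens
        for token in tokens:
            self.index[token].add(doc_id)


class QueryHandler:
    """Applies the same preprocessing to queries as to documents."""

    def process(self, query: str) -> List[str]:
        return preprocess(query)


class RetrievalHandler:
    """Ranks documents by the number of distinct query terms they contain."""

    def __init__(self, documents: DocumentHandler) -> None:
        self.documents = documents

    def retrieve(self, terms: List[str], top_k: int = 5) -> List[str]:
        scores: Dict[str, int] = defaultdict(int)
        for term in set(terms):
            for doc_id in self.documents.index.get(term, ()):
                scores[doc_id] += 1
        # Highest score first; ties broken by document id for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [doc_id for doc_id, _ in ranked[:top_k]]


documents = DocumentHandler()
documents.add("d1", "Election results announced in the capital")
documents.add("d2", "Football final ends in a penalty shootout")
terms = QueryHandler().process("the football final")
results = RetrievalHandler(documents).retrieve(terms)
```

Sharing `preprocess` between the document and query paths keeps both sides in the same token space, which any ranking function plugged into the retrieval step depends on.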
We focus on the BM25 and DESM models, together with a mixture model that combines the two, and evaluate their performance on both the broad and the telescoped datasets.
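The scoring functions behind these models can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the project's implementation: BM25 is written out by hand with a non-negative IDF variant (the project lists rank_bm25, whose IDF handling differs), DESM follows the formulation in Nalisnick et al. (2016) using query IN-embeddings against the centroid of normalized document OUT-embeddings, and the mixture is the linear combination `alpha * DESM + (1 - alpha) * BM25`. The two-dimensional embeddings and the `alpha` value are made up for the example.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = List[float]


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            # IDF variant with +1 inside the log, so it is never negative.
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def _unit(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def desm_score(query: List[str], doc: List[str],
               emb_in: Dict[str, Vector], emb_out: Dict[str, Vector]) -> float:
    """Mean cosine between query IN vectors and the document OUT centroid."""
    q_vecs = [_unit(emb_in[t]) for t in query if t in emb_in]
    d_vecs = [_unit(emb_out[t]) for t in doc if t in emb_out]
    if not q_vecs or not d_vecs:
        return 0.0  # Assumption: no known terms means no semantic evidence.
    dim = len(d_vecs[0])
    centroid = _unit([sum(v[i] for v in d_vecs) / len(d_vecs) for i in range(dim)])
    cosines = [sum(a * c for a, c in zip(q, centroid)) for q in q_vecs]
    return sum(cosines) / len(cosines)


def mixture_score(desm: float, bm25: float, alpha: float = 0.5) -> float:
    """Linear mixture of the two scores; alpha weights the DESM part."""
    return alpha * desm + (1 - alpha) * bm25


# Toy corpus and hypothetical 2-D embeddings, for illustration only.
docs = [["football", "match", "goal"], ["election", "vote", "result"]]
query = ["football", "goal"]
emb_in = {"football": [1.0, 0.0], "goal": [0.9, 0.1]}
emb_out = {"football": [1.0, 0.0], "match": [0.8, 0.2], "goal": [0.9, 0.1],
           "election": [0.0, 1.0], "vote": [0.1, 0.9], "result": [0.2, 0.8]}

bm25 = bm25_scores(query, docs)
desm = [desm_score(query, d, emb_in, emb_out) for d in docs]
mixed = [mixture_score(s, m, alpha=0.5) for s, m in zip(desm, bm25)]
```

BM25 scores are unbounded while DESM scores are cosines in [-1, 1], so in practice `alpha` would need tuning on labelled queries rather than being fixed at 0.5.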
- Saracevic, T. (2010). Relevance in IR.
- Nalisnick, E. et al. (2016). Improving document ranking with dual word embeddings.