search_engine

Design and implementation of a search engine using specific information retrieval methods

News Article Information Retrieval System

Introduction

This project designs an Information Retrieval (IR) system tailored to news articles. Using the BM25 and DESM ranking models, our goal is to retrieve the news articles most relevant to a query efficiently and accurately, so that users get timely and reliable information.

System Design

We target two levels of complexity in search results:

General News: The query is applied across all available articles.
Sports News: The query is applied to a telescoped dataset that contains only sports-related articles.
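
The two search levels can be pictured as choosing which subset of the corpus a query runs against. The sketch below is illustrative only: the `Article` class, the `select_corpus` function, and the category values are hypothetical names, not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    doc_id: str
    category: str  # assumed values: "news" or "sports"
    text: str


def select_corpus(articles, scope="general"):
    """Return the articles a query should be run against."""
    if scope == "general":
        # General News: search every available article.
        return list(articles)
    if scope == "sports":
        # Sports News: search only the telescoped, sports-only subset.
        return [a for a in articles if a.category == "sports"]
    raise ValueError(f"unknown scope: {scope!r}")


articles = [
    Article("a1", "news", "Parliament debates the new budget"),
    Article("a2", "sports", "Late goal decides the cup final"),
    Article("a3", "sports", "Sprinter sets a national record"),
]
general = select_corpus(articles, "general")
sports = select_corpus(articles, "sports")
```

Keeping the scope as a parameter means the same ranking code can be evaluated on both datasets without modification.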

Dataset

Source: BBC News articles scraped using Python, categorized under “/news” and “/sports”.
Period: Articles are scraped over a 10-day window so that the collection covers a diverse range of topics to query and judge for relevance.
Challenges: The scraped articles come with no queries or relevance judgments, so these are created through self-labelling and other methods.
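
One way self-labelled judgments could be stored and used is sketched below. The query, document ids, and grades are made-up placeholders, and precision at k is chosen here only as a familiar example of a measure that such labels make possible; the project's actual labelling format and evaluation measures are not stated in this README.

```python
# Hypothetical self-labelled judgments: query -> {doc_id: relevance grade}.
# A grade above 0 means the labeller considered the article relevant.
qrels = {
    "cup final result": {"a2": 1, "a3": 0},
}


def precision_at_k(ranked_doc_ids, judged, k):
    """Fraction of the top-k results that were labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_doc_ids[:k]
    # Unjudged documents are treated as not relevant.
    hits = sum(1 for doc_id in top if judged.get(doc_id, 0) > 0)
    return hits / k


ranking = ["a2", "a3"]  # placeholder output of some retrieval model
p_at_1 = precision_at_k(ranking, qrels["cup final result"], 1)
p_at_2 = precision_at_k(ranking, qrels["cup final result"], 2)
```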

Architecture

Our architecture comprises three main components:

Document Handler: Processes and indexes documents.
Query Handler: Manages query preprocessing.
Retrieval Handler: Executes document retrieval based on query-document relevance.
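
As a rough illustration of how the three components could fit together, the sketch below wires up minimal versions of each. The class and method names are assumptions, the regex tokenizer stands in for real preprocessing, and the term-overlap ranking is a placeholder for the actual retrieval models.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for real preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class DocumentHandler:
    """Processes documents and builds an inverted index."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> set of doc ids

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.index[term].add(doc_id)


class QueryHandler:
    """Applies the same preprocessing to queries as to documents."""

    def preprocess(self, query):
        return tokenize(query)


class RetrievalHandler:
    """Ranks documents by how many distinct query terms they contain."""

    def __init__(self, document_handler):
        self.documents = document_handler

    def retrieve(self, query_terms):
        scores = defaultdict(int)
        for term in set(query_terms):
            for doc_id in self.documents.index.get(term, ()):
                scores[doc_id] += 1
        # Highest score first; ties broken by doc id for a stable order.
        return sorted(scores, key=lambda d: (-scores[d], d))


docs = DocumentHandler()
docs.add("a1", "Budget vote in parliament")
docs.add("a2", "Cup final decided by late goal")
docs.add("a3", "Goal line technology debate")

terms = QueryHandler().preprocess("Cup final goal")
results = RetrievalHandler(docs).retrieve(terms)
```

Sharing one `tokenize` function between the document and query paths keeps both sides in the same term space, which any term-matching model depends on.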

Software and Tools

Front-End: FastAPI for capturing and responding to user queries.
Back-End: Python, PostgreSQL, MongoDB, and libraries including spaCy, Beautiful Soup, psycopg2, pymongo, and rank_bm25.
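
To show what BM25 scoring involves, here is a dependency-free sketch of the Okapi BM25 formula. It is not the rank_bm25 implementation and may not reproduce that library's exact numbers: the function name, the default `k1` and `b` values, and the "+1" smoothed IDF variant are all choices made for this illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    if not tokenized_docs:
        return []
    n_docs = len(tokenized_docs)
    # Fall back to 1.0 so an all-empty corpus does not divide by zero.
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # Length normalisation penalises longer-than-average documents.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            n = doc_freq[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    ["cup", "final", "late", "goal"],
    ["budget", "vote", "parliament"],
    ["goal", "line", "technology"],
]
scores = bm25_scores(["cup", "goal"], docs)
```

In this toy corpus the first document matches both query terms, the third matches one, and the second matches none, so the scores fall in that order.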

Investigation

The investigation focuses on the BM25 and DESM models, including a mixture model, and evaluates their performance on the broad and telescoped datasets.
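
The sketch below illustrates DESM-style scoring and a linear mixture with BM25, following the general form described by Nalisnick et al. (2016): each query term's IN embedding is compared by cosine similarity with the centroid of the document's unit-normalised OUT embeddings, and the mixture is a weighted sum of the two scores. The two-dimensional embeddings, the function names, and the `alpha` value are toy assumptions; the project's actual embeddings and weighting are not given in this README.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(u, v):
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def centroid(doc_terms, out_emb):
    """Mean of the unit-normalised OUT vectors of the document's known terms."""
    vecs = [normalize(out_emb[t]) for t in doc_terms if t in out_emb]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def desm_score(query_terms, doc_terms, in_emb, out_emb):
    """Average cosine similarity between query IN vectors and the document centroid."""
    doc_centroid = centroid(doc_terms, out_emb)
    known = [t for t in query_terms if t in in_emb]
    if doc_centroid is None or not known:
        return 0.0
    return sum(cosine(in_emb[t], doc_centroid) for t in known) / len(known)


def mixture_score(desm, bm25, alpha=0.5):
    """Linear mixture: alpha weights DESM, (1 - alpha) weights BM25."""
    return alpha * desm + (1 - alpha) * bm25


# Toy 2-d embeddings, invented purely for illustration.
in_emb = {"football": [1.0, 0.0], "election": [0.0, 1.0]}
out_emb = {
    "goal": [0.9, 0.1], "striker": [0.8, 0.2],
    "vote": [0.1, 0.9], "ballot": [0.2, 0.8],
}
sports_doc = ["goal", "striker"]
politics_doc = ["vote", "ballot"]

sports_desm = desm_score(["football"], sports_doc, in_emb, out_emb)
politics_desm = desm_score(["football"], politics_doc, in_emb, out_emb)
mixed = mixture_score(0.8, 2.0, alpha=0.25)
```

The sports document scores higher for "football" even though the word never appears in it, which is the kind of match that term-based BM25 cannot make on its own. Because BM25 and DESM scores live on different scales, `alpha` would need tuning on labelled data in practice.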

References

Saracevic, T. (2010). Relevance in IR.
Nalisnick, E. et al. (2016). Improving document ranking with dual word embeddings.


melkemaryam/search_engine
