YazanZebak/Search-Engine

Search-Engine

An information retrieval system (search engine) that uses TF-IDF and Word2Vec models, together with cosine similarity, to match and rank queries against a collection of documents. The search engine supports multiple preprocessing techniques for English, Arabic, and French datasets.

Datasets

WikIR for English.

WikIR for French.

Mr. TyDi for Arabic.

Project Features

Preprocessing: The project incorporates several preprocessing techniques such as tokenization, stemming, lemmatization, and spell checking to enhance the quality of text data.

Inverted Index: The project utilizes an inverted index data structure to efficiently store and retrieve information about the occurrences of words in the documents. This allows for faster searching and retrieval of relevant documents.
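To illustrate the idea, here is a minimal sketch of an inverted index, assuming whitespace tokenization and an in-memory dictionary. The function names `build_inverted_index` and `candidates` and the sample documents are hypothetical and are not taken from the project's code.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a {doc_id: term_frequency} dictionary."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real preprocessing.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def candidates(index, query):
    """Return the ids of documents that contain at least one query term."""
    found = set()
    for term in query.lower().split():
        found.update(index.get(term, {}))
    return found


# Toy collection, made up for illustration.
docs = {1: "the cat sat", 2: "the dog barked", 3: "a cat and a dog"}
index = build_inverted_index(docs)
```

Only the documents returned by `candidates` need to be scored, which is where the speed-up over scanning the whole collection comes from.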

TF-IDF Model: The TF-IDF model is used to compute the importance of words in documents and queries, enabling efficient retrieval of relevant information.
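As a rough sketch of how such weights can be computed, the example below uses one common variant: term frequency normalized by document length, multiplied by `log(N / df)`. The README does not state which weighting scheme the project uses, so the formula, the function name `tfidf_vectors`, and the sample documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (list of sparse TF-IDF vectors, idf table) for a list of texts."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors, idf


# Toy collection, made up for illustration.
vectors, idf = tfidf_vectors(["the cat sat", "the dog barked"])
```

With this variant, a term that occurs in every document (here "the") receives a weight of zero, while rarer terms are weighted more heavily.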

Word2Vec Model: The Word2Vec model generates word embeddings that capture the semantic meaning of words, improving the understanding and accuracy of search results.
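One simple way to turn word embeddings into a vector for a whole query or document is to average the vectors of its words. The README does not say how the project combines word vectors, so the sketch below is an assumption; the function `average_embedding` is hypothetical, and the two-dimensional `toy_embeddings` are made up rather than produced by a trained Word2Vec model.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known word: fall back to the zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hand-written toy vectors; a real system would load them from a trained model.
toy_embeddings = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0]}

doc_vector = average_embedding(["cat", "car", "unknown"], toy_embeddings, dim=2)
```

Out-of-vocabulary tokens are skipped here, which is one of several possible design choices.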

Cosine Similarity: Cosine similarity is employed to measure the similarity between queries and documents, facilitating effective matching and ranking.
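Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch for sparse vectors stored as dictionaries follows; the names `cosine_similarity` and `rank` and the sample vectors are hypothetical and not taken from the project's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scores = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors, made up for illustration.
query = {"cat": 1.0}
documents = {"d1": {"dog": 1.0}, "d2": {"cat": 2.0}, "d3": {"cat": 1.0, "dog": 1.0}}
ranking = rank(query, documents)
```

Because the score depends only on direction, a long document is not favored over a short one merely for containing more words.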

Evaluation of Search Results: The project implements methods to evaluate the ranked search results by their relevance to the query, using metrics such as precision at k (P@k), recall, mean reciprocal rank (MRR), and mean average precision (MAP).
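The metrics named above can be sketched per query as follows, using their standard textbook definitions. MRR and MAP are then the means of `reciprocal_rank` and `average_precision` over all evaluated queries. The function names and the sample ranking are hypothetical; the project's own evaluation code may differ in detail.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall(ranked, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def mean(values):
    """Average a per-query metric over queries (gives MRR or MAP)."""
    return sum(values) / len(values) if values else 0.0


# Made-up ranking and relevance judgments for a single query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
```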

API with Flask: The search engine is deployed as a Flask-based API that offers various endpoints to interact with the system.

Project Structure

  • datasets/
    • [Dataset files]
  • engine/
    • core/
      • models/
      • preprocess/
      • spell_checker/
    • evaluation/
    • utils/
  • output/
  • server.py

datasets: This directory contains the datasets used in the project.

engine: This directory contains the main engine of the project.

core: This subdirectory contains the core functionality of the engine.

models: This subdirectory contains the models used in the project.

preprocess: This subdirectory contains the preprocessing functionality.

spell_checker: This subdirectory contains the spell checker functionality.

evaluation: This subdirectory contains evaluation scripts and metrics.

utils: This subdirectory contains utility functions and helper modules.

output: This directory is used to store the output files generated by the project.

server.py: This file contains the Flask server code for running the project's services.

Services

Choose Dataset: This service allows you to choose a dataset for processing. It expects a JSON payload with the dataset parameter specifying the dataset name.
Endpoint: /choose-dataset
Method: POST
Payload: { "dataset": "dataset_name" }

Correct: This service performs spell checking on a given query. It expects a JSON payload with the query parameter specifying the input query.
Endpoint: /correct
Method: POST
Payload: { "query": "input_query" }

Search: This service performs a search operation on a given query. It expects a JSON payload with the query parameter specifying the search query.
Endpoint: /search
Method: POST
Payload: { "query": "search_query" }
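
The three services can be called from any HTTP client. Below is a minimal client sketch using only the Python standard library. It assumes the server is running locally on Flask's default development port (5000); the base URL, the helper names `build_request` and `call`, and the format of the responses are assumptions, since the README does not specify them.

```python
import json
import urllib.request

# Assumption: the Flask server runs locally on the default development port.
BASE_URL = "http://localhost:5000"


def build_request(endpoint, payload):
    """Build a POST request carrying a JSON payload for the given endpoint."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(endpoint, payload):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(endpoint, payload)) as response:
        return response.read().decode("utf-8")


# Typical sequence (requires a running server), using the placeholder values
# from the service descriptions above:
#   call("/choose-dataset", {"dataset": "dataset_name"})
#   call("/correct", {"query": "input_query"})
#   call("/search", {"query": "search_query"})
```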
