TF4ces Search Engine

An experiment-driven search engine project, developed to index and retrieve the best documents for a given query using an ensemble of models.

Architecture Diagram

System Design: Search Engine

img.png

System Design: Ensemble Model

img.png

Retrieval Models

  • Filter Models
    • BM25
    • TF-IDF
  • Voter Models
    • MPNet
    • RoBERTa
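
To make the filter-then-vote design concrete, below is a minimal illustrative sketch (not the project's actual pipeline code): BM25 shortlists candidate documents cheaply, then the dense voter models rescore the shortlist and their similarities are averaged. It assumes the `rank_bm25` and `sentence-transformers` packages and uses a toy corpus.

```python
# Illustrative sketch of the filter (BM25) -> voters (MPNet, RoBERTa) ensemble.
# Assumed dependencies: rank_bm25, sentence-transformers. Toy corpus below.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "BM25 is a bag-of-words ranking function.",
    "MPNet produces dense sentence embeddings.",
    "RoBERTa is a robustly optimized BERT variant.",
]
query = "dense embeddings for sentences"

# Filter stage: BM25 shortlists the top-k candidate documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(docs)), key=lambda i: bm25_scores[i], reverse=True)[:2]
candidates = [docs[i] for i in top_k]

# Voter stage: each dense model scores the candidates; scores are averaged.
voters = [SentenceTransformer("all-mpnet-base-v2"),
          SentenceTransformer("all-roberta-large-v1")]
votes = []
for model in voters:
    q_emb = model.encode(query, convert_to_tensor=True)
    d_emb = model.encode(candidates, convert_to_tensor=True)
    votes.append(util.cos_sim(q_emb, d_emb)[0])

ensemble_scores = sum(votes) / len(votes)
ranking = sorted(zip(top_k, ensemble_scores.tolist()), key=lambda x: x[1], reverse=True)
print(ranking)  # [(doc_index, ensemble_score), ...] best first
```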

Project Plan

  • Phase 1
    • Data Analysis & Pipeline
    • Model Pipeline
    • Evaluation Pipeline
  • Phase 2
    • BM25 Model + MPNet Model
    • Hyperparameter tuning
    • Ensemble Pipeline
  • Phase 3
    • RoBERTa Model
    • Ensemble enhancement
    • Experimentation

Future works

  • Finetune ColBERT
  • Implement Clustering of docs

How to run Project

Note: The project was tested on Linux and macOS. (Windows has dependency issues; see Troubleshooting.)

  1. Clone repository

    $ git clone https://github.com/TF4ces/TF4ces-search-engine.git
  2. Set up the environment

    $ python3 -m venv venv
    $ source venv/bin/activate                [LINUX/MAC]
    $ .\venv\Scripts\activate                 [WINDOWS]
    $ pip install -r src/requirements.txt 
  3. Download the pre-generated embeddings from GDrive to this path: ./dataset/embeddings_test

    Note: To generate embeddings from scratch, run the ./tests/test_evaluate_model.py script twice, setting MODEL to all-mpnet-base-v2 and all-roberta-large-v1 respectively.

    WARNING: use a GPU machine; generating the embeddings is expected to take about 1 hour.
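
    For orientation, the sketch below shows roughly what generating embeddings for one MODEL involves; it is not the actual ./tests/test_evaluate_model.py script, and the output layout under ./dataset/embeddings_test is hypothetical.

```python
# Rough illustration of embedding generation for one MODEL; not the real script.
import pickle
from pathlib import Path
from sentence_transformers import SentenceTransformer

MODEL = "all-mpnet-base-v2"                            # run again with "all-roberta-large-v1"
OUT_DIR = Path("./dataset/embeddings_test") / MODEL    # hypothetical output layout
OUT_DIR.mkdir(parents=True, exist_ok=True)

docs = {"d1": "first document text", "d2": "second document text"}  # toy corpus

model = SentenceTransformer(MODEL)                     # uses a GPU automatically if available
embeddings = model.encode(list(docs.values()), batch_size=64, show_progress_bar=True)

with open(OUT_DIR / "doc_embeddings.pkl", "wb") as f:
    pickle.dump(dict(zip(docs.keys(), embeddings)), f)
```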

  4. Run the TF4ces Search Engine [install Jupyter with $ pip install jupyter notebook and launch it with $ jupyter notebook]

    1. Run the evaluation pipeline from the ./tests/notebooks/TF4ces_Search_Eval.ipynb notebook.
    2. Run the prediction demo pipeline from the ./tests/notebooks/TF4ces_Search_Demo.ipynb notebook.

Troubleshooting

  1. Windows systems have been seen to have issues reading data with ir-datasets==0.4.1

    On Windows, doc.iter may throw a decoding error while reading the TSV file; you would need to change the encoding in the dependency's source files as described in this issue.

    Issue: allenai/ir_datasets#208 (comment)
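
    The exact file and line to patch are given in the linked issue; the change amounts to passing an explicit encoding where the dependency opens the TSV file, along these lines (illustrative only, with a hypothetical path):

```python
# Illustrative only: the kind of change the linked issue describes is adding an
# explicit encoding when the TSV file is opened inside the dependency.
with open("collection.tsv", "r", encoding="utf-8") as f:   # hypothetical path
    for line in f:
        doc_id, text = line.rstrip("\n").split("\t", 1)
```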