TF4ces Search Engine

An experiment-driven search engine project, developed to index and retrieve the best documents for a given query using an ensemble of models.

Architecture Diagram

System Design: Search Engine

img.png

System Design: Ensemble Model

img.png

Retrieval Models

  • Filter Models
    • BM25
    • TF-IDF
  • Voter Models
    • MPNet
    • RoBERTa
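
To make the filter-then-vote design concrete, below is a minimal illustrative sketch (not the project's actual pipeline code): BM25 shortlists candidate documents cheaply, then the dense voter models rescore the shortlist and their similarities are averaged. It assumes the `rank_bm25` and `sentence-transformers` packages and uses a toy corpus.

```python
# Illustrative sketch of the filter (BM25) -> voters (MPNet, RoBERTa) ensemble.
# Assumed dependencies: rank_bm25, sentence-transformers. Toy corpus below.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "BM25 is a bag-of-words ranking function.",
    "MPNet produces dense sentence embeddings.",
    "RoBERTa is a robustly optimized BERT variant.",
]
query = "dense embeddings for sentences"

# Filter stage: BM25 shortlists the top-k candidate documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(docs)), key=lambda i: bm25_scores[i], reverse=True)[:2]
candidates = [docs[i] for i in top_k]

# Voter stage: each dense model scores the candidates; scores are averaged.
voters = [SentenceTransformer("all-mpnet-base-v2"),
          SentenceTransformer("all-roberta-large-v1")]
votes = []
for model in voters:
    q_emb = model.encode(query, convert_to_tensor=True)
    d_emb = model.encode(candidates, convert_to_tensor=True)
    votes.append(util.cos_sim(q_emb, d_emb)[0])

ensemble_scores = sum(votes) / len(votes)
ranking = sorted(zip(top_k, ensemble_scores.tolist()), key=lambda x: x[1], reverse=True)
print(ranking)  # [(doc_index, ensemble_score), ...] best first
```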

Project Plan

  • Phase 1
    • Data Analysis & Pipeline
    • Model Pipeline
    • Evaluation Pipeline
  • Phase 2
    • BM25 Model + MPNet Model
    • Hyperparameter tuning
    • Ensemble Pipeline
  • Phase 3
    • RoBERTa Model
    • Ensemble enhancement
    • Experimentation

Future works

  • Finetune ColBERT
  • Implement Clustering of docs

How to run Project

Note: The project was tested on Linux and macOS. (Windows has dependency issues; see Troubleshooting.)

  1. Clone repository

    $ git clone https://github.com/TF4ces/TF4ces-search-engine.git
  2. Set up the environment

    $ python3 -m venv venv
    $ source venv/bin/activate                [LINUX/MAC]
    $ .\venv\Scripts\activate                 [WINDOWS]
    $ pip install -r src/requirements.txt 
  3. Download the pre-generated embeddings from GDrive to this path: ./dataset/embeddings_test

    Note: To generate embeddings from scratch, run the ./tests/test_evaluate_model.py script twice, setting MODEL to all-mpnet-base-v2 and all-roberta-large-v1 respectively.

    WARNING: use a GPU machine; generating the embeddings is expected to take about 1 hour.
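
    For orientation, the sketch below shows roughly what generating embeddings for one MODEL involves; it is not the actual ./tests/test_evaluate_model.py script, and the output layout under ./dataset/embeddings_test is hypothetical.

```python
# Rough illustration of embedding generation for one MODEL; not the real script.
import pickle
from pathlib import Path
from sentence_transformers import SentenceTransformer

MODEL = "all-mpnet-base-v2"                            # run again with "all-roberta-large-v1"
OUT_DIR = Path("./dataset/embeddings_test") / MODEL    # hypothetical output layout
OUT_DIR.mkdir(parents=True, exist_ok=True)

docs = {"d1": "first document text", "d2": "second document text"}  # toy corpus

model = SentenceTransformer(MODEL)                     # uses a GPU automatically if available
embeddings = model.encode(list(docs.values()), batch_size=64, show_progress_bar=True)

with open(OUT_DIR / "doc_embeddings.pkl", "wb") as f:
    pickle.dump(dict(zip(docs.keys(), embeddings)), f)
```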

  4. Run the TF4ces Search Engine [install Jupyter with $ pip install jupyter notebook and launch it with $ jupyter notebook]

    1. Run the evaluation pipeline from the ./tests/notebooks/TF4ces_Search_Eval.ipynb notebook.
    2. Run the prediction demo pipeline from the ./tests/notebooks/TF4ces_Search_Demo.ipynb notebook.

Troubleshooting

  1. Windows systems have been seen to have issues reading data with ir-datasets==0.4.1

    On Windows, doc.iter may throw a decoding error while reading the TSV file; you would need to change the encoding in the dependency's source files as described in this issue.

    Issue: allenai/ir_datasets#208 (comment)
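
    The exact file and line to patch are given in the linked issue; the change amounts to passing an explicit encoding where the dependency opens the TSV file, along these lines (illustrative only, with a hypothetical path):

```python
# Illustrative only: the kind of change the linked issue describes is adding an
# explicit encoding when the TSV file is opened inside the dependency.
with open("collection.tsv", "r", encoding="utf-8") as f:   # hypothetical path
    for line in f:
        doc_id, text = line.rstrip("\n").split("\t", 1)
```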