This repository provides a step-by-step walkthrough of the RAG (Retrieval-Augmented Generation) pipeline codebase.
The pipeline is implemented using a series of Jupyter notebooks. Follow the steps below to understand and run the pipeline.
Before you begin, ensure you have the following in place:
- Python 3.11.11
- Jupyter Notebook
- Required Python packages (listed in requirements.txt)
- A .env file: rename .env.example to .env (remove the .example suffix) and fill in the required values
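As a rough illustration of how the values in .env can reach a notebook, the sketch below parses simple KEY=VALUE lines with the standard library only. The helper names load_env_file and apply_env are hypothetical and not part of this repository, and the notebooks may well load the file differently (for example through a dedicated package). The keys themselves are whatever .env.example lists.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without a key/value separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional surrounding quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def apply_env(values: dict[str, str]) -> None:
    """Copy parsed values into os.environ without overwriting existing ones."""
    for key, value in values.items():
        os.environ.setdefault(key, value)


if __name__ == "__main__":
    # Only load the file if it has been created from .env.example.
    if Path(".env").exists():
        apply_env(load_env_file(".env"))
```

Using setdefault means a variable already exported in the shell takes precedence over the file, which is a design choice of this sketch rather than a requirement of the pipeline.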
- Clone the Repository
  git clone https://github.com/devzohaib/RAG_Pipeline.git
  cd RAG_Pipeline
- Install Dependencies
  pip install -r requirements.txt
Notebook: 1-Data_Collection.ipynb
- Objective: Prepare and preprocess the dataset for the RAG pipeline.
- Steps:
- Load the dataset.
- Clean and preprocess the text data.
- Save the processed data for further use.
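The load, clean, and save steps above could look roughly like the following. This is a minimal sketch and not the notebook's actual code: the cleaning rules (stripping HTML-like tags, collapsing whitespace), the "text" field name, and the JSON Lines output format are all assumptions chosen for illustration.

```python
import json
import re
from pathlib import Path


def clean_text(text: str) -> str:
    """Strip HTML-like tags and collapse runs of whitespace (assumed cleaning rules)."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def preprocess(records: list[dict]) -> list[dict]:
    """Clean the assumed 'text' field of each record and drop records left empty."""
    processed = []
    for record in records:
        cleaned = clean_text(record.get("text", ""))
        if cleaned:
            processed.append({**record, "text": cleaned})
    return processed


def save_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line so a later step can read the data in batches."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path: str) -> list[dict]:
    """Read the records back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Writing one record per line keeps the processed data easy to stream, which suits the batch loading done in the next notebook.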
Notebook: 2-Data_Embedding_and_Storage.ipynb
- Objective: Create embeddings for the processed dataset and store them in the VectorStore.
- Steps:
- Load a batch of the processed data.
- Create embeddings for the data using the OpenAI text-embedding-3-small embedding model.
- Add the data to the VectorStore.
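The batch, embed, and store steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: InMemoryVectorStore is a hypothetical stand-in for whatever vector store the notebook uses, index_documents and the batch size of 100 are made up for the example, and openai_embed assumes the openai package is installed and OPENAI_API_KEY is set in the environment.

```python
import math
from typing import Callable, Iterator, Sequence

EMBEDDING_MODEL = "text-embedding-3-small"

# Any function mapping a batch of texts to one vector per text.
EmbedFn = Callable[[Sequence[str]], list[list[float]]]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def openai_embed(texts: Sequence[str]) -> list[list[float]]:
    """Embed texts with the OpenAI API (needs `openai` and OPENAI_API_KEY)."""
    from openai import OpenAI  # imported lazily so the rest runs without the package

    client = OpenAI()
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=list(texts))
    return [item.embedding for item in response.data]


class InMemoryVectorStore:
    """Hypothetical stand-in for a real vector store: a list plus cosine search."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, list[float]]] = []

    def add(self, texts: Sequence[str], vectors: Sequence[list[float]]) -> None:
        self.entries.extend(zip(texts, vectors))

    def search(self, query_vector: list[float], k: int = 3) -> list[str]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(self.entries, key=lambda e: cosine(query_vector, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def index_documents(texts: Sequence[str], embed_fn: EmbedFn,
                    store: InMemoryVectorStore, batch_size: int = 100) -> None:
    """Embed the texts batch by batch and add each batch to the store."""
    for batch in batched(texts, batch_size):
        store.add(batch, embed_fn(batch))
```

Passing the embedding function as an argument keeps the batching and storage logic testable offline; in a notebook, a call such as index_documents(texts, openai_embed, InMemoryVectorStore()) would send the batches to the OpenAI API.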