## Approach

I followed these steps to develop the RAG system and later perform
optimization:

1. Project setup
2. Data preparation and loading
3. RAG system setup
4. Evaluation pipeline setup using RAGAS
5. Run and analyze the baseline benchmark evaluation
6. Identify areas of improvement
7. Identify optimization techniques
8. Implement optimization techniques

### Project setup

I created a new project using Poetry and added the necessary dependencies,
i.e. LangChain tools and RAGAS.

### Data preparation and loading

I used the CNN/Daily Mail dataset for this project. The dataset is available in
the Hugging Face datasets library. I loaded it using the `datasets` library and
extracted the necessary fields for the RAG system.

```Python
dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation[:1000]")
```

The line above loads the first 1000 examples from the validation split of the
CNN/Daily Mail dataset. The function that does this can be found under
`src/rag_pipeline/load_docs.py`.
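
As a rough sketch, the articles can be wrapped into LangChain `Document`
objects before indexing; the exact helper in `src/rag_pipeline/load_docs.py`
may differ from this:

```Python
from datasets import load_dataset
from langchain_core.documents import Document

dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation[:1000]")

# Keep only the `article` field as page_content, as described above
documents = [Document(page_content=row["article"]) for row in dataset]
```
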
## RAG system setup

### Basic RAG system

Having some experience with the ChromaDB vectorstore, I decided to use it for
the initial setup of the RAG system.

I set up my basic RAG system as follows (a minimal sketch of this setup
follows the list):

1. Load documents: I loaded the dataset from a CSV file, then retrieved the
   `article` column only for use as `page_content` to get my documents.
2. Split documents: Using langchain's `RecursiveCharacterTextSplitter`, I split
   the documents into small chunks.
3. Create vectorstore: I used `langchain_chroma` to create a vectorstore from
   the split documents.
4. Set up the LLM: I used OpenAI's gpt-3.5-turbo for testing the setup, with a
   plan to upgrade to gpt-4o when ready.
5. Create a RAG chain that can be used to retrieve documents and generate
   answers. The RAG chain was simple, using
   [`RetrievalQA`](https://docs.smith.langchain.com/old/cookbook/hub-examples/retrieval-qa-chain)
   from langchain.
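
The sketch below shows roughly what this setup looks like with current
LangChain APIs; the package layout and parameter values are assumptions, not
the exact code from this repo:

```Python
from langchain.chains import RetrievalQA
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the loaded documents into small, overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Index the chunks in a Chroma vectorstore
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())

# Wire the retriever and LLM into a simple RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=vectorstore.as_retriever(),
)
```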

### Advancing the RAG system with best practices

I followed these steps to set up the RAG system and make it reusable and
scalable:

1. Created a class `RAGSystem` to set up the RAG system. The class can be
   found under `src/rag_pipeline/rag_system.py`.
2. Added the necessary methods, e.g. to load documents, split documents,
   create the vectorstore, set up the LLM, create the RAG chain, and more.
3. Usage: I could import the class and initialize as follows:
```Python
from src.rag_pipeline.rag_system import RAGSystem

rag_system = RAGSystem(
    model_name="gpt-4o",
    embeddings=embeddings,
    # Here you can add more parameters to customize the RAG system
)

rag_system.initialize()
```

### Create custom rag_chain

I created a custom rag_chain to retrieve documents and generate answers,
allowing more customizability than the `RetrievalQA` chain. The custom
rag_chain can be found under `src/rag_pipeline/rag_utils.py`.

These are the steps I followed to create the custom rag_chain (a minimal
sketch of the full chain follows the list):

- **Defining Helper Functions**: Two helper functions are defined:
  `format_docs`, which formats a list of documents into a concatenated string,
  and `ragas_output_parser`, which extracts page content from a list of
  documents.

- **Custom Prompt Template for the generator llm**: I created a custom prompt
  template `GENERATOR_TEMPLATE` in the settings (`misc/settings.py`). This
  template is then combined with a language model (`llm`) and a string output
  parser to form the generator component of the RAG chain.

- **Creating Context Retriever**: A `RunnableParallel` Langchain object named
`context_retriever` is set up to handle the retrieval of relevant documents.
It combines the retriever with the `format_docs` function to fetch and format
the context, while passing through the question as-is.

- **Filtering Dataset**: A `RunnableLambda` Langchain object named
  `filter_langsmith_dataset` is created to filter the input, ensuring that only
  the question is processed if the input is a dictionary. This function was
  initially used for RAGAS+LangSmith evaluation; however, it works well for any
  dataset.

- **Constructing the RAG Chain**: The final RAG chain is constructed as a
  `RunnableParallel` Langchain object. It does the following:
  - processes the question through the filter,
  - retrieves and formats the context,
  - generates an answer using the generator, and
  - extracts contexts using the `ragas_output_parser`.

- **Returning the RAG Chain**: The function returns the constructed
RunnableParallel object, representing the complete RAG chain setup for
LangSmith integration.
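
Under the assumption that `llm` and `retriever` objects already exist, a
minimal sketch of such a chain might look like the following; the prompt text
and exact wiring are illustrative, and the real implementation in
`src/rag_pipeline/rag_utils.py` may differ:

```Python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnableParallel

def format_docs(docs):
    # Format a list of documents into one concatenated context string
    return "\n\n".join(doc.page_content for doc in docs)

def ragas_output_parser(docs):
    # Extract raw page contents for RAGAS' `contexts` field
    return [doc.page_content for doc in docs]

# Placeholder prompt; the real GENERATOR_TEMPLATE lives in misc/settings.py
prompt = ChatPromptTemplate.from_template(
    "Answer using only the context.\n\nContext: {context}\n\nQuestion: {question}"
)
generator = prompt | llm | StrOutputParser()

# Pass only the question through when the input is a dictionary
filter_langsmith_dataset = RunnableLambda(
    lambda x: x["question"] if isinstance(x, dict) else x
)

# Fetch and format the context while passing the question through as-is
context_retriever = RunnableParallel(
    context=filter_langsmith_dataset | retriever | format_docs,
    question=filter_langsmith_dataset,
)

# Full chain: answer generation plus raw contexts for RAGAS
rag_chain = RunnableParallel(
    answer=context_retriever | generator,
    contexts=filter_langsmith_dataset | retriever | ragas_output_parser,
)
```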

## Integrating pgvector for the vector database

I decided to integrate the pgvector vectorstore for improved performance. I
followed the steps below to integrate pgvector:

1. Set up the pgvector database:
   - Installed the necessary dependencies for pgvector using poetry, including
     `langchain-pgvector` and `pgvector`.
   - Using Docker, I installed the pgvector database, which uses PostgreSQL as
     the underlying database.
   - I created a docker-compose file to install the database. The file can be
     found under `docker-compose.yml`, containing the pgvector service and the
     database service.
   - I created a script to create the `vector` extension and the embeddings
     table. The script is under `scripts/init.sql`. However, when using
     langchain-pgvector, the script is not necessary, as the library creates
     the table and extension for us.
   - I started the database using the command `docker compose up -d`.
   - I wrote a make target to save this command. The target can be found in
     the `Makefile` as `up`, alongside other commands. The `Makefile` allows me
     to easily document and run commands critical to the project.

2. Add the pgvector vectorstore to the RAG system. Implementation and example
   usage from the langchain docs can be found
   [here](https://python.langchain.com/v0.2/docs/integrations/vectorstores/pgvector/).

Since I already had the Chroma vectorstore set up, it was easy to replace it
with pgvector when using langchain. Both can be initialized in a similar
manner. Let's look at the examples.

Chroma:

```Python
chroma_vectorstore = Chroma(
    client=persistent_client,
    collection_name="collection_name",
    embedding_function=embedding_function,
)
```

Pgvector:

```Python
pgvector_vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)
```

- To complete my pgvector setup, I added a connection string to the `.env`
  file. The connection string is used to connect to the pgvector database and
  uses the details from the `docker-compose.yml` file under the `pgvector`
  service's `environment` section.
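
As an illustration, the connection string could be read in Python roughly like
this; the environment variable name, credentials, and database name here are
placeholders, not the project's actual values:

```Python
import os

# Hypothetical example; the real values must match the pgvector service
# `environment` section in docker-compose.yml
connection = os.getenv(
    "PG_CONNECTION_STRING",
    "postgresql+psycopg://postgres:postgres@localhost:5432/vectordb",
)
```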

3. I then added the pgvector vectorstore to the RAG system. The vectorstore
   setup can be found in the `rag_system.py` file.

## Generating evaluation Q&A pairs for RAGAS evaluation

To come up with Q&A pairs that are diverse in complexity, I used RAGAS'
test set generation functions. I followed the steps below to generate the
Q&A pairs (a minimal sketch follows the list):

1. Set up the LLMs:
   - OpenAI's gpt-3.5-turbo was initialized as the generator LLM.
   - OpenAI's gpt-4o was initialized as the critic LLM.
2. Determine the distributions of complexity for the Q&A pairs:
   - RAGAS provides three distributions that can be tweaked, i.e. `simple`,
     `multi-context`, and `reasoning`.
   - I used the distributions
     `simple: 0.5, multi-context: 0.25, reasoning: 0.25`.
3. Generate the Q&A pairs:
   - To generate the evaluation set, I used the
     `generate_with_langchain_docs` function from RAGAS with the
     `distributions`, the `llms`, and `20` samples with `20` documents.
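
A minimal sketch of this step, assuming the RAGAS 0.1-style test set
generation API (names and imports may differ across RAGAS versions):

```Python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-3.5-turbo"),
    critic_llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(),
)

# 20 samples from 20 documents, with the distributions described above
testset = generator.generate_with_langchain_docs(
    documents[:20],
    test_size=20,
    distributions={simple: 0.5, multi_context: 0.25, reasoning: 0.25},
)
testset_df = testset.to_pandas()
```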

## Evaluation pipeline setup using RAGAS

I used [RAGAS](https://docs.ragas.io/en/latest/getstarted/evaluation.html) to
evaluate the RAG system.

At some point, I opted to use LangSmith to trace my evaluations and store the
results. However, many things about the LangSmith integration were unclear,
and the documentation on the RAGAS website was empty. I therefore opted to
build my own RAGAS pipeline and save the results in CSV files.

I followed the steps below to set up the evaluation pipeline:

1. Installed RAGAS using poetry.
2. I started with the simple setup from the RAGAS documentation
[here](https://docs.ragas.io/en/latest/getstarted/evaluation.html).
3. Set up the utility functions to load the dataset, to evaluate using
   RAGAS+LangSmith, and to upload CSV files to LangSmith. They can be found
   under `src/ragas_pipeline/ragas_utils.py`.
4. **Getting contexts and answers**: I then created a function to get the
   contexts and answers for the questions in the evaluation Q&A pairs.

   - The function can be found under `src/ragas/ragas_pipeline.py`.
   - It receives the evaluation Q&A pairs and the rag_chain, and uses the
     rag_chain to get the contexts and answers.

5. **Evaluation pipeline**: I then created a function to run the evaluation
   pipeline. The function can be found under `src/ragas/ragas_pipeline.py`.
   This is what the function does (a sketch of the local path follows the
   list):
   - It begins by defining key metrics, i.e. answer correctness, faithfulness,
     answer relevancy, and context precision, and then loads the evaluation
     data.
   - Choosing the evaluation method: It decides whether to evaluate using
     LangSmith or locally on your machine.
   - Using LangSmith: If using LangSmith, it ensures the dataset name is
     provided alongside the experiment name, uploads the dataset if needed,
     and then evaluates the RAG chain on LangSmith, which shows the results on
     the LangSmith dashboard.
   - Evaluating locally: If evaluating locally, the function
     `get_contexts_and_answers` is used, which uses the rag_chain as mentioned
     in the last step. It then evaluates the results using RAGAS against the
     predefined metrics.
   - Converting and saving results: After evaluation, the results are
     converted into a pandas DataFrame. If saving locally, a directory is
     created if needed and the results are saved as a CSV file.
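
A rough sketch of the local evaluation path, assuming the contexts and answers
have already been collected into an `eval_data` dictionary whose column names
follow RAGAS conventions (the actual pipeline code may differ):

```Python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    faithfulness,
)

# eval_data holds questions, ground truths, and the contexts/answers
# produced by get_contexts_and_answers
dataset = Dataset.from_dict(eval_data)

results = evaluate(
    dataset,
    metrics=[answer_correctness, faithfulness, answer_relevancy, context_precision],
)

# Convert to a DataFrame and save as CSV, named after the experiment
results_df = results.to_pandas()
results_df.to_csv("results/experiment_name.csv", index=False)
```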

## How to run a benchmark on the RAG system using RAGAS

Running the evaluation pipeline using RAGAS is fairly simple. Assume we have
initialized the RAG system as shown in the RAG system setup section above:

```Python
from src.rag_pipeline.rag_system import RAGSystem

rag_system = RAGSystem(
    model_name="gpt-4o",
    embeddings=embeddings,
    # Here we can add more parameters to customize the RAG system
)

rag_system.initialize()
```

We can then run the evaluation pipeline as follows, providing the `rag_chain`
initialized in the `RAGSystem` instance above:

```Python
from src.ragas.ragas_pipeline import run_ragas_evaluation

rag_results = run_ragas_evaluation(
    rag_chain=rag_system.rag_chain,
    save_results=True,
    experiment_name="embedding_model_bge_large",
)
```

The function will run the evaluation pipeline and save the results in a CSV
file, with `experiment_name` used to name the results file.

## The results of the baseline benchmark evaluation


