Optimization techniques #40

Merged · 13 commits · Jul 30, 2024
README.md (166 changes: 161 additions & 5 deletions)
@@ -243,8 +243,17 @@
results. However, with Langsmith, many things were not clear. The documentation
on the RAGAS website was empty. I therefore opted to build my own RAGAS pipeline and
save the results in CSV files.

### Choice of RAGAS metrics for evaluation
I prioritized the following metrics for evaluation, in this order:
1. Answer Correctness: How accurate the answer is compared to the ground truth.
2. Faithfulness: How well the answer aligns with the facts in the given context.
3. Answer Relevancy: How well the answer addresses the question asked.
4. Context Precision: How relevant the retrieved information is to the question.

More metrics and their explanations can be found in the RAGAS documentation [here](https://docs.ragas.io/en/stable/concepts/metrics/index.html).
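
As a minimal sketch, these metrics can be pulled in from ragas roughly as follows (assuming a ragas version around 0.1.x, where they are exposed as module-level objects):

```Python
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    faithfulness,
)

# Metrics in the priority order described above; this list is what gets
# passed to ragas' evaluate() call.
metrics = [answer_correctness, faithfulness, answer_relevancy, context_precision]
```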


### Steps followed to setup the evaluation pipeline
1. I installed RAGAS using Poetry.
2. I started with the simple setup from the RAGAS documentation
   [here](https://docs.ragas.io/en/latest/getstarted/evaluation.html).
@@ -278,8 +287,10 @@
needed and the results saved as a CSV file.
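
For orientation, the simple setup from the RAGAS getting-started guide looks roughly like the sketch below (assuming ragas ~0.1.x; the example row is a placeholder, not data from the project):

```Python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy, context_precision, faithfulness

# A tiny hand-built dataset with the column names ragas expects.
data = Dataset.from_dict({
    "question": ["What was the verdict in the trial?"],           # placeholder row
    "answer": ["The defendant was acquitted."],
    "contexts": [["The jury acquitted the defendant on Friday."]],
    "ground_truth": ["The defendant was acquitted."],
})

result = evaluate(data, metrics=[answer_correctness, faithfulness, answer_relevancy, context_precision])
print(result)  # aggregate score per metric
```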

## How to run a benchmark on the RAG system using RAGAS
Running the evaluation pipeline using ragas is fairly simple. Assuming we have
initialized the RAG system as shown in the section on RAG system setup above:

```Python
from src.rag_pipeline.rag_system import RAGSystem
# ... (RAGSystem instantiation lines collapsed in the diff view) ...
rag_system.initialize()
```

We can then run the evaluation pipeline as follows, providing the `rag_chain`
initialized in the instance of RAGSystem above:

```Python
from src.ragas.ragas_pipeline import run_ragas_evaluation
# Call collapsed in the diff view; presumably along these lines
# (the argument values are placeholders, not the project's actual code):
results = run_ragas_evaluation(
    rag_chain=rag_system.rag_chain,        # assumed attribute exposing the chain
    experiment_name="baseline_benchmark",  # placeholder experiment name
)
```

The function runs the evaluation pipeline and saves the results in a CSV file,
with the `experiment_name` used to name the results file.
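
The save step presumably boils down to something like the helper below; the function name and file naming are assumptions rather than the project's actual code.

```Python
import pandas as pd

def save_ragas_results(result, experiment_name: str) -> None:
    """Persist a ragas evaluation result as <experiment_name>.csv (hypothetical helper)."""
    df: pd.DataFrame = result.to_pandas()  # ragas results expose a to_pandas() converter
    df.to_csv(f"{experiment_name}.csv", index=False)
```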

## The results of the baseline benchmark evaluation

The baseline benchmark evaluation was run using the RAG system with the following configuration:
- Model: GPT-4o
- Embeddings: OpenAIEmbeddings (text-embedding-ada-002)
- Vectorstore: pgvector
- Chunking strategy: RecursiveCharacterTextSplitter, chunk_size=1000, overlap=200
- RAG chain: RetrievalQA with the default prompt

### Summary statistics
Since the metric scores were all of type float64, I could carry out numerical calculations. I computed summary statistics (mean, standard deviation, and quartiles) and created visualizations to understand the performance of the RAG system.
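
A minimal sketch of that analysis, assuming the results were loaded back from the saved CSV file (the file name is a placeholder; the column names follow the metrics above):

```Python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("baseline_benchmark.csv")  # placeholder results file
metric_cols = ["answer_correctness", "faithfulness", "answer_relevancy", "context_precision"]

print(df[metric_cols].describe())  # mean, std and quartiles for each metric
df[metric_cols].plot(kind="box", figsize=(8, 4), title="Baseline benchmark metrics")
plt.tight_layout()
plt.show()
```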

Below is a boxplot of the summary statistics:

![baseline-benchmark-results](screenshots/results/baseline_benchmark_visualization.png)

Key observations from the summary statistics and boxplots:

- **Answer Correctness**: The average answer correctness is `0.689`, suggesting that the system generates reasonably accurate answers most of the time. However, there's a wide range `(0.23 to 1)`, indicating that the accuracy can vary significantly depending on the question. The standard deviation of 0.189 also supports this observation.

- **Faithfulness**: The system excels in faithfulness, with a high average score of `0.863` and `75%` of the values at the maximum of 1. This indicates that the generated answers are generally consistent with the provided context.

- **Answer Relevancy**: The average answer relevancy is 0.847, suggesting that the answers are mostly relevant to the questions. However, there are a few instances where the relevancy is 0, indicating that the system might sometimes generate irrelevant responses. The standard deviation of 0.292 also indicates a relatively wide range of relevancy scores.

- **Context Precision**: The system performs exceptionally well in context precision, with an average score of 0.98 and most values concentrated near 1. This suggests that the system is highly effective at retrieving relevant context for answering questions.


### Further analysis
I noted the following observations when manually going through each question in the results:



## Identifying areas of improvement

## Optimization techniques

To address the identified areas of improvement, I planned to implement the following optimization techniques:

* **Prompt Engineering**: Experiment with different prompt formats to guide the model towards generating more factually accurate and relevant answers. This could improve answer_correctness, faithfulness, and answer_relevancy.
* **Use of Hybrid Retrievals**: Combining dense (e.g., neural) and sparse (e.g., BM25) retrieval models can leverage the strengths of both approaches, leading to improved document retrieval performance.
* **Using Multiquery retriever**: By generating multiple queries, this could help find documents that might have been missed due to subtle differences in wording or imperfect vector representations.
* **Experimenting with different embedding models**: Using advanced embeddings such as BERT or other transformer-based models for context representation can improve the quality of the retrieved documents, leading to better answer generation.
* **Reranking mechanism**: After initial retrieval, re-rank the retrieved documents based on additional factors like relevance scores. This could help prioritize the most relevant documents, improving answer_correctness and answer_relevancy.
* **Improved Chunking Strategy**: Optimizing chunk size and overlap parameters can help in capturing more coherent and contextually relevant document sections, thereby improving the quality of the retrieved context.


## Implementation of optimization techniques
### Prompt Engineering
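
The exact prompts experimented with are not listed here, but as a rough illustration, a custom prompt can be supplied to a `RetrievalQA` chain along these lines (the template wording and the `retriever` object are assumptions):

```Python
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

template = """Answer the question using only the context below.
If the context does not contain the answer, say you don't know.

Context: {context}

Question: {question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

rag_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    chain_type="stuff",
    retriever=retriever,  # assumed: the retriever from the existing RAG system
    chain_type_kwargs={"prompt": prompt},
)
```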

Below is a boxplot comparing the results of this technique against the baseline benchmark:

![baseline-benchmark-results](screenshots/results/baseline_benchmark_visualization.png)


### Hybrid Retrievals
I implemented a hybrid retrieval mechanism that combines dense and sparse retrieval models to improve document retrieval performance.

The hybrid retrieval mechanism uses a combination of BM25 and the base retriever from the RAG system to retrieve documents. The BM25 retriever is used to retrieve the top `k` documents based on the BM25 score, and the base retriever is used to retrieve additional documents. The documents retrieved by both retrievers are then combined and ranked based on relevance scores.

The ranking is done by Reciprocal Rank Fusion (RRF), which combines the rankings from both retrievers. For each document, RRF sums the reciprocal of a constant `k` plus the document's rank in each retriever's result list: `score(d) = Σ_r 1 / (k + rank_r(d))`.
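
A sketch of how such a hybrid retriever can be assembled with LangChain's `EnsembleRetriever`, which fuses results with RRF; the `documents` list and the dense `base_retriever` are assumed to come from the existing RAG setup:

```Python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Sparse keyword retriever over the raw chunks (needs the rank_bm25 package).
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Fuse sparse and dense results; rankings are combined via Reciprocal Rank Fusion.
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, base_retriever],
    weights=[0.5, 0.5],
)
docs = hybrid_retriever.invoke("What happened at the sentencing?")
```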

Below is a boxplot comparing the results of this technique against the baseline benchmark:

![baseline-benchmark-results](screenshots/results/baseline_benchmark_visualization.png)

### Use of Multiquery retriever
- I implemented a multiquery retriever that generates multiple queries to retrieve documents (see the sketch below).
- The multiquery retriever generates `n` query variants of the question using an LLM (gpt-3.5-turbo).
- It then retrieves documents for each query using the base retriever from the RAG system.
- The documents retrieved by each query are then combined and ranked.
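
A sketch using LangChain's `MultiQueryRetriever`; the `base_retriever` is assumed to be the RAG system's existing retriever:

```Python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

multiquery_retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,  # assumed: the dense retriever from the RAG system
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),  # rewrites the question into several variants
)
docs = multiquery_retriever.invoke("How did the jury reach its decision?")
```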

Below is a boxplot comparing the results of this technique against the baseline benchmark:

![baseline-benchmark-results](screenshots/results/baseline_benchmark_visualization.png)

### Chunking Strategy
- I experimented with a different chunking strategy, which could improve the quality of the retrieved context, enhancing context_precision and hence answer generation.
- With `RecursiveCharacterTextSplitter`, I used a chunk size of 500 and an overlap of 100 (see the sketch below).
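
A minimal sketch of that splitter configuration; the `documents` list is assumed to be the loaded source articles:

```Python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)  # smaller, overlapping chunks for indexing
```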

These were the results of the chunking strategy:

![chunking-strategy-analysis](screenshots/results/chunk_500_overlap_100_visualization.png)

* **Answer Correctness**: This increased from `0.689` in the baseline to `0.711` with the smaller chunks.
  - This suggests that smaller chunks might be more effective in guiding the model to generate factually accurate answers.
  - The standard deviation also decreased (as seen from the smaller whiskers in the boxplot), indicating more consistent accuracy in the chunked results.

* **Faithfulness**: The average faithfulness also increased notably from 0.863 to 0.905.
- This implies that the chunked configuration, with smaller context windows, might help the model generate answers that are more aligned with the factual information in the context.
- The standard deviation also decreased, indicating more consistent faithfulness in the chunked results.

* **Answer Relevancy**: The average answer relevancy improved from 0.847 to 0.938.
- This suggests that smaller chunks are more effective in guiding the model to generate answers that are relevant to the questions.
- The standard deviation also decreased significantly, indicating much more consistent relevancy in the chunked results.

* **Context Precision**: The average context precision slightly decreased from 0.980 to 0.955.
- This suggests that smaller chunks might lead to slightly less precise context retrieval compared to larger chunks. However, the chunked configuration still maintains a high average context precision.

Overall, reducing the chunk size and overlap seems to have a positive impact on all metrics except for context precision, which experienced a minor decrease. This suggests that smaller chunks might be a more effective strategy for this particular RAG system and dataset.

**Recommendations**: Explore different combinations of chunk size and overlap to find the optimal configuration for this specific RAG system and dataset.

### Experimenting with different embedding models
- I experimented with different embedding models to improve the quality of retrieval and hence the answer generation.
- I used embedding models from the Hugging Face hub and OpenAI text embedding models (see the sketch below).
- For OpenAI, `text-embedding-ada-002` is the default model, so its results are the baseline results. I tried `text-embedding-3-large` as well.
- For the Hugging Face models, I tried out the following:
  - `BAAI/bge-small-en-v1.5` (384 dim): fast, the default English model (0.067 GB in size)
  - `BAAI/bge-large-en-v1.5` (1024 dim): large English model (1.2 GB in size)
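
A sketch of how these embedding models can be swapped in; the package paths reflect the common LangChain integrations, and which integration the project actually uses is an assumption:

```Python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings

openai_large = OpenAIEmbeddings(model="text-embedding-3-large")
bge_small = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
bge_large = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

# Any of these can replace the default text-embedding-ada-002 embeddings
# when (re)building the vectorstore index.
```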

Below is a comparison of the results of the different embedding models:

![embedding-models-analysis](screenshots/results/embedding_models_evaluation.png)

From the analysis we can deduce the following:
* **Baseline_ada2_openai**: This model serves as our baseline, achieving the highest scores in answer correctness and faithfulness. It indicates a strong capability to generate factually accurate and contextually consistent answers.

* **all-mpnet-v2**: This model demonstrates a well-rounded performance, exhibiting competitive scores across all metrics. It particularly excels in answer relevancy, suggesting its ability to understand user intent and generate pertinent responses. This model is often pre-trained on a massive amount of web text, which might contribute to its strong semantic understanding.

* **bge-large and bge-small**: These models seem to be less effective compared to the others. The "large" version performs slightly better than the "small" one, which is expected due to its larger size and capacity. However, both struggle with answer correctness and faithfulness, indicating potential difficulties in capturing factual information and maintaining consistency with the context. This could be due to limitations in their pre-training data or architecture.

### Future improvements
If I had more time, I would consider the following additional optimization techniques:

- **Fine-tuning the language or embedding models**: Fine-tune the models on a dataset specifically designed for the domain at hand (in this case, the CNN/DailyMail dataset). This could improve answer_correctness and faithfulness by tailoring the model's knowledge and response style.
- **Adaptive chunking**: Using adaptive chunking strategies that vary chunk size based on document type and content length can improve context relevance and coherence.
- **Agentic chunking**



### Using an open-source model for CrossEncoderReranking

The embeddings I used were from the `sentence-transformers` library: the model
`sentence-transformers/msmarco-distilbert-dot-v5`.
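
A rough sketch of reranking retrieved passages with an open-source cross-encoder from `sentence-transformers`; the helper and its defaults are hypothetical, not the project's actual code:

```Python
from sentence_transformers import CrossEncoder

def rerank(query: str, passages: list[str],
           model_name: str = "BAAI/bge-reranker-base", top_n: int = 4) -> list[str]:
    """Score (query, passage) pairs with a cross-encoder and keep the top_n passages."""
    cross_encoder = CrossEncoder(model_name)
    scores = cross_encoder.predict([(query, passage) for passage in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]
```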

The drawbacks:

- The reranker was quite slow: on average it took 22 seconds per retrieval, as seen in the Langsmith traces. ![cross-encoder-reranking-opensource-model-langsmith-traces](screenshots/langsmith-tracing-opensource-rerankerScreenshot%20from%202024-07-30%2006-34-14.png)
- During the evaluation with ragas, the entire process took 15 minutes.
- My assumption was that since the model is from Hugging Face and runs locally, it uses the CPU for its operations, which makes it slow. I believe this can be improved by hosting the model and using a GPU.
- Another reason to support this is that I used `bge-reranker-base`, which is 1.1 GB in size. When I throttled the CPU, retrieval got even slower, up to 110 seconds.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
467 changes: 467 additions & 0 deletions notebooks/benchmark analysis/compare_benchmarks.ipynb


785 changes: 0 additions & 785 deletions notebooks/optimization_techniques/compare_benchmarks.ipynb

This file was deleted.
