RAG-lego-like-component

Proposal for industry RAG evaluation: Generative Universal Evaluation of LLMs and Information retrieval

A RAG system consists of a retriever and a large language model (LLM). In response to a user query, the retriever identifies the most relevant content from a corpus, which the LLM then uses to generate a response. This formulation allows for a multitude of design choices, including the retrieval method, chunk size, splitting technique, and embedding model. A RAG system does not perform optimally in every context: its efficacy varies with the data domain, corpus size, and cost budget in question. A universal architecture for all use cases is therefore not viable; each case must be analyzed individually to identify the most suitable parameters.

To simplify the creation of the RAG system, a modular, Lego-like design was developed.
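
The sketch below illustrates what such a Lego-like composition could look like: every design choice (retriever, LLM, number of retrieved chunks) is a swappable part. All class and parameter names here are illustrative assumptions, not the repository's actual API.

```python
# Illustrative sketch of a "Lego-like" RAG pipeline: the retriever and the LLM
# are interchangeable components, so swapping the embedding model, chunk size,
# or retrieval method only means constructing a different retriever.
# Names (Retriever, RAGPipeline, top_k) are hypothetical, not the repo's API.
from dataclasses import dataclass
from typing import Callable, Protocol


class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...


@dataclass
class RAGPipeline:
    retriever: Retriever              # e.g. dense, sparse, or hybrid retrieval
    llm: Callable[[str], str]         # any function mapping a prompt to text
    top_k: int = 3

    def answer(self, query: str) -> str:
        contexts = self.retriever.retrieve(query, self.top_k)
        prompt = (
            "Answer using only the context below.\n\n"
            + "\n\n".join(contexts)
            + f"\n\nQuestion: {query}"
        )
        return self.llm(prompt)
```

With this structure, changing one building block (for example, the embedding model behind the retriever) does not require touching the rest of the pipeline.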


RAG-eval

A RAG system is typically evaluated against fixed benchmark datasets. However, when a benchmark is not available, one must be created from scratch using the private data.

Benchmark creation proposal: Generative Universal Evaluation of LLMs and Information retrieval

Manually creating hundreds of question-context-answer examples from documents is a time-consuming and labor-intensive task. Using a large language model (LLM) to generate synthetic test data is therefore a more efficient solution that reduces both time and effort.


To facilitate the benchmark creation process, an automated method was developed.

At the time of writing, various frameworks exist for creating synthetic test data. The main one is RAGAS, but it is not the optimal choice for smaller LLMs, as they are not always able to produce output in the required JSON format. To solve this problem, the following approach was developed.
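
A sketch of the kind of prompt this approach relies on: the model is asked for a single question as plain text rather than a JSON object, so even small LLMs can comply. The prompt wording and the `llm` callable are illustrative assumptions, not the repository's exact code.

```python
# Illustrative question-generation prompt: plain-text output, no JSON,
# so small models can answer reliably. Wording is an example only.
QUESTION_PROMPT = (
    "You are given a passage from a document.\n"
    "Write exactly one question that can be answered using only this passage.\n"
    "Return the question as plain text, with no extra formatting.\n\n"
    "Passage:\n{context}\n\nQuestion:"
)


def generate_question(llm, context: str) -> str:
    # `llm` is any callable that maps a prompt string to a completion string.
    return llm(QUESTION_PROMPT.format(context=context)).strip()
```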

Algorithm

We first split the text document into chunks of a given chunk_size using the recursive text splitter and store them in an in-memory Qdrant vector store. Then, for each context longer than 100 characters, we generate a question using an LLM. Next, we test whether that question can retrieve its original passage as the top result: if the retriever returns the original context the question was generated from, we assign a score of 1, and 0 otherwise. The resulting benchmark has the following structure:

• ID: each row represents a context with a unique id, used to check whether the retrieved content is the same as the original context;
• question: the query generated by the LLM from the context;
• context: the split document chunk, used as a context only if its length is greater than a threshold;
• retrieved_content: the content retrieved by the system given the question;
• retrieved_id: the chunk ID returned by the retriever;
• score: 1 if the retrieved context equals the original context, 0 otherwise.

Algorithm for benchmark generation
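
A minimal sketch of this procedure, assuming a sentence-transformers embedding model, an in-memory Qdrant collection, and a `question_fn` callable such as the plain-text prompt sketched earlier. Chunk size, model name, and collection name are example values, not the repository's actual configuration.

```python
# Sketch of the benchmark-generation loop described above.
# Assumptions: sentence-transformers embeddings, in-memory Qdrant storage,
# and a caller-supplied question_fn standing in for the LLM call.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 500          # example value; tune per corpus
MIN_CONTEXT_LEN = 100     # contexts shorter than this are skipped

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
client = QdrantClient(":memory:")                   # in-memory Qdrant store


def build_benchmark(document: str, question_fn) -> list[dict]:
    """question_fn: any callable mapping a context string to a question string."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=0)
    chunks = splitter.split_text(document)

    # Index every chunk with its id so the retrieved id can be compared later.
    vectors = embedder.encode(chunks)
    client.create_collection(
        collection_name="benchmark",
        vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
    )
    client.upsert(
        collection_name="benchmark",
        points=[
            PointStruct(id=i, vector=vec.tolist(), payload={"text": chunk})
            for i, (chunk, vec) in enumerate(zip(chunks, vectors))
        ],
    )

    rows = []
    for i, chunk in enumerate(chunks):
        if len(chunk) <= MIN_CONTEXT_LEN:
            continue
        question = question_fn(chunk)
        hit = client.search(
            collection_name="benchmark",
            query_vector=embedder.encode(question).tolist(),
            limit=1,
        )[0]
        rows.append({
            "id": i,
            "question": question,
            "context": chunk,
            "retrieved_content": hit.payload["text"],
            "retrieved_id": hit.id,
            "score": 1 if hit.id == i else 0,  # top-1 hit must be the source chunk
        })
    return rows
```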
