diff --git a/index.html b/index.html
index a3f1ef3..bdf3ff7 100644
--- a/index.html
+++ b/index.html
@@ -157,7 +157,7 @@

Executive Summary

Retrieval-augmented generation (RAG) is an AI Question-Answering framework that surfaced in 2020 and that combines the capabilities of Large Language Models (LLMs) with information retrieval over a specific domain of expertise (hereafter “evaluation reports”). This paper presents the challenges and opportunities associated with this approach in the context of evaluation. It then suggests a potential solution and way forward.

First, we explain how to create an initial two-page evaluation brief using an orchestration of functions and models from the Hugging Face Hub. Rather than relying on ad-hoc user interactions through a black-box point & click chat interface, a relevant alternative is to use a data science approach with documented and reproducible scripts that can directly output a Word document. The same approach could be applied to other textual analysis needs, for instance: extracting causal chains from the transcriptions of Focus Group Discussions, performing Quality Assurance reviews of key documents, generating potential theories of change from needs assessment reports, or assessing whether programmatic evidence was sufficiently used when developing a Strategic Plan for an Operation.

-Second, we review the techniques that can be used to evaluate the performance of summarisation scripts, both to optimize them and to minimize the risk of AI hallucinations. We generate alternative briefs (#2, #3, #4) and then create a specific test dataset to explore the different metrics that can be used to evaluate the information retrieval process.
+Second, we review the techniques that can be used to evaluate the performance of summarisation scripts, both to optimize them and to minimize the risk of AI hallucinations and misalignment. We generate alternative briefs (#2, #3, #4) and then create a specific test dataset to explore the different metrics that can be used to evaluate the information retrieval process.

Last, we discuss how such an approach can inform decisions and strategies for an efficient AI deployment: while improving the RAG pipeline is the first important step, creating a training dataset with a human in the loop makes it possible to “ground truth” and “fine-tune” an existing model. This not only further increases its performance but also ensures its reliability, both for evidence retrieval and, at a later stage, for learning across systems and contexts.

A short presentation is also available here
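
To make the "First" step above more concrete, here is a minimal Python sketch of that kind of reproducible, script-based orchestration: it retrieves passages from evaluation reports with a sentence-embedding model, asks a hosted instruction-tuned LLM to answer from those passages only, and writes the result straight into a Word document. The file names, model identifiers, question and chunking parameters are illustrative placeholders, not the ones used in the repository's scripts.

```python
# Minimal RAG sketch (illustrative only): retrieve passages from evaluation
# reports, generate an answer grounded in them, and save a Word document.
import numpy as np
from sentence_transformers import SentenceTransformer
from huggingface_hub import InferenceClient  # may require an HF API token
from docx import Document

# 1. Naively chunk the source reports into overlapping character windows.
chunks = []
for path in ["report_2021.txt", "report_2022.txt"]:  # hypothetical files
    text = open(path, encoding="utf-8").read()
    chunks += [text[i:i + 1000] for i in range(0, len(text), 800)]

# 2. Embed the chunks and the question, then keep the most similar passages.
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # any embedding model
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
question = "What were the main protection outcomes of the response?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]
top_idx = np.argsort(chunk_vecs @ q_vec)[-4:][::-1]  # 4 closest chunks

# 3. Ask an instruction-tuned LLM to answer strictly from the retrieved context.
context = "\n\n".join(chunks[i] for i in top_idx)
prompt = (f"Answer the question using only the context below.\n\n"
          f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
client = InferenceClient(model="mistralai/Mistral-7B-Instruct-v0.2")
answer = client.text_generation(prompt, max_new_tokens=400)

# 4. Write the output directly into a Word document, as the brief does.
doc = Document()
doc.add_heading("Evaluation Brief (draft)", level=1)
doc.add_heading(question, level=2)
doc.add_paragraph(answer)
doc.save("evaluation_brief_draft.docx")
```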


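The "Second" step scores each retrieval configuration against a purpose-built test dataset: questions paired with the report passages that genuinely support the answer. The sketch below computes two common retrieval metrics, hit rate@k and mean reciprocal rank, over such a dataset; the test items and the `retrieve()` callable are hypothetical, and LLM-graded metrics such as faithfulness or answer relevancy (for instance via the Ragas library) would complement these purely rank-based scores.

```python
# Sketch of scoring a retriever against a hand-built test dataset.
# Each item pairs a question with the ids of the passages that genuinely
# support the answer (the "ground truth" contexts). Items are hypothetical.
from typing import Callable, Dict, List

test_set = [
    {"question": "How many households received cash assistance?",
     "relevant_ids": {"report_2022_p12"}},
    {"question": "What were the main protection outcomes?",
     "relevant_ids": {"report_2021_p03", "report_2021_p04"}},
]

def evaluate_retriever(retrieve: Callable[[str, int], List[str]],
                       items: List[Dict], k: int = 4) -> Dict[str, float]:
    """hit_rate@k: share of questions where any relevant passage is retrieved.
    mrr@k: mean of 1/rank of the first relevant passage (0 if none found)."""
    hits, rr = 0, 0.0
    for item in items:
        retrieved = retrieve(item["question"], k)  # ids of the top-k passages
        ranks = [i for i, pid in enumerate(retrieved, start=1)
                 if pid in item["relevant_ids"]]
        if ranks:
            hits += 1
            rr += 1.0 / ranks[0]
    n = len(items)
    return {"hit_rate@k": hits / n, "mrr@k": rr / n}

# Usage: plug in each configuration (similarity, MMR, parent, ensemble...)
# and compare the scores, e.g.
#   scores = evaluate_retriever(my_mmr_retriever, test_set, k=4)
#   print(scores)
```
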
diff --git a/index.qmd b/index.qmd
index a6f3da1..8dda06b 100644
--- a/index.qmd
+++ b/index.qmd
@@ -32,7 +32,7 @@ Retrieval-augmented generation (RAG) is an AI Question-Answering framework that
First, we explain how to create an initial [two-page evaluation brief](https://github.com/Edouard-Legoupil/rag_extraction/raw/main/generated/Evaluation_Brief_response_recursivecharactertext_bert.docx){target="_blank"} using an orchestration of functions and models from the [Hugging Face Hub](https://huggingface.co/docs/hub/index){target="_blank"}. Rather than relying on ad-hoc user interactions through a *black-box point & click* chat interface, a relevant alternative is to use a data science approach with documented and **reproducible scripts** that can directly output a Word document. The same approach could be applied to other textual analysis needs, for instance: extracting causal chains from the transcriptions of Focus Group Discussions, performing Quality Assurance reviews of key documents, generating potential theories of change from needs assessment reports, or assessing whether programmatic evidence was sufficiently used when developing a Strategic Plan for an Operation.
-Second, we review the techniques that can be used to **evaluate the performance** of summarisation scripts, both to optimize them and to minimize the risk of AI hallucinations. We generate alternative briefs ([#2](https://github.com/Edouard-Legoupil/rag_extraction/raw/main/generated/Evaluation_Brief_response_mmr_recursivecharactertext_bge.docx){target="_blank"}, [#3](https://github.com/Edouard-Legoupil/rag_extraction/raw/main/generated/Evaluation_Brief_response_parent_recursivecharactertext_bge.docx){target="_blank"}, [#4](https://github.com/Edouard-Legoupil/rag_extraction/raw/main/generated/Evaluation_Brief_response_ensemble_recursivecharactertext_bge.docx){target="_blank"}) and then create a specific [test dataset](https://github.com/Edouard-Legoupil/rag_extraction/tree/main/dataset){target="_blank"} to explore the different metrics that can be used to evaluate the information retrieval process.
+Second, we review the techniques that can be used to **evaluate the performance** of summarisation scripts, both to optimize them and to minimize the risk of AI hallucinations and misalignment. We generate alternative briefs ([#2](https://github.com/Edouard-Legoupil/rag_extraction/raw/main/generated/Evaluation_Brief_response_mmr_recursivecharactertext_bge.docx){target="_blank"}, [#3](https://github.com/Edouard-Legoupil/rag_extraction/raw/main/generated/Evaluation_Brief_response_parent_recursivecharactertext_bge.docx){target="_blank"}, [#4](https://github.com/Edouard-Legoupil/rag_extraction/raw/main/generated/Evaluation_Brief_response_ensemble_recursivecharactertext_bge.docx){target="_blank"}) and then create a specific [test dataset](https://github.com/Edouard-Legoupil/rag_extraction/tree/main/dataset){target="_blank"} to explore the different metrics that can be used to evaluate the information retrieval process.
Last, we discuss how such an approach can inform decisions and strategies for an **efficient AI deployment**: while improving the RAG pipeline is the first important step, creating a training dataset with a human in the loop makes it possible to "ground truth" and "fine-tune" an existing model. This not only further increases its performance but also ensures its reliability, both for evidence retrieval and, at a later stage, for learning across systems and contexts.
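
Finally, the "ground truth" and "fine-tune" step described in the last paragraph amounts to exporting human-reviewed question/context/answer triples in a format that supervised fine-tuning tools can consume. The sketch below assumes a hypothetical record structure coming out of a human-in-the-loop review and a common JSONL instruction-tuning layout; the field names and example values are illustrative only.

```python
# Sketch: turn human-validated Q&A pairs into a JSONL fine-tuning dataset.
# A reviewer either approves the model's draft answer or supplies a corrected
# one; only reviewed, approved records are exported as "ground truth".
import json

reviewed = [  # hypothetical records from a human-in-the-loop review
    {"question": "How many households received cash assistance?",
     "context": "Retrieved passages shown to the reviewer go here.",
     "draft_answer": "About 12,000 households.",
     "corrected_answer": "12,400 households across three governorates.",
     "approved": True},
]

with open("finetune_dataset.jsonl", "w", encoding="utf-8") as f:
    for rec in reviewed:
        if not rec["approved"]:
            continue  # keep only human-validated examples
        answer = rec["corrected_answer"] or rec["draft_answer"]
        example = {
            "instruction": "Answer the question using only the provided context.",
            "input": f"Context:\n{rec['context']}\n\nQuestion: {rec['question']}",
            "output": answer,
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```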