
EPIC: LeapfrogAI Evaluations v1.1 #1171

Open
6 tasks
jalling97 opened this issue Oct 1, 2024 · 0 comments
Assignees: jalling97
Labels: EPIC ⚔️ EPIC issue to consolidate several sub-issues


jalling97 (Contributor) commented Oct 1, 2024

LeapfrogAI Evaluations v1.1

Description

Now that a baseline evaluations framework for LeapfrogAI exists, it needs to be further expanded to meet the needs of the product and mission-success teams.

Feedback has been provided, and the common themes that need to be addressed are:

  • Some evaluations (primarily NIAH) always pass at 100% and, as such, are not helpful for tracking growth over time
  • Some NIAH and QA evals do not leverage the full chunk data in RAG responses and therefore do not evaluate RAG to the extent they should
  • Evaluation results are not currently being stored anywhere
  • The current implementation of LFAI evals is specific to the OpenAI way of handling RAG, so the evaluations can't be run against custom RAG pipelines (a delivery concern); a rough sketch of a more pipeline-agnostic approach follows this list
  • MMLU results sometimes suspiciously return the same score for multiple topics, indicating a potential problem with the evaluation 🐛
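
To illustrate the chunk-data and custom-pipeline points above, one option is for the evals to call any RAG pipeline through a small interface rather than the OpenAI Assistants API directly, which would also expose the retrieved chunk text to NIAH/QA scoring. This is a minimal sketch only; the names `RAGPipeline`, `RAGResult`, and `run_niah_case` are hypothetical and not part of the existing LFAI evals code:

```python
# Minimal sketch (not existing LFAI code) of decoupling evals from the
# OpenAI Assistants API. All names here are hypothetical.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class RAGResult:
    """What an evaluation needs back from a pipeline for a single query."""
    answer: str
    # Full text of every retrieved chunk, so NIAH/QA evals can score
    # retrieval itself rather than only the final generated answer.
    retrieved_chunks: list[str] = field(default_factory=list)


class RAGPipeline(Protocol):
    """Anything that can answer a question over an uploaded document set."""
    def query(self, question: str) -> RAGResult: ...


def run_niah_case(pipeline: RAGPipeline, question: str, needle: str) -> dict:
    """Score one needle-in-a-haystack case against an arbitrary pipeline."""
    result = pipeline.query(question)
    return {
        "question": question,
        "needle_retrieved": any(needle in chunk for chunk in result.retrieved_chunks),
        "needle_in_answer": needle in result.answer,
    }
```

A delivery team could then implement `query()` on top of their own retriever and reuse the same NIAH/QA cases unchanged, and the returned dicts give a natural record to persist for tracking results over time.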

Completion Criteria

jalling97 added the EPIC ⚔️ label on Oct 1, 2024
jalling97 self-assigned this on Oct 1, 2024