Description:
We want to be able to evaluate the factual correctness of lex.llm workflow outputs using lex.eval.
Acceptance criteria:
- Possible to run factual correctness evaluation based on data from lex.db for a particular workflow
Technical details:
Details on factual correctness can be found here: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness/#factual-correctness
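For context, the RAGAS factual correctness metric decomposes the response and the reference into atomic claims and scores their overlap: TP = response claims supported by the reference, FP = response claims not supported, FN = reference claims missing from the response. Below is a minimal sketch of just the scoring step (the claim decomposition/verification itself would be LLM-driven and is out of scope here); the `ClaimCounts` type and function name are illustrative, not part of any existing API.

```python
from dataclasses import dataclass


@dataclass
class ClaimCounts:
    tp: int  # response claims supported by the reference
    fp: int  # response claims not supported by the reference
    fn: int  # reference claims missing from the response


def factual_correctness_f1(counts: ClaimCounts) -> float:
    """Claim-level F1, as in the RAGAS factual correctness metric."""
    p_denom = counts.tp + counts.fp
    r_denom = counts.tp + counts.fn
    precision = counts.tp / p_denom if p_denom else 0.0
    recall = counts.tp / r_denom if r_denom else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, 3 supported claims, 1 unsupported claim, and 1 missed reference claim give precision = recall = 0.75, so F1 = 0.75.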
We will likely not be able to use RAGAS directly, but we would probably want to use Pydantic Evals: https://ai.pydantic.dev/evals/#pydantic-evals-package
It's likely easiest to set up the system with a CLI to begin with.
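A possible shape for that CLI, sketched with stdlib argparse; the flag names (`--workflow`, `--db-url`, `--limit`) are placeholders and would need to match whatever lex.db actually exposes:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags; the real schema depends on lex.db / lex.eval.
    parser = argparse.ArgumentParser(
        prog="lex-eval",
        description="Run factual correctness evaluation for a lex.llm workflow.",
    )
    parser.add_argument("--workflow", required=True,
                        help="Workflow identifier in lex.db")
    parser.add_argument("--db-url", default=None,
                        help="Connection string for lex.db")
    parser.add_argument("--limit", type=int, default=100,
                        help="Maximum number of records to evaluate")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Evaluating workflow {args.workflow} (limit={args.limit})")
```

Invocation would look like `lex-eval --workflow wf-123 --limit 50`, which keeps the evaluation runnable ad hoc before any scheduling or UI exists.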
Design:
Optional details on design for context.