-
Notifications
You must be signed in to change notification settings - Fork 17
Open
Description
Evaluation of instruction tuned models is difficult for many of the properties we actually care about.
Language modelling and multiple choice benchmarks may capture some aspects of knowledge and reasoning but don't capture many of the properties we care about in instruction tuned dialog agents, like long term coherence, multi task generalisation, ability to use tools, harmlessness, etc.
To address this, we can try to use LLMs to evaluate LLMs.
Ways to do this (in order of increasing complexity):
- Generate a LM or forced choice QA dataset and evaluate the instruct model offline
- Use reward functions on generations from some (possibly generated) prompt dataset (e.g learn RMs, zero-shot LLM reward functions, etc)
- Online exploration and evaluation using another LLM
We should implement these into our repository.
Basic implementations would be:
- A script that uses langchain and some seed prompts to generate a multiple choice dataset.
- A script that prompts LLMs to rate outputs or generate critiques
- A script that has an LLM attempt to use the LLM under test to complete some task, and a check for if that task was successfully completed.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels