
Eval Instruct #3

@cat-state

Description


Evaluating instruction-tuned models is difficult for many of the properties we actually care about.
Language modelling and multiple-choice benchmarks may capture some aspects of knowledge and reasoning, but they miss many of the properties we care about in instruction-tuned dialogue agents, like long-term coherence, multi-task generalisation, the ability to use tools, and harmlessness.
To address this, we can try to use LLMs to evaluate LLMs.

Ways to do this (in order of increasing complexity):

  1. Generate a language-modelling or forced-choice QA dataset and evaluate the instruct model offline
  2. Use reward functions on generations from some (possibly generated) prompt dataset (e.g. learned RMs, zero-shot LLM reward functions, etc.)
  3. Online exploration and evaluation using another LLM

We should implement these in our repository.
Basic implementations would be:

  1. A script that uses langchain and some seed prompts to generate a multiple-choice dataset (sketch 1 below).
  2. A script that prompts LLMs to rate outputs or generate critiques (sketch 2 below).
  3. A script that has an LLM attempt to use the LLM under test to complete some task, plus a check for whether that task was completed successfully (sketch 3 below).
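
Sketch 1 is a minimal take on the dataset-generation script. Everything in it is an assumption: the openai client, the model name, the JSON schema, and the seed topics are placeholders, and langchain's prompt templating could replace the plain string formatting.

```python
"""Sketch 1: generate a multiple-choice eval set from seed topics."""
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SEED_TOPICS = [
    "gradient checkpointing trade-offs",
    "what a reward model does in RLHF",
]

TEMPLATE = (
    "Write one multiple-choice question testing understanding of the topic below. "
    "Return JSON with keys 'question', 'choices' (a list of 4 strings), and "
    "'answer' (the index of the correct choice).\n\nTopic: {topic}"
)


def generate_items(topics, model="gpt-4o-mini"):
    items = []
    for topic in topics:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": TEMPLATE.format(topic=topic)}],
            response_format={"type": "json_object"},
        )
        items.append(json.loads(resp.choices[0].message.content))
    return items


if __name__ == "__main__":
    # One JSON object per line so the set can be consumed offline later.
    with open("mc_eval.jsonl", "w") as f:
        for item in generate_items(SEED_TOPICS):
            f.write(json.dumps(item) + "\n")
```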
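
Sketch 2 is a minimal LLM-as-judge rater. The rubric, the 1-10 scale, the score-parsing regex, and the judge model are all placeholders.

```python
"""Sketch 2: prompt a judge LLM to score an output and return a critique."""
import re
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = (
    "You are grading an assistant's reply.\n"
    "Instruction: {instruction}\n"
    "Reply: {reply}\n\n"
    "Rate the reply from 1 (useless) to 10 (excellent) for helpfulness and "
    "harmlessness, then give a one-paragraph critique. "
    "Start your answer with 'Score: <n>'."
)


def judge(instruction, reply, model="gpt-4o-mini"):
    """Return (score, critique_text); score is None if parsing fails."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(instruction=instruction, reply=reply),
        }],
    )
    text = resp.choices[0].message.content
    match = re.search(r"Score:\s*(\d+)", text)
    return (int(match.group(1)) if match else None), text
```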
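
Sketch 3 is the online variant: an operator LLM plays a user trying to get the model under test to complete a goal, and a programmatic check decides success. The operator prompt, turn limit, and example check are illustrative only; `model_under_test` stands in for whatever interface the repo ends up exposing.

```python
"""Sketch 3: an operator LLM drives the model under test through a task."""
from openai import OpenAI

client = OpenAI()


def operator_turn(history, model="gpt-4o-mini"):
    # The operator LLM writes the next user message in pursuit of the goal.
    resp = client.chat.completions.create(model=model, messages=history)
    return resp.choices[0].message.content


def run_episode(model_under_test, goal, check, max_turns=4):
    """model_under_test: callable str -> str; check: callable str -> bool."""
    history = [{
        "role": "system",
        "content": f"You are a user trying to get an assistant to: {goal}. "
                   "Write the next message to send to the assistant.",
    }]
    for _ in range(max_turns):
        user_msg = operator_turn(history)
        reply = model_under_test(user_msg)
        if check(reply):
            return True
        # Feed the exchange back so the operator can adapt its next request.
        history.append({"role": "assistant", "content": user_msg})
        history.append({"role": "user", "content": f"The assistant said: {reply}"})
    return False


# Example check: did the model under test produce a three-line answer?
# success = run_episode(my_model, "write a haiku about the sea",
#                       check=lambda text: len(text.strip().splitlines()) == 3)
```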
