Background
Currently, HELM only supports single-turn scenarios. The main evaluation flow in Runner.run_one() works as follows:
1. Get instances from the scenario.
2. Get prompts for the instances from the adapter.
3. Run prompts through the model to get outputs.
4. Score outputs on metrics.
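For orientation, here is a simplified sketch of that flow. This is a paraphrase, not HELM's exact API: free names like scenario, adapter, executor, and metrics stand in for the runner's actual attributes and helpers.

```python
# Hypothetical sketch of the current single-turn flow in Runner.run_one().
def run_one(run_spec: RunSpec) -> None:
    instances = scenario.get_instances()                    # 1. instances from the scenario
    scenario_state = adapter.adapt(instances, parallelism)  # 2. prompts from the adapter
    scenario_state = executor.execute(scenario_state)       # 3. run prompts through the model
    stats = [metric.evaluate(scenario_state) for metric in metrics]  # 4. score on metrics
```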
However, some scenarios need to look at outputs from the model, and then send more prompts to the model. Examples:
In MT-Bench, the model is given an initial prompt, and then it is given a follow-up prompt with the context of the first interaction.
In SOTOPIA, two models converse for several turns, and a third model grades the conversation.
Proposal
Add a loop to Runner.run_one() around steps 2 and 3. After running prompts through the model, call the adapter again to get additional prompts, and then run those through the model. Repeat until the adapter stops producing new prompts.
Concretely, the Adapter will have a new method for generating more request states:
```python
class Adapter(ABC):
    def adapt_next_turn(self, instances: List[Instance], scenario_state: ScenarioState, parallelism: int) -> List[RequestState]:
        """Takes `Instance`s and all previous `RequestState`s, and returns new `RequestState`s."""
        return []
```
In the frontend, all requests for a conversation will be grouped under the same instance and displayed together.
Pros
Because the next turns for all instances are generated and processed together in each iteration of the loop, we get thread-level parallelism when sending requests to the models.
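For illustration, the per-turn fan-out could look like the following; ThreadPoolExecutor here stands in for whatever parallel map the executor already uses, and process is a hypothetical per-request callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def execute_turn(request_states: List["RequestState"],
                 process: Callable[["RequestState"], "RequestState"],
                 parallelism: int) -> List["RequestState"]:
    """Send every request in the current turn to the model concurrently."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(process, request_states))
```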
Cons
The API is somewhat unnatural if the user is thinking in terms of the chronology of a single conversation, since the method requires operating on all conversations at once. The user thus needs to perform some internal bookkeeping. We could alleviate this by providing bookkeeping utilities in helper classes or functions, as sketched below.
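For example, a minimal sketch of one such utility, assuming each RequestState carries its Instance and that instances are keyed by instance.id (both assumptions, not confirmed API):

```python
from collections import defaultdict
from typing import Dict, List

def conversations_by_instance(scenario_state: "ScenarioState") -> Dict[str, List["RequestState"]]:
    """Regroup the flat request-state history into one chronological list per instance."""
    conversations: Dict[str, List["RequestState"]] = defaultdict(list)
    for request_state in scenario_state.request_states:
        conversations[request_state.instance.id].append(request_state)
    return dict(conversations)
```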
This doesn't address the high storage cost of multi-turn conversations, which is O(N^2) in the number of turns N: we keep all N requests, and the kth request includes all previous context, so its length grows roughly linearly with k, giving 1 + 2 + ... + N ≈ N^2/2 total.