Background
Currently, HELM only supports single-turn scenarios. The main evaluation flow in Runner.run_one() works as follows:
1. Get instances from the scenario.
2. Get prompts for the instances from the adapter.
3. Run prompts through the model to get outputs.
4. Score outputs on metrics.
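For orientation, here is a simplified sketch of that flow. This is a paraphrase, not HELM's exact API: free names like scenario, adapter, executor, and metrics stand in for the runner's actual attributes and helpers.

```python
# Hypothetical sketch of the current single-turn flow in Runner.run_one().
def run_one(run_spec: RunSpec) -> None:
    instances = scenario.get_instances()                    # 1. instances from the scenario
    scenario_state = adapter.adapt(instances, parallelism)  # 2. prompts from the adapter
    scenario_state = executor.execute(scenario_state)       # 3. run prompts through the model
    stats = [metric.evaluate(scenario_state) for metric in metrics]  # 4. score on metrics
```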
However, some scenarios need to look at outputs from the model, and then send more prompts to the model. Examples:
In MT-Bench, the model is given an initial prompt, and then it is given a follow-up prompt with the context of the first interaction.
In SOTOPIA, two models converse for several turns, and a third model grades the conversation.
Proposal
Add a loop to Runner.run_one() around steps 2 and 3. After running prompts through the model, call the adapter again to get additional prompts, and then run those through the model. Repeat until the adapter stops producing new prompts.
Concretely, the Adapter will have a new method for generating more request states:
```python
class Adapter(ABC):
    def adapt_next_turn(self, instances: List[Instance], scenario_state: ScenarioState, parallelism: int) -> List[RequestState]:
        """Takes `Instance`s and all previous `RequestState`s, and returns new `RequestState`s."""
        return []
```
In the frontend, all requests for a conversation will be grouped under the same instance and displayed together.
Pros
Because the next turns for all instances are generated and processed together in each iteration of the loop, we get thread-level parallelism when sending requests to the models.
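For illustration, the per-turn fan-out could look like the following; ThreadPoolExecutor here stands in for whatever parallel map the executor already uses, and process is a hypothetical per-request callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def execute_turn(request_states: List["RequestState"],
                 process: Callable[["RequestState"], "RequestState"],
                 parallelism: int) -> List["RequestState"]:
    """Send every request in the current turn to the model concurrently."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(process, request_states))
```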
Cons
The API is somewhat unnatural if the user is thinking in terms of the chronology of a single conversation, since the method requires operating on all conversations at once. The user thus needs to perform some internal bookkeeping. We could alleviate this by providing bookkeeping utilities in helper classes or functions, as sketched below.
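For example, a minimal sketch of one such utility, assuming each RequestState carries its Instance and that instances are keyed by instance.id (both assumptions, not confirmed API):

```python
from collections import defaultdict
from typing import Dict, List

def conversations_by_instance(scenario_state: "ScenarioState") -> Dict[str, List["RequestState"]]:
    """Regroup the flat request-state history into one chronological list per instance."""
    conversations: Dict[str, List["RequestState"]] = defaultdict(list)
    for request_state in scenario_state.request_states:
        conversations[request_state.instance.id].append(request_state)
    return dict(conversations)
```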
This doesn't address the high storage cost of multi-turn conversations, which is O(N^2) in the number of turns N: we keep all N requests, and the kth request includes all previous context, so its length grows roughly linearly with k, giving 1 + 2 + ... + N ≈ N^2/2 total.