Hi,
I just found this paper: ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL. It outlines the difference between single-turn RL agents (which seek to resolve a request within a single turn) and multi-turn RL agents, which can execute information-gathering actions and address requests in a targeted manner over several turns.
In #4 you mention:

> 500 tasks produces roughly between 3000-5000 training data (state-action pairs)
I didn't have time yet to run WebRL myself, but could you maybe give me an intuition of what the data in the RL step looks like?
When sampled from your replay buffer, is it:
1. A disconnected `(<prompt>, <answer>)` pair from any time-step in a successful trajectory (where the prompt consists of the task instruction, action history, and current observation)?
2. A whole trajectory, including:
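For concreteness, here is a rough sketch of what I imagine these two variants could look like when sampled. All field names, actions, and values below are purely my own illustration and are not taken from the WebRL code:

```python
# Purely illustrative guesses at the sample format, not from the WebRL code.

# Variant 1: a disconnected (prompt, answer) pair for a single time-step
sample_pair = {
    "prompt": (
        "Task: Find the cheapest laptop in the shop.\n"        # task instruction
        "Action history: click('nav_shop'); scroll(down)\n"    # previous actions
        "Current observation: <HTML/accessibility tree here>"  # current page state
    ),
    "answer": "click('sort_by_price')",  # the action taken at this time-step
}

# Variant 2: a whole successful trajectory kept together as one sample
sample_trajectory = {
    "task": "Find the cheapest laptop in the shop.",
    "steps": [
        {"observation": "<page state>", "action": "click('nav_shop')"},
        {"observation": "<page state>", "action": "scroll(down)"},
        {"observation": "<page state>", "action": "click('sort_by_price')"},
    ],
    "success": True,
}
```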
Thank you for the clarification!