
Clarification on data used in RL loop #38

Open

korbinian-hoermann opened this issue Feb 2, 2025 · 0 comments

korbinian-hoermann commented Feb 2, 2025

Hi,

I just found this paper: ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL. It outlines the difference between single-turn RL agents (which seek to resolve a request within a single turn) and multi-turn RL agents, which can execute information-gathering actions and address a request in a targeted manner over several turns.

In #4 you mention:

500 tasks produces roughly between 3000-5000 training data (state-action pair)

I haven't had time to run WebRL myself yet, but could you give me an intuition of what the data in the RL step looks like?

When sampling from your replay buffer, is each training example (a sketch of both options follows the list):

  1. A disconnected (<prompt>, <answer>) pair from any time step of a successful trajectory (where the prompt consists of the task instruction, the action history, and the current observation)?

  2. A whole trajectory, e.g.:

{
  "task": "do abc",
  "trajectory": [(o1, a1), (o2, a2), ...]
}
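
To make the two options concrete, here is a minimal sketch of how I picture each format (the field names, the prompt layout, and the flatten helper are my own guesses for illustration, not WebRL's actual schema):

from dataclasses import dataclass, field
from typing import List, Tuple

# Option 1: flat (prompt, answer) pairs, one per time step.
# The prompt bundles the task instruction, action history, and current observation.
@dataclass
class StepPair:
    prompt: str  # e.g. "Task: ...\nHistory: a1 a2\nObservation: o3"
    answer: str  # the action taken at this step

# Option 2: whole trajectories kept intact.
@dataclass
class Trajectory:
    task: str  # e.g. "do abc"
    steps: List[Tuple[str, str]] = field(default_factory=list)  # [(o1, a1), (o2, a2), ...]

# A trajectory in format 2 can always be flattened into format-1 pairs:
def flatten(traj: Trajectory) -> List[StepPair]:
    pairs: List[StepPair] = []
    history: List[str] = []
    for obs, action in traj.steps:
        prompt = f"Task: {traj.task}\nHistory: {' '.join(history)}\nObservation: {obs}"
        pairs.append(StepPair(prompt=prompt, answer=action))
        history.append(action)
    return pairs

If the answer is option 1, I assume the 3000-5000 figure from #4 counts these flattened state-action pairs rather than whole trajectories.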

Thank you for the clarification!

korbinian-hoermann changed the title from "Clarification of data used in RL loop" to "Clarification on data used in RL loop" on Feb 2, 2025