Hi,
I just found this paper: ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL. It outlines the difference between single-turn RL agents (which seek to resolve a request within a single turn) and multi-turn RL agents, which can execute information-gathering actions and address requests in a targeted manner over several turns.
In #4 you mention:

> 500 tasks produces roughly between 3000-5000 training data (state-action pairs)
I didn't have time yet to run WebRL myself, but could you maybe give me an intuition of what the data in the RL step looks like?
When sampled from your replay buffer, is it:
1. A disconnected `(<prompt>, <answer>)` pair from any time-step in a successful trajectory (where the prompt consists of the task instruction, action history, and current observation)?
2. A whole trajectory, including:
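For concreteness, here is a rough sketch of what I imagine these two variants could look like when sampled. All field names, actions, and values below are purely my own illustration and are not taken from the WebRL code:

```python
# Purely illustrative guesses at the sample format, not from the WebRL code.

# Variant 1: a disconnected (prompt, answer) pair for a single time-step
sample_pair = {
    "prompt": (
        "Task: Find the cheapest laptop in the shop.\n"        # task instruction
        "Action history: click('nav_shop'); scroll(down)\n"    # previous actions
        "Current observation: <HTML/accessibility tree here>"  # current page state
    ),
    "answer": "click('sort_by_price')",  # the action taken at this time-step
}

# Variant 2: a whole successful trajectory kept together as one sample
sample_trajectory = {
    "task": "Find the cheapest laptop in the shop.",
    "steps": [
        {"observation": "<page state>", "action": "click('nav_shop')"},
        {"observation": "<page state>", "action": "scroll(down)"},
        {"observation": "<page state>", "action": "click('sort_by_price')"},
    ],
    "success": True,
}
```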
Thank you for the clarification!