Hi Jianhao and other authors,
Thank you for your wonderful work!
I'm a bit confused about how you actually handle the pi_{phi} term in the off-policy loss. Sorry, I'm not very familiar with RL.
The paper says you set pi_{phi} == 1. Does that mean the off-policy objective simply becomes the average of pi_{theta} * adv over all tokens of all samples (ignoring the other calibrating terms)?
I'm also wondering whether your implementation actually uses log(pi_{theta}), because modern LLMs keep telling me that even when pi_{phi} == 1 is given, the common practice in policy gradient is to use logp rather than the raw probability for pi_{theta}. But vanilla PG is different from PPO and GRPO, and that's what confuses me. A minimal sketch of the two readings I have in mind is below.
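Just to make the question concrete, here is a toy sketch of the two variants, with made-up placeholder tensors (`logp_theta`, `adv`) standing in for per-token log-probs and advantages; this is not your actual code, just my guess at the two possible losses:

```python
import torch

# Placeholder per-token tensors: 4 samples x 8 tokens (purely illustrative)
logp_theta = torch.randn(4, 8)  # log pi_theta(token | prefix)
adv = torch.randn(4, 8)         # per-token advantages

# Reading 1: with pi_phi == 1, the importance ratio pi_theta / pi_phi is just
# pi_theta, so the off-policy objective is the mean of pi_theta * adv.
loss_ratio = -(logp_theta.exp() * adv).mean()

# Reading 2: standard policy-gradient practice, using log pi_theta directly.
loss_logp = -(logp_theta * adv).mean()

print(loss_ratio.item(), loss_logp.item())
```

Which of these (if either) matches what you actually compute?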
Thanks!
