Curious about the off policy part on pi_{phi} #46

@zpointS

Hi Jianhao and other authors,
Thank you for your wonderful work!
I'm a little confused about how you actually handle the pi_{phi} term in the off-policy loss. Sorry, I'm not very familiar with RL.
The paper says you adopt pi_{phi} == 1, so does that mean the off-policy objective simply becomes the average of pi_{theta} * adv over all tokens of all samples (ignoring the other calibrating parts)?
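To make sure I'm reading it right (ignoring clipping and the other calibrating terms, and using my own placeholder notation where $\hat{A}_{i,t}$ is the advantage of token $t$ in sample $i$), I understand the objective as something like:

$$
J(\theta) \approx \frac{1}{\sum_i |y_i|} \sum_i \sum_t \pi_\theta\big(y_{i,t} \mid x_i, y_{i,<t}\big)\, \hat{A}_{i,t}
$$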
I'm also wondering whether the implementation actually uses log(pi_{theta}), because modern LLMs keep telling me that even with pi_{phi} == 1 we should use logp rather than the raw probability for pi_{theta}, as is common practice in policy gradient. But PG is different from PPO and GRPO, and that's what confuses me.
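For concreteness, here is a minimal sketch of the two readings I have in mind (PyTorch-style; the names `off_policy_surrogate`, `vanilla_pg_surrogate`, `logp_theta`, and `adv` are just my placeholders, not your implementation):

```python
import torch

# Placeholder shapes: logp_theta and adv are per-token tensors of shape
# (num_tokens,), already flattened over all samples in the batch.

def off_policy_surrogate(logp_theta: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
    # Reading (a): importance-weighted surrogate with pi_phi == 1,
    # so the per-token weight is the raw probability pi_theta = exp(log pi_theta).
    ratio = torch.exp(logp_theta)  # pi_theta / pi_phi with pi_phi == 1
    return -(ratio * adv).mean()

def vanilla_pg_surrogate(logp_theta: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
    # Reading (b): REINFORCE-style surrogate, log pi_theta * advantage.
    return -(logp_theta * adv).mean()
```

As far as I understand, (a) effectively scales each token's gradient by pi_theta itself (since the gradient of pi_theta equals pi_theta times the gradient of log pi_theta), whereas in PPO/GRPO the detached old policy in the denominator keeps the ratio near 1, so the two surrogates behave quite differently. That's the behavior I'd like to confirm.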
Thanks!
