Hi Jianhao and other authors,
Thank you for your wonderful work!
I'm a bit confused about how you actually handle the pi_{phi} term in the off-policy loss. Sorry, I'm not very familiar with RL.
The paper says you set pi_{phi} == 1. Does that mean the off-policy objective simply becomes the average of pi_{theta} * adv over all tokens of all samples (ignoring the other calibrating terms)?
I'm also wondering whether your implementation actually uses log(pi_{theta}), because modern LLMs keep telling me that even when pi_{phi} == 1 is given, the common practice in policy gradient is to use logp rather than the raw probability for pi_{theta}. But vanilla PG is different from PPO and GRPO, and that's what confuses me. A minimal sketch of the two readings I have in mind is below.
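Just to make the question concrete, here is a toy sketch of the two variants, with made-up placeholder tensors (`logp_theta`, `adv`) standing in for per-token log-probs and advantages; this is not your actual code, just my guess at the two possible losses:

```python
import torch

# Placeholder per-token tensors: 4 samples x 8 tokens (purely illustrative)
logp_theta = torch.randn(4, 8)  # log pi_theta(token | prefix)
adv = torch.randn(4, 8)         # per-token advantages

# Reading 1: with pi_phi == 1, the importance ratio pi_theta / pi_phi is just
# pi_theta, so the off-policy objective is the mean of pi_theta * adv.
loss_ratio = -(logp_theta.exp() * adv).mean()

# Reading 2: standard policy-gradient practice, using log pi_theta directly.
loss_logp = -(logp_theta * adv).mean()

print(loss_ratio.item(), loss_logp.item())
```

Which of these (if either) matches what you actually compute?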
Thanks!
