adding CFPO to verl #5252
base: main
Conversation
Code Review
This pull request adds support for the CFPO policy loss. The implementation is a good start, but it includes some unused code, likely from a copy-paste, which should be removed for clarity. I've provided specific suggestions for this. A more critical issue is that the new 'cfpo' loss mode is not added to the configuration files (PolicyLossConfig in verl/workers/config/actor.py and verl/trainer/config/actor/actor.yaml), which will prevent users from selecting it. This should be addressed to make the feature usable.
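To make the review's point concrete, here is a minimal sketch of what exposing the new mode in the config might look like. The name `PolicyLossConfig` and the `loss_mode` field come from the review comment; the surrounding fields and defaults are illustrative assumptions about verl's config layout, not the actual class.

```python
# Hypothetical sketch: exposing "cfpo" as a selectable loss mode.
# PolicyLossConfig / loss_mode are named in the review; the exact
# dataclass shape here is an assumption, not verl's real definition.
from dataclasses import dataclass


@dataclass
class PolicyLossConfig:
    # "cfpo" would need to be accepted alongside the existing modes,
    # and mirrored in verl/trainer/config/actor/actor.yaml.
    loss_mode: str = "vanilla"
    clip_ratio: float = 0.2


# A user selecting the new loss mode:
cfg = PolicyLossConfig(loss_mode="cfpo")
```

Without this registration step, the new loss function exists in `core_algos.py` but cannot be reached from a training config.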
verl/trainer/ppo/core_algos.py
Outdated
```python
clip_ratio_c = config.get(  # Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729.
    "clip_ratio_c", 3.0
)

cliprange = clip_ratio
cliprange_low = clip_ratio_low
cliprange_high = clip_ratio_high

assert clip_ratio_c > 1.0, (
    "The lower bound of the clip_ratio_c for dual-clip PPO should be greater than 1.0,"
    + f" but get the value: {clip_ratio_c}."
)
```
The clip_ratio_c parameter and its associated assertion are specific to dual-clip PPO and are not used in the CFPO loss calculation or its metrics. This code should be removed to avoid confusion and keep the implementation clean.
Suggested change:

```diff
-clip_ratio_c = config.get(  # Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729.
-    "clip_ratio_c", 3.0
-)
 cliprange = clip_ratio
 cliprange_low = clip_ratio_low
 cliprange_high = clip_ratio_high
-assert clip_ratio_c > 1.0, (
-    "The lower bound of the clip_ratio_c for dual-clip PPO should be greater than 1.0,"
-    + f" but get the value: {clip_ratio_c}."
-)
```
verl/trainer/ppo/core_algos.py
Outdated
```python
### This code is for logging purposes
pg_losses1 = -advantages * ratio
if cliprange_low is None:
    cliprange_low = cliprange
if cliprange_high is None:
    cliprange_high = cliprange
pg_losses2 = -advantages * torch.clamp(
    ratio, 1 - cliprange_low, 1 + cliprange_high
)  # - clip(ratio, 1-cliprange, 1+cliprange) * A
clip_pg_losses1 = torch.maximum(
    pg_losses1, pg_losses2
)  # max(-ratio * A, -clip(ratio, 1-cliprange, 1+cliprange) * A)
pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)

pg_losses3 = -advantages * clip_ratio_c

pg_clipfrac_lower = verl_F.masked_mean(
    torch.gt(clip_pg_losses1, pg_losses3) * (advantages < 0).float(), response_mask
)
### This code is for logging purposes
```
This block of code, marked for "logging purposes", appears to be dead code from a copy-paste. The variables pg_clipfrac and pg_clipfrac_lower are calculated here but are immediately overwritten by a new calculation later in the function. Other variables defined in this block are not used outside of it. This block should be removed to improve clarity and avoid unnecessary computations.
CFPO in VERL
Overview
Adds support for CFPO (Clipping-Free Policy Optimization), a clipping-free alternative to PPO-style objectives. CFPO replaces ratio clipping with a smooth quadratic penalty, removing zero-gradient regions while maintaining stable updates.
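The PR itself does not reproduce the objective, so as a rough illustration of the idea only: a toy comparison of PPO's clipped surrogate against a quadratically penalized one. The penalty form and coefficient below are assumptions chosen to show the zero-gradient point, not the exact CFPO loss from the paper.

```python
# Illustrative sketch (plain Python scalars): why a quadratic penalty
# removes the zero-gradient region that ratio clipping creates.
# The penalty form below is an assumption, not the exact CFPO objective.

def ppo_clipped_loss(ratio, advantage, eps=0.2):
    """PPO surrogate: once the ratio leaves [1-eps, 1+eps] on the
    favorable side, the loss is flat, so the gradient is zero."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return max(-advantage * ratio, -advantage * clipped)


def quadratic_penalty_loss(ratio, advantage, beta=1.0):
    """Clipping-free variant: unclipped surrogate plus a smooth quadratic
    penalty on the ratio's deviation from 1, so the gradient never vanishes."""
    return -advantage * ratio + beta * (ratio - 1.0) ** 2


# With a positive advantage, the clipped loss is identical at ratio 2.0
# and 3.0 (flat region), while the penalized loss keeps growing and so
# keeps pulling the ratio back toward 1.
flat = ppo_clipped_loss(2.0, 1.0) == ppo_clipped_loss(3.0, 1.0)  # True
```

This is only meant to convey the mechanism the overview describes; the actual implementation in `verl/trainer/ppo/core_algos.py` operates on token-level tensors with masking, as in the snippets above.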
Features
Reference
https://arxiv.org/abs/2601.22801