Skip to content

Conversation

@asparius
Copy link

@asparius asparius commented Feb 9, 2026

CFPO in VERL

Overview

Adds support for CFPO (Clipping-Free Policy Optimization), a clipping-free alternative to PPO-style objectives. CFPO replaces ratio clipping with a smooth quadratic penalty, removing zero-gradient regions while maintaining stable updates.

Features

  • No ratio clipping
  • Quadratic penalty objective
  • Fully differentiable
  • No additional hyperparameters
  • Drop-in, opt-in (default behavior unchanged)

Reference

https://arxiv.org/abs/2601.22801

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for the CFPO policy loss. The implementation is a good start, but it includes some unused code, likely from a copy-paste, which should be removed for clarity. I've provided specific suggestions for this. A more critical issue is that the new 'cfpo' loss mode is not added to the configuration files (PolicyLossConfig in verl/workers/config/actor.py and verl/trainer/config/actor/actor.yaml), which will prevent users from selecting it. This should be addressed to make the feature usable.

Comment on lines 1293 to 1304
clip_ratio_c = config.get( # Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729.
"clip_ratio_c", 3.0
)

cliprange = clip_ratio
cliprange_low = clip_ratio_low
cliprange_high = clip_ratio_high

assert clip_ratio_c > 1.0, (
"The lower bound of the clip_ratio_c for dual-clip PPO should be greater than 1.0,"
+ f" but get the value: {clip_ratio_c}."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The clip_ratio_c parameter and its associated assertion are specific to dual-clip PPO and are not used in the CFPO loss calculation or its metrics. This code should be removed to avoid confusion and keep the implementation clean.

Suggested change
clip_ratio_c = config.get( # Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729.
"clip_ratio_c", 3.0
)
cliprange = clip_ratio
cliprange_low = clip_ratio_low
cliprange_high = clip_ratio_high
assert clip_ratio_c > 1.0, (
"The lower bound of the clip_ratio_c for dual-clip PPO should be greater than 1.0,"
+ f" but get the value: {clip_ratio_c}."
)
cliprange = clip_ratio
cliprange_low = clip_ratio_low
cliprange_high = clip_ratio_high

Comment on lines 1313 to 1332
### This code is for logging purposes
pg_losses1 = -advantages * ratio
if cliprange_low is None:
cliprange_low = cliprange
if cliprange_high is None:
cliprange_high = cliprange
pg_losses2 = -advantages * torch.clamp(
ratio, 1 - cliprange_low, 1 + cliprange_high
) # - clip(ratio, 1-cliprange, 1+cliprange) * A
clip_pg_losses1 = torch.maximum(
pg_losses1, pg_losses2
) # max(-ratio * A, -clip(ratio, 1-cliprange, 1+cliprange) * A)
pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)

pg_losses3 = -advantages * clip_ratio_c

pg_clipfrac_lower = verl_F.masked_mean(
torch.gt(clip_pg_losses1, pg_losses3) * (advantages < 0).float(), response_mask
)
### This code is for logging purposes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This block of code, marked for "logging purposes", appears to be dead code from a copy-paste. The variables pg_clipfrac and pg_clipfrac_lower are calculated here but are immediately overwritten by a new calculation later in the function. Other variables defined in this block are not used outside of it. This block should be removed to improve clarity and avoid unnecessary computations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants