Fix Double Normalization Bug in GRPO Training

Summary

This PR fixes a critical bug in the GRPO training script where advantage values were being normalized twice at different granularities, leading to biased gradient estimates and potentially degraded training performance.

Problem Description

Issue 1: Double Normalization

The training pipeline performs normalization twice:

  1. First normalization (Actor-side, Episode-level) - Lines 692-696:

    • Normalizes rewards at the episode level within each group
    • For critic_type2="grpo": (reward - group_mean) / group_std
    • For critic_type2="drgrpo": (reward - group_mean)
    • For critic_type2="rloo": Uses leave-one-out mean subtraction
  2. Second normalization (Learner-side, Transition-level) - Lines 1005-1018:

    • When norm_return=True, normalizes again using global statistics across all transitions
    • (advantages - global_mean) / global_std

This double normalization is redundant and mathematically incorrect.
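For reference, a minimal sketch of the two passes, assuming NumPy; the function names compute_group_advantages and normalize_transition_advantages are illustrative, not the script's actual names:

import numpy as np

def compute_group_advantages(group_rewards, critic_type2="grpo", eps=1e-8):
    # First normalization: one scalar per episode, computed within its group.
    r = np.asarray(group_rewards, dtype=np.float64)
    if critic_type2 == "grpo":
        return (r - r.mean()) / (r.std() + eps)
    if critic_type2 == "drgrpo":
        return r - r.mean()
    if critic_type2 == "rloo":
        # Leave-one-out baseline: subtract the mean of the other episodes.
        return r - (r.sum() - r) / (len(r) - 1)
    raise ValueError(f"unknown critic_type2: {critic_type2}")

def normalize_transition_advantages(advantages, eps=1e-8):
    # Second normalization (norm_return=True): global statistics over all
    # transitions in the learner batch, applied on top of the first pass.
    a = np.asarray(advantages, dtype=np.float64)
    return (a - a.mean()) / (a.std() + eps)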

Issue 2: Subsampling Between Normalizations

Between the two normalizations, there is a random subsampling step (Lines 233-240):

if len(all_transitions) > self.args.rollout_batch_size_per_device:
    subsample_indices = np.random.choice(...)
    all_transitions = [all_transitions[si] for si in subsample_indices]

Problem: The second normalization operates on a biased subset of the originally normalized data:

  • Subsampling may disproportionately remove transitions from certain groups
  • The global statistics computed on the subsampled data do not match the original distribution
  • This introduces sampling bias into the advantage estimates
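A small, self-contained illustration of this bias (the group rewards below are made up, not values from an actual run): each group is normalized to mean 0 and std 1, so the full batch is approximately standardized, but a random subset generally is not.

import numpy as np

rng = np.random.default_rng(0)

def group_normalize(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Episode-level advantages after the first (group-level) normalization.
groups = [[4.0, 2.0], [3.0, 1.0, 5.0, 2.0], [0.0, 1.0]]
all_advantages = np.concatenate([group_normalize(g) for g in groups])

# Random subsample, analogous to capping at rollout_batch_size_per_device.
keep = rng.choice(len(all_advantages), size=5, replace=False)
subsampled = all_advantages[keep]

print(all_advantages.mean(), all_advantages.std())  # ~0.0 and ~1.0 by construction
print(subsampled.mean(), subsampled.std())          # generally shifted away from (0, 1)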

Issue 3: Granularity Mismatch

The two normalizations operate at different granularities:

  1. First normalization: Episode-level

    • Each episode (regardless of length) contributes one sample to statistics
    • Example: group_rewards = [4.0, 2.0] → normalized = [+1.0, -1.0]
  2. Second normalization: Transition/Step-level

    • Each step/transition contributes one sample to statistics
    • Example: If episodes have different lengths (2 steps vs 10 steps), longer episodes dominate the statistics

Consequence: Longer episodes become over-represented in the second normalization, biasing the mean and standard deviation toward these episodes and breaking the episode-level comparison that GRPO/RLOO/Dr.GRPO rely on.
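A numerical sketch of this mismatch (episode lengths and rewards are made up): two episodes in one group receive advantages +1 and -1, but once those values are broadcast to transitions, the longer episode dominates the transition-level statistics.

import numpy as np

# One group of two episodes, already normalized at the episode level.
episode_advantages = np.array([+1.0, -1.0])
episode_lengths = [2, 10]  # the second episode is five times longer

# Each transition inherits its episode's advantage.
transition_advantages = np.concatenate(
    [np.full(n, adv) for adv, n in zip(episode_advantages, episode_lengths)]
)

# Transition-level statistics are dominated by the 10-step episode.
mean = transition_advantages.mean()  # (2*1 + 10*(-1)) / 12 ≈ -0.67, not 0
std = transition_advantages.std()    # ≈ 0.75
renormalized = (transition_advantages - mean) / (std + 1e-8)
# Short episode's steps land at ≈ +2.24, long episode's at ≈ -0.45:
# the intended symmetric +1 / -1 episode contrast is destroyed.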

Root Cause

The norm_return=True default was incorrectly carried over from standard PPO settings. However, in the GRPO/Dr.GRPO/RLOO setting:

  • Normalization is already performed at the episode level during advantage computation
  • The learner-side normalization is redundant and harmful
  • This is inconsistent with the OAT library's original PPO implementation, which does not include this learner-side norm_return logic

Solution

Set norm_return=False by default (Line 114) to disable the second normalization in the learner.

Changes

  • File: examples/train_oat/train_oat_grpo.py
  • Line 114: Changed norm_return: bool = True → norm_return: bool = False
  • Added a comment clarifying why this should be False
# Before
norm_return: bool = True

# After  
norm_return: bool = False  # Should be False to avoid double normalization
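
For context, a hedged sketch of how such a flag typically gates the learner-side pass (variable names are assumptions, not the exact code at lines 1005-1018); with norm_return=False, the episode-level advantages from the actor are used as-is:

if args.norm_return:
    # Second, transition-level normalization (now disabled by default).
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)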
