Fix Double Normalization Bug in GRPO Training

Summary

This PR fixes a critical bug in the GRPO training script where advantage values were being normalized twice at different granularities, leading to biased gradient estimates and potentially degraded training performance.

Problem Description

Issue 1: Double Normalization

The training pipeline performs normalization twice:

  1. First normalization (Actor-side, Episode-level) - Lines 692-696:

    • Normalizes rewards at the episode level within each group
    • For critic_type2="grpo": (reward - group_mean) / group_std
    • For critic_type2="drgrpo": (reward - group_mean)
    • For critic_type2="rloo": Uses leave-one-out mean subtraction
  2. Second normalization (Learner-side, Transition-level) - Lines 1005-1018:

    • When norm_return=True, normalizes again using global statistics across all transitions
    • (advantages - global_mean) / global_std

This double normalization is redundant and mathematically incorrect.
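For reference, a minimal sketch of the two passes, assuming NumPy; the function names compute_group_advantages and normalize_transition_advantages are illustrative, not the script's actual names:

import numpy as np

def compute_group_advantages(group_rewards, critic_type2="grpo", eps=1e-8):
    # First normalization: one scalar per episode, computed within its group.
    r = np.asarray(group_rewards, dtype=np.float64)
    if critic_type2 == "grpo":
        return (r - r.mean()) / (r.std() + eps)
    if critic_type2 == "drgrpo":
        return r - r.mean()
    if critic_type2 == "rloo":
        # Leave-one-out baseline: subtract the mean of the other episodes.
        return r - (r.sum() - r) / (len(r) - 1)
    raise ValueError(f"unknown critic_type2: {critic_type2}")

def normalize_transition_advantages(advantages, eps=1e-8):
    # Second normalization (norm_return=True): global statistics over all
    # transitions in the learner batch, applied on top of the first pass.
    a = np.asarray(advantages, dtype=np.float64)
    return (a - a.mean()) / (a.std() + eps)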

Issue 2: Subsampling Between Normalizations

Between the two normalizations, there is a random subsampling step (Lines 233-240):

if len(all_transitions) > self.args.rollout_batch_size_per_device:
    subsample_indices = np.random.choice(...)
    all_transitions = [all_transitions[si] for si in subsample_indices]

Problem: The second normalization operates on a biased subset of the originally normalized data:

  • Subsampling may disproportionately remove transitions from certain groups
  • The global statistics computed on the subsampled data do not match the original distribution
  • This introduces sampling bias into the advantage estimates
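A small, self-contained illustration of this bias (the group rewards below are made up, not values from an actual run): each group is normalized to mean 0 and std 1, so the full batch is approximately standardized, but a random subset generally is not.

import numpy as np

rng = np.random.default_rng(0)

def group_normalize(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Episode-level advantages after the first (group-level) normalization.
groups = [[4.0, 2.0], [3.0, 1.0, 5.0, 2.0], [0.0, 1.0]]
all_advantages = np.concatenate([group_normalize(g) for g in groups])

# Random subsample, analogous to capping at rollout_batch_size_per_device.
keep = rng.choice(len(all_advantages), size=5, replace=False)
subsampled = all_advantages[keep]

print(all_advantages.mean(), all_advantages.std())  # ~0.0 and ~1.0 by construction
print(subsampled.mean(), subsampled.std())          # generally shifted away from (0, 1)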

Issue 3: Granularity Mismatch

The two normalizations operate at different granularities:

  1. First normalization: Episode-level

    • Each episode (regardless of length) contributes one sample to statistics
    • Example: group_rewards = [4.0, 2.0] → normalized = [+1.0, -1.0]
  2. Second normalization: Transition/Step-level

    • Each step/transition contributes one sample to statistics
    • Example: If episodes have different lengths (2 steps vs 10 steps), longer episodes dominate the statistics

Consequence: Longer episodes become over-represented in the second normalization, biasing the mean and standard deviation toward these episodes and breaking the episode-level comparison that GRPO/RLOO/Dr.GRPO rely on.
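A numerical sketch of this mismatch (episode lengths and rewards are made up): two episodes in one group receive advantages +1 and -1, but once those values are broadcast to transitions, the longer episode dominates the transition-level statistics.

import numpy as np

# One group of two episodes, already normalized at the episode level.
episode_advantages = np.array([+1.0, -1.0])
episode_lengths = [2, 10]  # the second episode is five times longer

# Each transition inherits its episode's advantage.
transition_advantages = np.concatenate(
    [np.full(n, adv) for adv, n in zip(episode_advantages, episode_lengths)]
)

# Transition-level statistics are dominated by the 10-step episode.
mean = transition_advantages.mean()  # (2*1 + 10*(-1)) / 12 ≈ -0.67, not 0
std = transition_advantages.std()    # ≈ 0.75
renormalized = (transition_advantages - mean) / (std + 1e-8)
# Short episode's steps land at ≈ +2.24, long episode's at ≈ -0.45:
# the intended symmetric +1 / -1 episode contrast is destroyed.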

Root Cause

The norm_return=True default was incorrectly carried over from standard PPO settings. However, in the GRPO/Dr.GRPO/RLOO setting:

  • Normalization is already performed at the episode level during advantage computation
  • The learner-side normalization is redundant and harmful
  • This is inconsistent with the OAT library's original PPO implementation, which does not include this learner-side norm_return logic

Solution

Set norm_return=False by default (Line 114) to disable the second normalization in the learner.

Changes

  • File: examples/train_oat/train_oat_grpo.py
  • Line 114: Changed norm_return: bool = True → norm_return: bool = False
  • Added a comment clarifying why this should be False
# Before
norm_return: bool = True

# After  
norm_return: bool = False  # Should be False to avoid double normalization
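
For context, a hedged sketch of how such a flag typically gates the learner-side pass (variable names are assumptions, not the exact code at lines 1005-1018); with norm_return=False, the episode-level advantages from the actor are used as-is:

if args.norm_return:
    # Second, transition-level normalization (now disabled by default).
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)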
