@finbarrtimbers finbarrtimbers commented Nov 3, 2025

Also adds a test case covering the bug that Hamish pointed out, and fixes it by switching from sample-wise to token-wise normalization. Before:

loss = masked_mean(
    pg_loss_max + (args.beta * kl),
    response_masks_bool,
    args.masked_mean_axis,
    args.masked_mean_denominator,
)
loss = loss / accumulation_steps

Now:

total_loss = pg_loss_max + (args.beta * kl)
loss_sum = (total_loss * response_masks_bool).sum()
denominator = (
    args.masked_mean_denominator if args.masked_mean_denominator is not None else response_masks_bool.sum()
)
loss = loss_sum / denominator
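
To make the difference concrete, here is a minimal, self-contained sketch (not the repo's code; tensor names mirror the snippets above, and the loss values are stand-ins) showing that the two normalizations disagree whenever responses have different lengths:

import torch

torch.manual_seed(0)
per_token_loss = torch.randn(2, 4) ** 2  # stand-in for pg_loss_max + beta * kl
response_masks_bool = torch.tensor([[1, 1, 1, 1],   # 4-token response
                                    [1, 0, 0, 0]],  # 1-token response
                                   dtype=torch.float32)
accumulation_steps = 2

# Old (sample-wise): each microbatch (here, each row at bsz=1) is averaged
# over its own tokens, then gradient accumulation averages the microbatches.
sample_wise = sum(
    (per_token_loss[i] * response_masks_bool[i]).sum() / response_masks_bool[i].sum()
    for i in range(2)
) / accumulation_steps

# New (token-wise): sum over all tokens once, divide by the global token count.
token_wise = (per_token_loss * response_masks_bool).sum() / response_masks_bool.sum()

print(sample_wise.item(), token_wise.item())  # differ: sample-wise upweights tokens in short responses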

Hamish's original description of the bug:

One thing I realised while writing some stuff up on the weekend - I think our loss computation for the RL is actually technically a little wrong.
We do gradient accumulation to simulate higher batch sizes, but in reality we basically always do real bsz=1.
Gradient accumulation means we average the gradients for each item in the batch, so we get something like a sample-wise loss (image 1).
But in practice we should probably do a token-level loss like DAPO (image 2), or sum + divide by a constant like Dr. GRPO?
I think this requires keeping track of the number of tokens in each batch. You swap the loss to take a sum, and then compute the loss as:
loss = (loss * grad_acc_steps * num_gpus) / num_total_tokens
Basically, divide the loss by the total number of tokens in the batch; we multiply by grad_acc_steps to account for the grad-acc averaging, and num_gpus to account for the fact that each GPU is doing this separately and we are averaging across them… if that makes sense.
There's an explanation here.

[Image 1: sample-wise loss formula; Image 2: token-level (DAPO) loss formula]
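
To sanity-check that scaling, a hedged numeric sketch (names illustrative, not from the repo): assume a constant per-token loss of 1.0, so the correct token-level mean is exactly 1.0, and assume the framework averages over accumulation steps and mean-all-reduces across GPUs (as DDP does by default).

grad_acc_steps, num_gpus = 4, 2
token_counts = [[3, 7, 5, 1], [2, 2, 8, 4]]  # tokens per microbatch, per GPU
num_total_tokens = sum(sum(row) for row in token_counts)  # 32

per_gpu_losses = []
for gpu_counts in token_counts:
    accumulated = 0.0
    for n in gpu_counts:
        loss_sum = float(n)  # sum of per-token losses (all 1.0 here)
        scaled = loss_sum * grad_acc_steps * num_gpus / num_total_tokens
        accumulated += scaled / grad_acc_steps  # grad accumulation averages over steps
    per_gpu_losses.append(accumulated)

global_loss = sum(per_gpu_losses) / num_gpus  # all-reduce mean across GPUs
print(global_loss)  # 1.0: the multipliers cancel both averages, recovering the token-level mean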

Note

Extracts GRPO loss computation into a free function, introduces metrics utilities and truncated importance sampling, updates training loop accordingly, and adds focused unit tests.

  • Training/Algo (GRPO):
    • Extracts loss computation to calculate_loss_and_backward(...) and integrates into training loop.
    • Adds maybe_apply_importance_sampling(...) (truncated importance sampling) gated by args.truncated_importance_sampling_ratio_cap (see the sketch after this list).
    • Introduces compare_logprobs(...) to log vLLM vs local logprob diffs and reverse KL.
    • Replaces ad-hoc metrics tracking with LossStatistics (in open_instruct/metrics.py) for KL, clipfrac, policy/total loss, ratio, and optional entropy; updates metric aggregation/return values.
    • Refactors old/vLLM logprobs handling and optimizer stepping condition (local_step % accumulation_steps == 0).
  • New Module:
    • open_instruct/metrics.py: masked_mean(...) and LossStatistics for centralized metric computation.
  • Tests:
    • Adds unit tests for compare_logprobs, maybe_apply_importance_sampling, and calculate_loss_and_backward; updates GRPO fast tests to cover new paths.
  • Docs:
    • Updates CLAUDE.md with a comments policy.
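
For reference, a minimal sketch of the truncated importance sampling idea behind maybe_apply_importance_sampling(...). The actual function in this PR may differ in signature and wiring; treat this as an illustration of the standard technique (capping the learner/vLLM probability ratio), not the repo's implementation:

import torch

def truncated_importance_weight(local_logprobs: torch.Tensor,
                                vllm_logprobs: torch.Tensor,
                                cap: float | None) -> torch.Tensor:
    """Per-token importance weight pi_local / pi_vllm, truncated at `cap`.

    Illustrative only; the PR gates this behavior on
    args.truncated_importance_sampling_ratio_cap.
    """
    ratio = torch.exp(local_logprobs - vllm_logprobs)
    if cap is not None:
        ratio = torch.clamp(ratio, max=cap)
    # Detach so the weight corrects for off-policy sampling without
    # itself receiving gradients.
    return ratio.detach()

# Usage sketch:
# per_token_loss = per_token_loss * truncated_importance_weight(
#     local_logprobs, vllm_logprobs, args.truncated_importance_sampling_ratio_cap)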

Written by Cursor Bugbot for commit adb108b.

@finbarrtimbers changed the title from "Refactors the loss calculation to make it testable." to "Refactors the loss calculation to pull it out into a free function" on Nov 6, 2025
@finbarrtimbers marked this pull request as ready for review on November 7, 2025 21:54