[trainer] fix: normalize sft loss by num_tokens in global batch #3994

Merged
wuxibin89 merged 4 commits into main from wuxibin/fix_model_engine_loss on Nov 3, 2025
Conversation

@wuxibin89 (Collaborator)
What does this PR do?

Normalize sft loss by num_tokens in global batch.

TODO

- [ ] Normalize `pg_loss` and `value_loss` in rl trainer pipeline.

Credit to @techkang in #3729.

@gemini-code-assist (bot) left a comment
Code Review

This pull request aims to normalize the SFT loss by the number of tokens in the global batch. While the intention is correct, I've identified a few critical issues in the implementation that will lead to incorrect loss calculation. Firstly, the logic to handle sequence boundaries when rolling the loss_mask in sft_loss has been removed, causing incorrect loss terms at sequence boundaries. Secondly, the calculation of batch_num_tokens in both FSDP and Megatron implementations is flawed because it doesn't account for the tokens that are subsequently masked out in the loss function. This results in an incorrect normalization factor. I have provided specific suggestions to address these critical bugs.
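The normalization issue the review describes can be illustrated with a minimal sketch in plain Python (mirroring the torch logic; the function name and data are hypothetical, not from the PR): the divisor must be the number of *unmasked* tokens in the global batch, not the raw token count, otherwise masked prompt tokens inflate the denominator and shrink the loss.

```python
def token_normalized_loss(per_token_losses, loss_masks):
    """Mean loss over valid tokens in a (toy) global batch.

    per_token_losses / loss_masks: one list per sequence in the batch.
    """
    masked_sum = 0.0
    num_valid_tokens = 0
    for losses, mask in zip(per_token_losses, loss_masks):
        for loss, m in zip(losses, mask):
            masked_sum += loss * m
            num_valid_tokens += m
    # Normalize by valid (unmasked) tokens only, not by total token count.
    return masked_sum / num_valid_tokens

losses = [[1.0, 2.0, 3.0], [4.0, 5.0]]
masks = [[0, 1, 1], [1, 0]]
result = token_normalized_loss(losses, masks)  # (2 + 3 + 4) / 3 = 3.0
```

Dividing by the raw token count (5) here would give 1.8 instead of 3.0, which is the class of bug the review points at.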

loss_mask_flatten = loss_mask.values()

# left-shift the loss mask by one token to align with log_prob
loss_mask_flatten = torch.roll(loss_mask_flatten, shifts=-1, dims=0)
@gemini-code-assist (bot)
critical

The logic to handle sequence boundaries when rolling the loss_mask has been removed. When torch.roll is applied to a flattened tensor of multiple sequences, it incorrectly wraps elements from the beginning of one sequence to the end of the previous one. The original code correctly zeroed out the mask at the last token of each sequence to prevent incorrect loss calculation at these boundaries. This removal will lead to incorrect loss values. Please restore this boundary handling logic.

Suggested change

  loss_mask_flatten = torch.roll(loss_mask_flatten, shifts=-1, dims=0)
+ cu_seqlens = log_prob.offsets()
+ loss_mask_flatten[cu_seqlens[1:] - 1] = 0

@wuxibin89 (Collaborator, Author) replied on Nov 3, 2025:
The last token of each sequence is masked by the first loss_mask value of the next sequence after the left shift.

loss_mask:         [0, 0, 1, 1, 1], [0, 0, 0, 1, 1], [0, 1, 1, 1]
loss_mask_flatten: [0, 1, 1, 1, 0], [0, 0, 1, 1, 0], [1, 1, 1, 0]

And the number of valid tokens is unchanged by the left shift.
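The author's argument can be checked with a small sketch in plain Python (standing in for `torch.roll(x, shifts=-1, dims=0)` on a flattened mask): assuming each sequence's first position is masked, as it is for SFT prompts, the value that wraps onto the previous sequence's last position is always 0, and the valid-token count is preserved.

```python
def roll_left(xs):
    """Equivalent of torch.roll(x, shifts=-1, dims=0) for a Python list:
    shift every element left by one, wrapping the first element to the end."""
    return xs[1:] + xs[:1]

# The masks from the comment above; each sequence starts with a masked (0) prompt token.
masks = [[0, 0, 1, 1, 1], [0, 0, 0, 1, 1], [0, 1, 1, 1]]
flat = [m for seq in masks for m in seq]
shifted = roll_left(flat)

# The wrapped-in values at sequence boundaries are the 0s from the next
# sequence's prompt, so no explicit boundary zeroing is needed, and the
# number of valid tokens (sum of the mask) stays at 8.
```

This is the reason the `cu_seqlens` boundary fix suggested by the bot is unnecessary under the stated assumption.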

global_bsz = data["global_batch_size"]
loss_scale_factor = local_micro_bsz / (global_bsz / self.get_data_parallel_size())
loss = loss * loss_scale_factor
loss = loss * data["num_micro_batch"] / mpu.get_context_parallel_world_size()
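The scale factors above can be sanity-checked arithmetically. A hedged sketch in plain Python, under the assumption that the backend averages the returned loss over micro-batches and then over data-parallel ranks (context parallelism omitted, i.e. cp_size = 1; all numbers are made up):

```python
dp_size = 2
num_micro_batch = 2
local_micro_bsz = 3
global_bsz = dp_size * num_micro_batch * local_micro_bsz  # 12

# Fake per-sample losses, grouped as [dp_rank][micro_batch][sample].
sample_losses = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
global_mean = sum(x for r in sample_losses for mb in r for x in mb) / global_bsz

recovered = 0.0
for rank in sample_losses:
    rank_loss = 0.0
    for mb in rank:
        loss = sum(mb) / local_micro_bsz  # per-micro-batch mean
        # The two scale factors from the snippet above:
        loss *= local_micro_bsz / (global_bsz / dp_size)
        loss *= num_micro_batch  # cancels the backend's 1/num_micro_batch
        rank_loss += loss
    rank_loss /= num_micro_batch  # backend averages over micro-batches
    recovered += rank_loss
recovered /= dp_size  # backend averages over data-parallel ranks
# recovered equals global_mean: the scaling turns local micro-batch means
# into a correct mean over the whole global batch.
```

Working through one rank: micro-batch means 2 and 5 become 2.0 and 5.0 after scaling, averaging to 3.5; the other rank yields 9.5; the data-parallel average is 6.5, exactly the global mean 78 / 12.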
A collaborator commented:
We should add a note: the sft_loss is used in the FSDP backend (which handles scaling automatically), while it requires manual scaling in the Megatron backend.

@wuxibin89 (Collaborator, Author):
Added a note in verl/workers/roles/utils/losses.py.

wuxibin89 merged commit b49178f into main on Nov 3, 2025
81 of 84 checks passed
wuxibin89 deleted the wuxibin/fix_model_engine_loss branch on November 3, 2025 at 15:58
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…-project#3994)

### What does this PR do?

Normalize sft loss by num_tokens in global batch.

TODO
- [ ] Normalize `pg_loss` and `value_loss` in rl trainer pipeline.

Credit to @techkang in verl-project#3729.
chenhaiq pushed a commit to The-Hierophant/verl-1 that referenced this pull request Nov 18, 2025
wuwendyy pushed a commit to wuwendyy/verl that referenced this pull request Nov 19, 2025
albertimff pushed a commit to albertimff/verl that referenced this pull request Dec 1, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026