Disable DDP averaging to avoid repeated gradient averaging #2323
Conversation
```python
    is handled manually in the training loop (e.g., dividing by global token count).
    """
    return (
        dist.all_reduce(bucket.buffer(), group=process_group, async_op=True)
```
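For context, the snippet above appears to be the body of the sum-only comm hook under review. A minimal self-contained sketch of such a hook is below; the name `allreduce_sum_hook` and the exact docstring are illustrative, not necessarily the PR's code. It follows the standard DDP comm-hook contract of returning a future that resolves to the reduced bucket tensor.

```python
import torch
import torch.distributed as dist


def allreduce_sum_hook(
    process_group: dist.ProcessGroup, bucket: dist.GradBucket
) -> torch.futures.Future[torch.Tensor]:
    """All-reduce gradients with SUM instead of DDP's default averaging.

    The divide-by-DP-degree step is skipped because the loss scaling
    is handled manually in the training loop (e.g., dividing by global token count).
    """
    return (
        dist.all_reduce(bucket.buffer(), group=process_group, async_op=True)
        .get_future()
        .then(lambda fut: fut.value()[0])
    )
```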
Thanks, have you tried using `_dist_reduce`?
Can you attach more test results, e.g.:
- Does HSDP + TP work?
- DDP only vs FSDP vs HSDP grad norm curve?
I have updated the test plan showing all the cases. Also, since DDP does not use DTensor within torchtitan and cannot be combined with TP or any other parallelism (this is restricted by design), I think using all_reduce directly is better here to keep things simple.
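For reference, with a plain DistributedDataParallel wrapper the hook above would be installed roughly as sketched below. This is illustrative only: torchtitan wraps the model with replicate(), whose module class differs, so the exact registration call there may not be identical.

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes `model` is an nn.Module and the default process group is initialized.
model = DDP(model)
model.register_comm_hook(state=dist.group.WORLD, hook=allreduce_sum_hook)
```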
One major update in the subsequent version of this pull request is the addition of no_sync context management in the training loop for DDP. I am not sure what other way there is to manage this, but as noted in the diff summary, it is necessary to ensure correct computations when performing gradient accumulation.
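A sketch of the gradient-accumulation pattern this describes is below, assuming the sum-only comm hook is installed: DDP's all-reduce is suppressed with no_sync() for every micro-batch except the last, so gradients are summed across DP ranks exactly once per optimizer step. Helper names such as `compute_loss` and the loop structure are placeholders, not torchtitan's actual code.

```python
import contextlib


def train_step(model, optimizer, microbatches, global_token_count):
    optimizer.zero_grad()
    num_microbatches = len(microbatches)
    for i, batch in enumerate(microbatches):
        is_last = i == num_microbatches - 1
        # no_sync() suppresses the all-reduce launched in backward; only the
        # final micro-batch triggers the (sum) all-reduce across DP ranks.
        ctx = contextlib.nullcontext() if is_last else model.no_sync()
        with ctx:
            # Loss is scaled by the global token count, so no further averaging
            # over DP ranks should happen in the all-reduce.
            loss = compute_loss(model, batch) / global_token_count
            loss.backward()
    optimizer.step()
```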
This has been done in PP: https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/stage.py#L639. Instead of adding the logic to TorchTitan, we should investigate why PP's logic doesn't work. My best guess is that we are using replicate, so the module type is not DistributedDataParallel but replicate (or some other module class). cc @H-Huang
tianyu-l left a comment
> This has been done in PP

Wait, why does it have anything to do with PP?
The proper fix is to support set_gradient_divide_factor in replicate(), no?
cc @anshul-si @wwwjn
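For comparison, a sketch of the FSDP2 knob referenced here is below: fully_shard-wrapped modules expose set_gradient_divide_factor, which is how the implicit division was disabled for FSDP in D91432940, while replicate() has no equivalent yet. The import path and setup are assumptions for illustration, not torchtitan's exact code.

```python
from torch.distributed.fsdp import fully_shard

# Assumes the device mesh / process group is already set up.
model = fully_shard(model)
model.set_gradient_divide_factor(1.0)  # factor 1.0 leaves the reduction as a plain sum
```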
@tianyu-l My comment was not clear.
Hi everyone, I just want to summarize the diff to help clear up confusion. The main source of change is the switch from local averaging over microbatches to global averaging over tokens in a previous PR. While the implicit default averaging over the individual DP ranks was disabled for FSDP, the DDP averaging in the all-reduce operation was not. As the DDP setup used here is the old primitive that does not use the DTensor backend (as far as I can tell, please correct me if wrong), no set_gradient_divide_factor option is available, so the all-reduce operation has been explicitly changed from averaging to sum. This change brings up another difficulty: the old DDP primitive performs the all-reduce comm operation on every backward pass. This was fine previously, since an average was performed every backward pass, but it is now incorrect because it causes double summation of gradients when using gradient accumulation. To counteract this, the all-reduce has been disabled for all backward passes other than the last one. This is not necessary under FSDP, as it is already implemented there. It is also completely independent of PP, as it is needed to obtain the correct gradients even when PP = 1.
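A toy numeric illustration of the double division described above (dp_degree and the gradient value are made up, not from the test plan):

```python
dp_degree = 8
per_rank_grad = 0.5  # gradient of (local_loss / global_token_count) on each DP rank

# With global-token scaling, the desired combined gradient is the plain SUM over ranks.
desired = per_rank_grad * dp_degree                       # 4.0
# DDP's default AVG all-reduce divides by the DP degree a second time.
ddp_default_avg = per_rank_grad * dp_degree / dp_degree   # 0.5

# The update direction is unchanged, so AdamW hides the error in the loss curve,
# but the reported grad norm is dp_degree times smaller than under FSDP.
assert ddp_default_avg == desired / dp_degree
```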
I misunderstood the code change. I had the impression that the code change was in forward_backward_step, but it is in train_step. So my comment was incorrect; this is independent of PP.
Yes, the ideal fix would be integrating set_gradient_divide_factor into replicate().
Summary:
From the change from averaging over microbatches to averaging over the global tokens, the averaging for FSDP was disabled in D91432940 but not for DDP. This adds an additional scaling to the gradients, dividing them twice by the number of DP ranks when using pure DDP. The error does not show up as a change in loss for short runs due to the scale-invariance property of the AdamW optimizer, but it can be seen clearly in the grad norm measurement and in the difference of that measurement from FSDP.
To fix this, as DDP does not have a set_gradient_divide_factor() property like FSDP, the solution employed is to install a comm hook that replaces the default all-reduce average operation with an all-reduce sum operation. This also requires controlling the syncing performed by DDP, as the all-reduce for DDP is launched every forward-backward pass, which would now cause gradients to be added multiple times during gradient accumulation since no averaging is performed. Thus, a no_sync context has been added to the train loop so that the all-reduce sum in DDP happens only in the final backward pass of a train step.
Differential Revision: D92301896