Grad norm inconsistency when DDP / HSDP is applied before and after #2206 #2318

@wwwjn

Bug description

In short, with the current version of torchtitan, the grad norms reported under DDP/HSDP differ from those reported under FSDP by a factor of 8.
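
For context on what a constant factor in the grad norm implies, here is a minimal sketch in plain Python (no torchtitan or PyTorch code involved). It only illustrates that the global L2 norm scales linearly with the gradients, so a uniform scaling of every gradient by `k` shows up as the same factor `k` in the norm. The gradient values, the helper `global_grad_norm`, and the choice `k = 8` are hypothetical; the sketch does not identify where such a scaling would come from in torchtitan.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient entries, flattened across parameters."""
    return math.sqrt(sum(g * g for param in grads for g in param))


# Hypothetical per-parameter gradients (illustrative values only).
grads = [[0.5, -1.0, 2.0], [0.25, 4.0]]

# Assumption: one code path ends up with every gradient scaled by a constant k.
k = 8
scaled = [[k * g for g in param] for param in grads]

# The ratio of the two norms recovers the scaling constant.
ratio = global_grad_norm(scaled) / global_grad_norm(grads)
print(ratio)
```

Under that assumption, comparing the ratio of the two reported norms against candidate constants (for example, a data-parallel degree) may help narrow down where the scaling is introduced.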

Versions

latest torchtitan
