stop assuming bwd = 2x fwd #3

@jfc4050

Description

This is a close-enough assumption a lot of the time, but it starts falling apart for cases with high model-parallel comm. For example, an expert MLP's backward has both wgrad and dgrad, but it still only has 2x alltoall, so compute roughly doubles relative to forward while comm does not.
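
A rough sketch of the difference, in case it helps frame the fix. Everything here is hypothetical and not taken from this repo: the function names, the `compute_scale` / `comm_scale` parameters, and the assumption that comm and compute do not overlap. It only compares scaling the whole forward time by 2 against scaling compute and comm separately.

```python
def naive_bwd_time(fwd_compute: float, fwd_comm: float) -> float:
    """Current assumption: backward costs 2x the entire forward time."""
    return 2.0 * (fwd_compute + fwd_comm)


def split_bwd_time(
    fwd_compute: float,
    fwd_comm: float,
    compute_scale: float = 2.0,  # assumed: wgrad + dgrad, about 2x forward compute
    comm_scale: float = 1.0,     # assumed: same number of alltoalls as forward
) -> float:
    """Backward estimate with compute and comm scaled independently.

    Assumes comm is fully exposed (no overlap with compute).
    """
    return compute_scale * fwd_compute + comm_scale * fwd_comm


def bwd_overestimate_ratio(fwd_compute: float, fwd_comm: float) -> float:
    """How much the 2x rule overestimates backward relative to the split model."""
    return naive_bwd_time(fwd_compute, fwd_comm) / split_bwd_time(fwd_compute, fwd_comm)
```

Under these assumptions the two estimates agree when `fwd_comm` is zero, and the gap grows as comm takes up a larger share of the forward pass. For instance, with equal forward compute and comm time, the 2x rule gives 4 units against 3 from the split model.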

Metadata

    Labels

    bug (Something isn't working)
