TE executor: DDP support (PR2408) #80
Conversation
Blocked on #81
Perfect!
Ping @t-vi for merging.
The minor nit about environment variables aside, I am concerned about this approach long-term. Wouldn't we also have to do something special for the FSDP transform, then? And couldn't any other arbitrary pass reorder things in such a way that it breaks this implicit requirement?
Perhaps I am misunderstanding something; apologies if so!
If TE is doing communication implicitly and that communication can cause issues, I think we need to have real dialogue with the TE team on this. I wouldn't be surprised if the answer is that we need to take over TE's communication here. Two systems scheduling comms is going to lead us towards a deadlock-prone system.
For now, #74 makes sure that we don't have additional transforms after this. But I agree that this approach is quite fragile and ideally we should have the ability to inform TE when to sync.
We have started communicating with the TE team with a similar request. I will loop you into that conversation so that you can elaborate there in case we missed anything. Thanks for having a look.
LGTM and thank you!
@carmocca I think this is ready to merge. The CI failures look unrelated.
Once we get past the flakes, automerge will do its thing 🙌
Whenever the world size is greater than 1, the first TE module in the forward pass (i.e. the last to execute in the backward pass) takes care of syncing the fp8 meta-data state across all processes before the next iteration begins.
During forward, TE sets up the fp8 meta-data reduction for each TE module if `world_size > 1`:
https://github.com/NVIDIA/TransformerEngine/blob/a38b291b0d1b04847e8ab1df8550df642a03a27d/transformer_engine/pytorch/module/base.py#L552-L564
During the backward of the first TE module in the forward pass (i.e. the last in the backward pass), TE actually syncs the FP8 meta-data, and this is synchronous/blocking by default. (It essentially does a `torch.cat` of all fp8 state, a reduction, and a `torch.split` of the reduced state back into the individual buffers.)
https://github.com/NVIDIA/TransformerEngine/blob/8255f87f3ee8076db21777795ce15b6ddf8754c0/transformer_engine/pytorch/module/base.py#L98-L100
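To make the cat → reduce → split pattern concrete, here is a minimal sketch of that reduction. The `sync_fp8_meta` helper, the `amax_buffers` list, and the choice of a max reduction are assumptions for illustration; this is not TE's actual implementation (see the linked `base.py` for that).

```python
import torch
import torch.distributed as dist

def sync_fp8_meta(amax_buffers: list, group=None) -> None:
    # Flatten all per-module fp8 amax buffers into one tensor so a single
    # collective covers every TE module in the model.
    flat = torch.cat([buf.view(-1) for buf in amax_buffers])
    # Blocking all-reduce across the data-parallel ranks (max reduction is an
    # assumption here); this call does not return until the reduction is done.
    dist.all_reduce(flat, op=dist.ReduceOp.MAX, group=group)
    # Split the reduced tensor and copy each chunk back into its original buffer.
    chunks = torch.split(flat, [buf.numel() for buf in amax_buffers])
    for buf, chunk in zip(amax_buffers, chunks):
        buf.copy_(chunk.view_as(buf))
```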
This means there are constraints on re-ordering `te_linear`; see `NOTE: TransformerEngine Distributed Ordering Constraint` in `torch_autograd.py`. Thanks @crcrpar for pointing me towards this.
This PR adds a DDP test comparing `thunder + TE executor` against `PyTorch Eager + TE`.

Benchmark numbers on a real model:
Running `examples/llama2.c/train.py` on 2 Ada RTX 6000 GPUs.

Cmd: `torchrun --nproc-per-node=2 train.py`
Without TE
With TE
Patch for Benchmark
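For context, here is a minimal sketch of how a DDP run combining thunder with the TE executor can be wired up and launched via `torchrun` as above. The toy model, the sizes, and in particular the executor import path and the `thunder.jit(..., executors=...)` call are assumptions for illustration, not verbatim from the test or benchmark patch in this PR.

```python
# Hedged sketch of a torchrun-launched DDP step with thunder + the TE executor.
# The executor import path and the thunder.jit signature are assumptions.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

import thunder
from thunder.executors.transformer_engineex import transformer_engine_ex  # assumed path

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model whose linear layers the TE executor can claim and run in fp8.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096, bias=False),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096, bias=False),
    ).cuda()
    ddp_model = DDP(model, device_ids=[local_rank])

    jit_model = thunder.jit(ddp_model, executors=[transformer_engine_ex])

    x = torch.randn(16, 4096, device="cuda")
    out = jit_model(x)
    # The backward of the last-executed TE module also performs the blocking
    # fp8 meta-data sync described above.
    out.sum().backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc-per-node=2 <script>.py`, matching the command used for the benchmark above.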