TransformerEngine : Add test with FSDP (and updates to ddp_wrapper in test_ddp.py) #142

Merged
merged 10 commits into from
Apr 25, 2024

@kshitij12345 (Collaborator) commented Apr 8, 2024

This PR adds a test that uses the TE (TransformerEngine) executor with FSDP and verifies its results against Eager + TE. It also updates ddp_wrapper so that tests can be wrapped with pytest fixtures other than bucket_size_in_mb; previously, adding a different pytest fixture raised an error.
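The ddp_wrapper code itself is not shown in this thread, so the following is only an illustrative sketch of the general idea: a test wrapper that forwards whatever keyword arguments pytest supplies instead of hard-coding a single fixture name. The names forward_fixtures, fake_distributed_test, and WORLD_SIZE are hypothetical and do not come from test_ddp.py, and the per-rank loop merely stands in for spawning real processes.

```python
import functools
import inspect

WORLD_SIZE = 2  # assumed number of ranks, for illustration only


def forward_fixtures(test_fn):
    """Hypothetical wrapper: passes every fixture through to each rank."""

    @functools.wraps(test_fn)
    def wrapper(**fixture_kwargs):
        # A real wrapper would launch one process per rank; a plain loop
        # is used here so the sketch runs without GPUs or torch.
        return [test_fn(rank=rank, **fixture_kwargs) for rank in range(WORLD_SIZE)]

    # Advertise the test's own parameters (minus `rank`) so a framework that
    # inspects the signature can see which fixtures are requested.
    params = [
        p for name, p in inspect.signature(test_fn).parameters.items()
        if name != "rank"
    ]
    wrapper.__signature__ = inspect.Signature(params)
    return wrapper


@forward_fixtures
def fake_distributed_test(rank, bucket_size_in_mb, executor):
    # Stand-in test body: just echo what each rank received.
    return (rank, bucket_size_in_mb, executor)


results = fake_distributed_test(bucket_size_in_mb=25, executor="TE")
```

Because the wrapper accepts arbitrary keyword arguments, adding a second fixture (here the made-up executor) needs no change to the wrapper itself.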

The description of PR #80 details how TE automatically takes care of syncing FP8 metadata in a distributed setting.

I have also verified it on a larger model using the available benchmarking script.
Benchmark command:

torchrun --nproc-per-node=2 thunder/benchmarks/benchmark_litgpt.py --compile thunder+nvfuser+transformerengine+cudnn --n_layers=10 --distributed_mode=fsdp

The numbers below were measured on RTX 6000.

Without TE

iter 41: loss 4.6562, iter time: 3180.77ms, t: 4096
iter 42: loss 4.6250, iter time: 3202.35ms, t: 4096
iter 43: loss 4.6562, iter time: 3172.88ms, t: 4096
iter 44: loss 4.6562, iter time: 3181.55ms, t: 4096
Model name: Llama-2-7b-hf
Seq Length: 4096
Micro BS: 1
Global BS: 2
Number of Layers: 10
Number of parameters: 1.14B
Distributed Mode: fsdp
Sharding Mode: zero2
Bucketing: none
Compiler: thunder+nvfuser+cudnn
Average iter time: 3187.63 ms
Memory used: 30.56 GB
Tokens/s: 2570.17
Tokens/s/GPU: 1285.09
TFLOP/s: 38.40

With TE

iter 42: loss 4.6562, iter time: 3025.66ms, t: 4096
iter 43: loss 4.6562, iter time: 3030.40ms, t: 4096
iter 44: loss 4.6562, iter time: 3018.83ms, t: 4096
Model name: Llama-2-7b-hf
Seq Length: 4096
Micro BS: 1
Global BS: 2
Number of Layers: 10
Number of parameters: 1.14B
Distributed Mode: fsdp
Sharding Mode: zero2
Bucketing: none
Compiler: thunder+nvfuser+transformerengine+cudnn
Average iter time: 3024.72 ms
Memory used: 37.25 GB
Tokens/s: 2708.12
Tokens/s/GPU: 1354.06
TFLOP/s: 40.47
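To make the comparison between the two runs easier to read, here is a small sketch that computes the relative change of each summary metric. The numbers are copied from the two summaries above; the dictionary keys and the relative_change helper are names made up for this illustration.

```python
# Summary values copied from the "Without TE" and "With TE" runs above.
without_te = {"iter_ms": 3187.63, "mem_gb": 30.56, "tokens_per_s": 2570.17, "tflops": 38.40}
with_te = {"iter_ms": 3024.72, "mem_gb": 37.25, "tokens_per_s": 2708.12, "tflops": 40.47}


def relative_change(before, after):
    """Percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


changes = {name: relative_change(without_te[name], with_te[name]) for name in without_te}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```

On these numbers, the TE run is about 5% faster per iteration (with a matching gain in tokens/s and TFLOP/s) and uses about 22% more memory.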

@kshitij12345 kshitij12345 changed the title from "TransformerEngine : Add test with FSDP (and update to ddp_wrapper in test_ddp.py)" to "TransformerEngine : Add test with FSDP (and updates to ddp_wrapper in test_ddp.py)" Apr 8, 2024
@kshitij12345 kshitij12345 marked this pull request as ready for review April 9, 2024 08:24
@kshitij12345 kshitij12345 enabled auto-merge (squash) April 25, 2024 06:48
@kshitij12345
Collaborator Author

@carmocca please have a look and stamp the PR, thanks!

kshitij12345 and others added 2 commits April 25, 2024 16:42
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
@kshitij12345 kshitij12345 merged commit 279380e into Lightning-AI:main Apr 25, 2024
39 checks passed
@kshitij12345 kshitij12345 deleted the te_fsdp_test branch April 25, 2024 16:10
4 participants