[benchmark] add option to use torchao's float8 (dynamic scaling) with fsdp2 #997

Merged
9 commits merged into main on Aug 23, 2024

Conversation

@crcrpar (Collaborator) commented on Aug 19, 2024

This adds an option to use torchao's float8 with dynamic scaling.
An example command to use float8 with FSDP2:

torchrun --nproc-per-node 8 --local-ranks-filter 0 --role rank --tee 3 thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-7b-hf --compile inductor --distributed_mode fsdp2 --shard_mode zero2 --use_torchao_fp8_linear true --use_torchao_fp8_allgather true --use_torchao_fp8_precompute_scale_for_fsdp true

Note that --use_torchao_fp8_precompute_scale_for_fsdp true keeps the model in fp32.
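For context, below is a minimal sketch of how these three flags roughly map onto torchao's float8 training API together with FSDP2, assuming torchao around pytorch/ao@a0376bf; it is not the actual benchmark_litgpt.py code, and the model constructor, block attribute, and dataloader names are hypothetical placeholders.

```python
import torch
from torch.distributed._composable.fsdp import fully_shard  # FSDP2
from torchao.float8 import (
    Float8LinearConfig,
    convert_to_float8_training,
    precompute_float8_dynamic_scale_for_fsdp,
)

model = build_model("Llama-2-7b-hf")  # hypothetical helper, not part of this PR

# --use_torchao_fp8_linear: swap nn.Linear modules for float8 linears (dynamic scaling).
# --use_torchao_fp8_allgather: all-gather FSDP2-sharded weights in fp8.
config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
convert_to_float8_training(model, config=config)

# FSDP2 with zero2-style sharding: keep gathered params after forward.
for block in model.blocks:  # hypothetical attribute
    fully_shard(block, reshard_after_forward=False)
fully_shard(model, reshard_after_forward=False)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:  # hypothetical dataloader
    loss = model(batch).mean()  # placeholder loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # --use_torchao_fp8_precompute_scale_for_fsdp: recompute the dynamic fp8 scales
    # for all float8 parameters in one fused pass after each optimizer step.
    precompute_float8_dynamic_scale_for_fsdp(model)
```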

Llama-2-7b-hf on 8x H100, using pjnl-20240821, as of ff66203, with torchao at pytorch/ao@a0376bf.

When the compiler is torch, FSDP2 is used.

| compiler | executors | bs | Tokens/s/GPU | Memory Used (GB) |
| --- | --- | --- | --- | --- |
| torch | fp8: linear, all-gather, & precompute | 1 | 14640.35 | 34.26 |
| torch | fp8: linear, all-gather, & precompute | 2 | 17993.14 | 49.69 |
| torch | fp8: linear & all-gather | 1 | 14587.93 | 34.26 |
| torch | fp8: linear & all-gather | 2 | 17912.73 | 49.70 |
| torch | fp8: linear | 1 | 13897.13 | 40.64 |
| torch | fp8: linear | 2 | 17219.84 | 56.12 |
| torch | thunder_inductor_cat_cudnn_dynamo | 1 | 12506.03 | 40.21 |
| torch | thunder_inductor_cat_cudnn_dynamo | 2 | 13890.66 | 61.82 |
| thunder | inductor_cat_cudnn_transformerengine | 1 | 14274.20 | 52.88 |
| thunder | inductor_cat_cudnn_transformerengine | 2 | 16329.36 | 74.01 |

@crcrpar requested a review from kshitij12345 on August 19, 2024 13:35
@crcrpar force-pushed the crpa/fsdp2-torchao_float8 branch 2 times, most recently from a88b136 to df54b94, on August 21, 2024 05:36
@crcrpar marked this pull request as ready for review on August 21, 2024 05:36
@crcrpar force-pushed the crpa/fsdp2-torchao_float8 branch 4 times, most recently from e01a625 to b487ddf, on August 22, 2024 18:14
@crcrpar force-pushed the crpa/fsdp2-torchao_float8 branch from b487ddf to 3a747cc on August 23, 2024 01:15

@lantiga (Collaborator) left a comment

Stamped!

@IvanYashchuk merged commit bdc1e8a into main on Aug 23, 2024
37 checks passed
@IvanYashchuk deleted the crpa/fsdp2-torchao_float8 branch on August 23, 2024 07:28