Use the throughput utility for benchmarking#21

Merged
t-vi merged 1 commit into Lightning-AI:main from carmocca:carmocca/update-benchmark
Mar 21, 2024
Conversation

@carmocca
Contributor

What does this PR do?

Integrates the throughput measurement utilities that we already have. I didn't expose any other values such as batches/sec or samples/sec, and instead kept the generated artifacts the same.
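For reference, the throughput columns in the tables below are simple functions of the average iteration time. A minimal sketch of that arithmetic (the helper names here are illustrative, not the utility's actual API):

```python
# Sketch of how the reported throughput metrics relate to the average
# iteration time. Helper names are mine, for illustration only.

def tokens_per_sec(iter_time_ms: float, global_bs: int, seq_len: int) -> float:
    """Tokens processed per second across all devices."""
    return global_bs * seq_len / (iter_time_ms / 1000.0)

def tokens_per_sec_per_gpu(
    iter_time_ms: float, global_bs: int, seq_len: int, num_gpus: int
) -> float:
    """Normalized (per-device) throughput."""
    return tokens_per_sec(iter_time_ms, global_bs, seq_len) / num_gpus

# Example: the 1-GPU eager row (iter time 228.394457 ms, global BS 1,
# seq len 2048) gives ~8966.94 tokens/s, matching the THROUGHPUT table.
```

This is why the 1-GPU rows have identical THROUGHPUT and NORMALIZED THROUGHPUT values, while the 4-GPU rows differ by a factor of 4.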

Below is a sample output for a smaller model so that you can convince yourself that these changes produce the same numbers.

Output before this PR

BENCHMARK_OUT_FORMAT=print CUDA_VISIBLE_DEVICES=4,5,6,7 pytest examples/lit-gpt/test_parametrized.py -v -s

AVERAGE ITERATION TIME (ms)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size       eager    inductor
0             phi-2         1     2048         1          1   1                       none          none  228.394457  223.402629
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  254.056582  243.191038

THROUGHPUT (tokens/s)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size         eager      inductor
0             phi-2         1     2048         1          1   1                       none          none   8966.942670   9167.304818
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  32244.785525  33685.451807

NORMALIZED THROUGHPUT (tokens/s/GPU)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size        eager     inductor
0             phi-2         1     2048         1          1   1                       none          none  8966.942670  9167.304818
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  8061.196381  8421.362952

MEMORY ALLOCATED (GB)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size      eager   inductor
0             phi-2         1     2048         1          1   1                       none          none  23.328389  24.670566
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  16.335063  17.677240

torchrun --nproc_per_node=4 --nnodes=1 /home/carlos/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --return_metrics_as_json=True --json_path=/tmp/benchmark_litgpt_data.json --distributed_mode=fsdp --shard_mode=zero2 --model_name=phi-2 --micro_batch_size=1 --compile=eager --nsys_enabled=False --dynamic=False

Model name: phi-2
Seq Length: 2048
Micro BS: 1
Global BS: 4
Number of Layers: 32
Number of parameters: 0.69B
Distributed Mode: fsdp
Sharding Mode: zero2
Sharding Size: None
Bucketing: none
Compiler: eager
Average iter time: 254.59 ms
Memory used: 16.34 GB
Throughput (Tokens/s): 32177.28 tokens/s
Normalized Throughput (Tokens/s/GPU): 8044.32 tokens/s/gpu

Output after this PR

AVERAGE ITERATION TIME (ms)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size       eager   inductor
0             phi-2         1     2048         1          1   1                       none          none  228.613844  223.16100
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  253.434114  242.95623

THROUGHPUT (tokens/s)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size         eager      inductor
0             phi-2         1     2048         1          1   1                       none          none   8958.255246   9176.770569
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  32323.473907  33726.136855

NORMALIZED THROUGHPUT (tokens/s/GPU)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size        eager     inductor
0             phi-2         1     2048         1          1   1                       none          none  8958.255246  9176.770569
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  8080.868477  8431.534214

MEMORY ALLOCATED (GB)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size      eager   inductor
0             phi-2         1     2048         1          1   1                       none          none  23.328389  24.670566
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  16.335063  17.677240
Model name: phi-2
Seq Length: 2048
Micro BS: 1
Global BS: 4
Number of Layers: 32
Number of parameters: 0.69B
Distributed Mode: fsdp
Sharding Mode: zero2
Sharding Size: None
Bucketing: none
Compiler: eager
Average iter time: 254.84 ms
Memory used: 16.34 GB
Tokens/s: 32144.11
Tokens/s/GPU: 8036.03
TFLOP/s: 575.35

@carmocca carmocca requested review from lantiga and t-vi as code owners March 20, 2024 23:16
@carmocca carmocca self-assigned this Mar 20, 2024
@t-vi t-vi merged commit 348597f into Lightning-AI:main Mar 21, 2024
@t-vi
Collaborator

t-vi commented Mar 21, 2024

Thank you @carmocca
