Use the throughput utility for benchmarking#21

Merged
t-vi merged 1 commit into Lightning-AI:main from carmocca:carmocca/update-benchmark
Mar 21, 2024
Conversation

@carmocca
Contributor

What does this PR do?

Integrates the throughput measurement utilities that we already have. I didn't expose any other values such as batches/sec or samples/sec, and instead kept the generated artifacts the same.
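For reference, the throughput columns in the tables below are simple functions of the average iteration time. A minimal sketch of that arithmetic (the helper names here are illustrative, not the utility's actual API):

```python
# Sketch of how the reported throughput metrics relate to the average
# iteration time. Helper names are mine, for illustration only.

def tokens_per_sec(iter_time_ms: float, global_bs: int, seq_len: int) -> float:
    """Tokens processed per second across all devices."""
    return global_bs * seq_len / (iter_time_ms / 1000.0)

def tokens_per_sec_per_gpu(
    iter_time_ms: float, global_bs: int, seq_len: int, num_gpus: int
) -> float:
    """Normalized (per-device) throughput."""
    return tokens_per_sec(iter_time_ms, global_bs, seq_len) / num_gpus

# Example: the 1-GPU eager row (iter time 228.394457 ms, global BS 1,
# seq len 2048) gives ~8966.94 tokens/s, matching the THROUGHPUT table.
```

This is why the 1-GPU rows have identical THROUGHPUT and NORMALIZED THROUGHPUT values, while the 4-GPU rows differ by a factor of 4.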

Below is a sample output for a smaller model so that you can convince yourself that these changes produce the same numbers.

Output before this PR

BENCHMARK_OUT_FORMAT=print CUDA_VISIBLE_DEVICES=4,5,6,7 pytest examples/lit-gpt/test_parametrized.py -v -s

AVERAGE ITERATION TIME (ms)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size       eager    inductor
0             phi-2         1     2048         1          1   1                       none          none  228.394457  223.402629
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  254.056582  243.191038

THROUGHPUT (tokens/s)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size         eager      inductor
0             phi-2         1     2048         1          1   1                       none          none   8966.942670   9167.304818
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  32244.785525  33685.451807

NORMALIZED THROUGHPUT (tokens/s/GPU)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size        eager     inductor
0             phi-2         1     2048         1          1   1                       none          none  8966.942670  9167.304818
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  8061.196381  8421.362952

MEMORY ALLOCATED (GB)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size      eager   inductor
0             phi-2         1     2048         1          1   1                       none          none  23.328389  24.670566
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  16.335063  17.677240

torchrun --nproc_per_node=4 --nnodes=1 /home/carlos/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --return_metrics_as_json=True --json_path=/tmp/benchmark_litgpt_data.json --distributed_mode=fsdp --shard_mode=zero2 --model_name=phi-2 --micro_batch_size=1 --compile=eager --nsys_enabled=False --dynamic=False

Model name: phi-2
Seq Length: 2048
Micro BS: 1
Global BS: 4
Number of Layers: 32
Number of parameters: 0.69B
Distributed Mode: fsdp
Sharding Mode: zero2
Sharding Size: None
Bucketing: none
Compiler: eager
Average iter time: 254.59 ms
Memory used: 16.34 GB
Throughput (Tokens/s): 32177.28 tokens/s
Normalized Throughput (Tokens/s/GPU): 8044.32 tokens/s/gpu

Output after this PR

AVERAGE ITERATION TIME (ms)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size       eager   inductor
0             phi-2         1     2048         1          1   1                       none          none  228.613844  223.16100
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  253.434114  242.95623

THROUGHPUT (tokens/s)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size         eager      inductor
0             phi-2         1     2048         1          1   1                       none          none   8958.255246   9176.770569
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  32323.473907  33726.136855

NORMALIZED THROUGHPUT (tokens/s/GPU)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size        eager     inductor
0             phi-2         1     2048         1          1   1                       none          none  8958.255246  9176.770569
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  8080.868477  8431.534214

MEMORY ALLOCATED (GB)
compiler model_name  Num GPUS  Seq Len  Micro BS  Global BS  GA           Distributed Mode Sharding Size      eager   inductor
0             phi-2         1     2048         1          1   1                       none          none  23.328389  24.670566
1             phi-2         4     2048         1          4   1  fsdp_zero2_none_bucketing          none  16.335063  17.677240
Model name: phi-2
Seq Length: 2048
Micro BS: 1
Global BS: 4
Number of Layers: 32
Number of parameters: 0.69B
Distributed Mode: fsdp
Sharding Mode: zero2
Sharding Size: None
Bucketing: none
Compiler: eager
Average iter time: 254.84 ms
Memory used: 16.34 GB
Tokens/s: 32144.11
Tokens/s/GPU: 8036.03
TFLOP/s: 575.35

@carmocca carmocca requested review from lantiga and t-vi as code owners March 20, 2024 23:16
@carmocca carmocca self-assigned this Mar 20, 2024
@t-vi t-vi merged commit 348597f into Lightning-AI:main Mar 21, 2024
@t-vi
Collaborator

t-vi commented Mar 21, 2024

Thank you @carmocca
