Add a benchmark for portions of LitGPT model other than SDPA #148

Closed
IvanYashchuk wants to merge 24 commits from the litgpt-chunks-bench branch

Conversation

IvanYashchuk (Collaborator) commented on Apr 9, 2024

This PR adds a new benchmark; run it with:

```
pytest thunder/benchmarks/litgpt_chunks.py --benchmark-group-by='group,param:info' --benchmark-columns='min,max,mean,stddev,median'
```

The intent is to be able to compare performance on sections of the GPT network that are not covered by a FlashAttention kernel.

Constructing the benchmark cases is slow; passing -s to pytest prints a progress log:

Constructing benchmark cases for config: CodeLlama-13b-hf (1/39)
Constructing benchmark cases for config: CodeLlama-34b-hf (2/39)
Constructing benchmark cases for config: CodeLlama-70b-hf (3/39)
Constructing benchmark cases for config: CodeLlama-7b-hf (4/39)
Constructing benchmark cases for config: Gemma-2b-it (5/39)
Constructing benchmark cases for config: Gemma-7b-it (6/39)
Constructing benchmark cases for config: Mistral-7B-v0.1 (7/39)
Constructing benchmark cases for config: Nous-Hermes-13b (8/39)
Constructing benchmark cases for config: Nous-Hermes-Llama2-13b (9/39)
Constructing benchmark cases for config: Platypus2-70B (10/39)
Constructing benchmark cases for config: Platypus2-70B-instruct (11/39)
Constructing benchmark cases for config: RedPajama-INCITE-Instruct-3B-v1 (12/39)
Constructing benchmark cases for config: dolly-v2-12b (13/39)
Constructing benchmark cases for config: dolly-v2-3b (14/39)
Constructing benchmark cases for config: dolly-v2-7b (15/39)
Constructing benchmark cases for config: falcon-180B-chat (16/39)
Constructing benchmark cases for config: falcon-40b-instruct (17/39)
Constructing benchmark cases for config: falcon-7b-instruct (18/39)
Constructing benchmark cases for config: open_llama_3b (19/39)
Constructing benchmark cases for config: phi-1_5 (20/39)
Constructing benchmark cases for config: phi-2 (21/39)
Constructing benchmark cases for config: pythia-1.4b-deduped (22/39)
Constructing benchmark cases for config: pythia-12b-deduped (23/39)
Constructing benchmark cases for config: pythia-14m (24/39)
Constructing benchmark cases for config: pythia-160m-deduped (25/39)
Constructing benchmark cases for config: pythia-1b-deduped (26/39)
Constructing benchmark cases for config: pythia-2.8b-deduped (27/39)
Constructing benchmark cases for config: pythia-31m (28/39)
Constructing benchmark cases for config: pythia-410m-deduped (29/39)
Constructing benchmark cases for config: pythia-6.9b-deduped (30/39)
Constructing benchmark cases for config: pythia-70m-deduped (31/39)
Constructing benchmark cases for config: stablecode-instruct-alpha-3b (32/39)
Constructing benchmark cases for config: stablelm-tuned-alpha-3b (33/39)
Constructing benchmark cases for config: stablelm-tuned-alpha-7b (34/39)
Constructing benchmark cases for config: stablelm-zephyr-3b (35/39)
Constructing benchmark cases for config: tiny-llama-1.1b-chat (36/39)
Constructing benchmark cases for config: vicuna-13b-v1.5-16k (37/39)
Constructing benchmark cases for config: vicuna-33b-v1.3 (38/39)
Constructing benchmark cases for config: vicuna-7b-v1.5-16k (39/39)

and it takes 6 minutes to generate the test cases:

468 tests collected in 377.53s (0:06:17)

The batch size can be controlled by modifying BATCH_SIZE in litgpt_chunks.py, and the set of configs to benchmark by editing the CONFIG_NAMES list.

Thunder is used to trace the LitGPT code, and the trace is then split into chunks using the SDPA call as a delimiter. Since the GPT model has a for-loop structure, it is enough to trace a model with just two transformer blocks; the resulting program chunks are what gets benchmarked.
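
For intuition, here is a minimal sketch of the chunking idea, not the code in this PR: walk the traced operations in order and start a new chunk at every SDPA call. The op objects, their name attribute, and the split_at_sdpa helper are hypothetical.

```
# Illustrative sketch only, not the PR's implementation: split a flat list of
# traced operations into chunks, using scaled_dot_product_attention calls as
# delimiters. Each op is assumed (hypothetically) to expose a `name` string;
# the SDPA call itself is dropped since it is benchmarked separately.
def split_at_sdpa(ops):
    chunks, current = [], []
    for op in ops:
        if "scaled_dot_product_attention" in getattr(op, "name", ""):
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(op)
    if current:
        chunks.append(current)
    return chunks
```

With two SDPA calls in the trace, this sketch splits the program into three non-SDPA chunks: the prologue up to the first SDPA, the span between the two calls, and the epilogue after the second one.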

We could save the result of generating the test cases to disk, but that's left as an exercise for the future.

TODO:

  • Run the benchmark on H100

cc @crcrpar @kevinstephano


```
BATCH_SIZE = 2
CONFIG_NAMES = list(sorted(c["name"] for c in configs))
# CONFIG_NAMES = ["Llama-2-7b-hf",]
```
IvanYashchuk (author) commented:
Uncommenting this would force generating benchmark cases just for this Llama 2 7B config.

Comment on lines 732 to 733
```
inductor_cutlass_executor = partial(inductor_gemm_executor, gemm_backend="ATEN,CUTLASS")
inductor_triton_executor = partial(inductor_gemm_executor, gemm_backend="ATEN,TRITON")
```
IvanYashchuk (author) commented on Apr 10, 2024:
  • Maybe "ATEN" should be removed here, forcing Inductor to use CUTLASS or Triton for GEMMs. I should check whether that works without any errors.
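
For context, a rough sketch of what such an executor could look like, restricting Inductor's GEMM autotuning backends; the function body is an assumption and may not match the actual inductor_gemm_executor in this PR.

```
# Hypothetical sketch, not the PR's actual inductor_gemm_executor: compile a
# callable with Inductor while limiting GEMM autotuning to the listed backends
# (e.g. "ATEN,TRITON", "ATEN,CUTLASS", or "CUTLASS" alone).
import torch
from torch._inductor import config as inductor_config

def inductor_gemm_executor_sketch(fn, gemm_backend: str):
    inductor_config.max_autotune_gemm_backends = gemm_backend
    inductor_config.max_autotune = True  # enable GEMM autotuning
    return torch.compile(fn)
```

If the config behaves as assumed here, dropping "ATEN" from the backend string would leave only the CUTLASS or Triton candidates for the autotuner to choose from.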

Comment on lines +202 to +204
```
# litgpt_traces = [
#     TraceInfo(name, i, trace) for name in CONFIG_NAMES for i, trace in enumerate(make_torch_traces_for_config(name))
# ]
```
A collaborator asked:
Is this still needed?

IvanYashchuk (author) replied:
List comprehensions are easier for me to read than the for-loop below. I'll remove this, of course.

riccardofelluga (Collaborator) left a review:
Looks good to me! I've played a bit with make_torch_traces_for_config, and it does the job as long as the part we are interested in comes after SDPA.

IvanYashchuk marked this pull request as draft on April 11, 2024.
Try with
```
pytest thunder/benchmarks/litgpt_chunks.py --benchmark-group-by='group,param:info' --benchmark-columns='min,max,mean,stddev,median'
```
lantiga (Collaborator) commented on May 30, 2024:

Hey @IvanYashchuk, should we revive this or close it for now? We can add a label for PRs we close that could potentially be of interest in the future.

IvanYashchuk (author) replied:
I've put it in draft to prevent merging because I need more time to think about it and convince myself again that it's something we need in the project.
I prefer to keep it in the draft stage to remind myself about it every day.

lantiga commented on May 31, 2024:

"draft" is the new browser tab haha

lantiga added the later label on Jul 3, 2024
lantiga closed this on Jul 3, 2024
t-vi deleted the litgpt-chunks-bench branch on July 16, 2024