Add a benchmark for portions of LitGPT model other than SDPA #148

Closed
IvanYashchuk wants to merge 24 commits from the litgpt-chunks-bench branch

Conversation

IvanYashchuk (Collaborator) commented on Apr 9, 2024

This PR adds a new benchmark; run it with:

```
pytest thunder/benchmarks/litgpt_chunks.py --benchmark-group-by='group,param:info' --benchmark-columns='min,max,mean,stddev,median'
```

The intent is to be able to compare performance on sections of the GPT network that are not covered by a FlashAttention kernel.

Constructing the benchmark cases is slow; passing -s to pytest prints a progress log:

Constructing benchmark cases for config: CodeLlama-13b-hf (1/39)
Constructing benchmark cases for config: CodeLlama-34b-hf (2/39)
Constructing benchmark cases for config: CodeLlama-70b-hf (3/39)
Constructing benchmark cases for config: CodeLlama-7b-hf (4/39)
Constructing benchmark cases for config: Gemma-2b-it (5/39)
Constructing benchmark cases for config: Gemma-7b-it (6/39)
Constructing benchmark cases for config: Mistral-7B-v0.1 (7/39)
Constructing benchmark cases for config: Nous-Hermes-13b (8/39)
Constructing benchmark cases for config: Nous-Hermes-Llama2-13b (9/39)
Constructing benchmark cases for config: Platypus2-70B (10/39)
Constructing benchmark cases for config: Platypus2-70B-instruct (11/39)
Constructing benchmark cases for config: RedPajama-INCITE-Instruct-3B-v1 (12/39)
Constructing benchmark cases for config: dolly-v2-12b (13/39)
Constructing benchmark cases for config: dolly-v2-3b (14/39)
Constructing benchmark cases for config: dolly-v2-7b (15/39)
Constructing benchmark cases for config: falcon-180B-chat (16/39)
Constructing benchmark cases for config: falcon-40b-instruct (17/39)
Constructing benchmark cases for config: falcon-7b-instruct (18/39)
Constructing benchmark cases for config: open_llama_3b (19/39)
Constructing benchmark cases for config: phi-1_5 (20/39)
Constructing benchmark cases for config: phi-2 (21/39)
Constructing benchmark cases for config: pythia-1.4b-deduped (22/39)
Constructing benchmark cases for config: pythia-12b-deduped (23/39)
Constructing benchmark cases for config: pythia-14m (24/39)
Constructing benchmark cases for config: pythia-160m-deduped (25/39)
Constructing benchmark cases for config: pythia-1b-deduped (26/39)
Constructing benchmark cases for config: pythia-2.8b-deduped (27/39)
Constructing benchmark cases for config: pythia-31m (28/39)
Constructing benchmark cases for config: pythia-410m-deduped (29/39)
Constructing benchmark cases for config: pythia-6.9b-deduped (30/39)
Constructing benchmark cases for config: pythia-70m-deduped (31/39)
Constructing benchmark cases for config: stablecode-instruct-alpha-3b (32/39)
Constructing benchmark cases for config: stablelm-tuned-alpha-3b (33/39)
Constructing benchmark cases for config: stablelm-tuned-alpha-7b (34/39)
Constructing benchmark cases for config: stablelm-zephyr-3b (35/39)
Constructing benchmark cases for config: tiny-llama-1.1b-chat (36/39)
Constructing benchmark cases for config: vicuna-13b-v1.5-16k (37/39)
Constructing benchmark cases for config: vicuna-33b-v1.3 (38/39)
Constructing benchmark cases for config: vicuna-7b-v1.5-16k (39/39)

and it takes 6 minutes to generate the test cases:

468 tests collected in 377.53s (0:06:17)

The batch size can be controlled by modifying BATCH_SIZE in litgpt_chunks.py, and the set of configs to benchmark by editing the CONFIG_NAMES list.

Thunder is used to trace the LitGPT code, and the trace is then split into chunks using the SDPA call as a delimiter. Since the GPT model has a for-loop structure, it is enough to trace a model with just two transformer blocks; the resulting program chunks are what gets benchmarked.
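
For intuition, here is a minimal sketch of the chunking idea, not the code in this PR: walk the traced operations in order and start a new chunk at every SDPA call. The op objects, their name attribute, and the split_at_sdpa helper are hypothetical.

```
# Illustrative sketch only, not the PR's implementation: split a flat list of
# traced operations into chunks, using scaled_dot_product_attention calls as
# delimiters. Each op is assumed (hypothetically) to expose a `name` string;
# the SDPA call itself is dropped since it is benchmarked separately.
def split_at_sdpa(ops):
    chunks, current = [], []
    for op in ops:
        if "scaled_dot_product_attention" in getattr(op, "name", ""):
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(op)
    if current:
        chunks.append(current)
    return chunks
```

With two SDPA calls in the trace, this sketch splits the program into three non-SDPA chunks: the prologue up to the first SDPA, the span between the two calls, and the epilogue after the second one.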

We could save the result of generating the test cases to disk, but that's left as an exercise for the future.

TODO:

  • Run the benchmark on H100

cc @crcrpar @kevinstephano


```
BATCH_SIZE = 2
CONFIG_NAMES = list(sorted(c["name"] for c in configs))
# CONFIG_NAMES = ["Llama-2-7b-hf",]
```
IvanYashchuk (author) commented:
Uncommenting this would force generating benchmark cases just for this Llama 2 7B config.

Comment on lines 732 to 733
```
inductor_cutlass_executor = partial(inductor_gemm_executor, gemm_backend="ATEN,CUTLASS")
inductor_triton_executor = partial(inductor_gemm_executor, gemm_backend="ATEN,TRITON")
```
IvanYashchuk (author) commented on Apr 10, 2024:
  • Maybe "ATEN" should be removed here, forcing Inductor to use CUTLASS or Triton for GEMMs. I should check whether that works without any errors.
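
For context, a rough sketch of what such an executor could look like, restricting Inductor's GEMM autotuning backends; the function body is an assumption and may not match the actual inductor_gemm_executor in this PR.

```
# Hypothetical sketch, not the PR's actual inductor_gemm_executor: compile a
# callable with Inductor while limiting GEMM autotuning to the listed backends
# (e.g. "ATEN,TRITON", "ATEN,CUTLASS", or "CUTLASS" alone).
import torch
from torch._inductor import config as inductor_config

def inductor_gemm_executor_sketch(fn, gemm_backend: str):
    inductor_config.max_autotune_gemm_backends = gemm_backend
    inductor_config.max_autotune = True  # enable GEMM autotuning
    return torch.compile(fn)
```

If the config behaves as assumed here, dropping "ATEN" from the backend string would leave only the CUTLASS or Triton candidates for the autotuner to choose from.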

Comment on lines +202 to +204
```
# litgpt_traces = [
#     TraceInfo(name, i, trace) for name in CONFIG_NAMES for i, trace in enumerate(make_torch_traces_for_config(name))
# ]
```
A collaborator asked:
Is this still needed?

IvanYashchuk (author) replied:
List comprehensions are easier for me to read than the for-loop below. I'll remove this, of course.

riccardofelluga (Collaborator) left a review:
Looks good to me! I've played a bit with make_torch_traces_for_config, and it does the job as long as the part we are interested in comes after SDPA.

IvanYashchuk marked this pull request as draft on April 11, 2024.
Try with
```
pytest thunder/benchmarks/litgpt_chunks.py --benchmark-group-by='group,param:info' --benchmark-columns='min,max,mean,stddev,median'
```
lantiga (Collaborator) commented on May 30, 2024:

Hey @IvanYashchuk, should we revive this or close it for now? We can add a label for PRs we close that could potentially be of interest in the future.

IvanYashchuk (author) replied:
I've put it in draft to prevent merging because I need more time to think about it and convince myself again that it's something we need in the project.
I prefer to keep it in the draft stage to remind myself about it every day.

lantiga commented on May 31, 2024:

"draft" is the new browser tab haha

lantiga added the later label on Jul 3, 2024
lantiga closed this on Jul 3, 2024
t-vi deleted the litgpt-chunks-bench branch on July 16, 2024