
LitGPT benchmarking: Use native PyTorch checkpointing in the dynamo+thunder path #1370

Merged
merged 1 commit into main on Oct 31, 2024

Conversation

@kiya00 (Collaborator) commented on Oct 30, 2024

Use the native PyTorch checkpoint option in the litgpt benchmark for the Thunder Dynamo path.

Ref #1298.

H100*8 ZeRO3 with checkpointing
torchrun --nproc_per_node=8 --nnodes=1 thunder/benchmarks/benchmark_litgpt.py --model_name CodeLlama-34b-hf --micro_batch_size 1 --compile thunder-dynamo --checkpoint_activations=True --distributed_mode=fsdp --shard_mode zero3 --max_iters=4 --warmup_iters=1

Model                micro batch size   peak mem
longchat-13b-16k     3                  50.40 GB
CodeLlama-34b-hf     1                  48.07 GB
Gemma-2-27b          1                  OOM
Llama-3-70B          1                  78.77 GB
Mistral-7B-v0.2      3                  67.44 GB
vicuna-7b-v1.5-16k   5                  54.81 GB

Note:
This PR enables ThunderFX + native PyTorch checkpointing.
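
For context, "native PyTorch checkpointing" here refers to torch.utils.checkpoint.checkpoint, which recomputes a wrapped region's activations during backward instead of storing them, trading extra compute for lower peak memory. A minimal sketch of the idea (the Block module below is illustrative; the actual wrapping in benchmark_litgpt.py may differ):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    # Illustrative block; benchmark_litgpt.py wraps LitGPT modules, not this one.
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Activations inside self.mlp are recomputed during backward instead of stored.
        return x + checkpoint(self.mlp, x, use_reentrant=False)

x = torch.randn(2, 1024, requires_grad=True)
Block()(x).sum().backward()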

Single GPU:
The splitter creates the module as follows:

GraphModule(
  (thunder_1): ThunderModule(
    (_model): GraphModule()
  )
  (inductor_2): OptimizedModule(
    (_orig_mod): GraphModule(
      (wrap_body_0): GraphModule()
    )
  )
  (thunder_3): ThunderModule(
    (_model): GraphModule()
  )
  (inductor_4): OptimizedModule(
    (_orig_mod): GraphModule(
      (wrap_body_1): GraphModule()
    )
  )
  (thunder_5): ThunderModule(
    (_model): GraphModule()
  )
)

The checkpoint operator is not supported by Thunder, so it falls back to running with Inductor (the converter PR #1261 can fix this).
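
A rough sketch of how this split can be reproduced on a small model, assuming ThunderCompiler from thunder.dynamo as the torch.compile backend (the TinyModel and the subgraph_infos inspection are illustrative and may differ across thunder versions):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from thunder.dynamo import ThunderCompiler

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 16)
        self.fc2 = nn.Linear(16, 16)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        # The checkpointed region is what ends up in an inductor_* submodule;
        # the surrounding code stays in thunder_* submodules.
        x = checkpoint(self.fc2, x, use_reentrant=False)
        return torch.tanh(x)

backend = ThunderCompiler()
model = torch.compile(TinyModel(), backend=backend)
model(torch.randn(4, 16, requires_grad=True)).sum().backward()
# backend.subgraph_infos records how each Dynamo graph was split between
# Thunder and Inductor (attribute name assumed; it may differ across versions).
print(backend.subgraph_infos)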

ZeRO3:
When --bucketing_mode=none is used, Dynamo passes to the backend (the gm in ThunderCompiler.__call__) only those parts of the original model that do not contain a checkpoint operator.

@IvanYashchuk changed the title from "Add native PyTorch checkpoint in litgpt benchmark (#1298)" to "LitGPT benchmarking: Use native PyTorch checkpointing in the dynamo+thunder path" on Oct 30, 2024
@IvanYashchuk added the memory use and thunderfx (for things that could be applicable to the dynamo+thunder frontend) labels on Oct 30, 2024
@IvanYashchuk removed the request for review from crcrpar on October 30, 2024 18:17
@IvanYashchuk (Collaborator) commented:
When --bucketing_mode=block is used then Dynamo starts sending the graphs with a torch.ops.higher_order.tag_activation_checkpoint operator inside for which we need special processing added in #1261.
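
To see what such a graph looks like, a custom Dynamo backend that just prints what it receives is enough; with a non-reentrant checkpoint inside the compiled function, the printed graph contains a call to torch.ops.higher_order.tag_activation_checkpoint (a minimal sketch, independent of the benchmark script):

import torch
from torch.utils.checkpoint import checkpoint

def printing_backend(gm, example_inputs):
    # The graph handed to the backend contains the
    # torch.ops.higher_order.tag_activation_checkpoint node.
    print(gm.graph)
    return gm.forward  # run the graph eagerly after printing

def fn(x, w):
    return checkpoint(lambda t: torch.relu(t @ w), x, use_reentrant=False)

compiled = torch.compile(fn, backend=printing_backend)
compiled(torch.randn(4, 4, requires_grad=True), torch.randn(4, 4))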

@IvanYashchuk (Collaborator) commented:
@t-vi, can you please merge this pull request?

@IvanYashchuk enabled auto-merge (squash) on October 30, 2024 19:35
@t-vi (Collaborator) left a comment


@IvanYashchuk merged commit 7b52be0 into main on Oct 31, 2024
43 checks passed
@IvanYashchuk deleted the ckp-benchmark branch on October 31, 2024 08:28