refactor recomputation to work with tags #1615

Merged: 4 commits merged into main on Jan 8, 2025

Conversation

@t-vi (Collaborator) commented Jan 7, 2025

Another step of #1560

This refactors the recomputation (activation checkpointing):

  • recomputation now works with tags,
  • intermediates created by autograd decompositions are automatically tagged to be recomputed (this matches the behaviour of rematerialize_forward_backward, I think),
  • I needed to disable rematerialize_forward_backward because it ran into "infinite capacity"; however, I think it is no longer needed after this PR (cc @IvanYashchuk),
  • the uniform -> get_and_update_random_state + uniform_philox transform is moved to before the autograd transform,
  • random ops are guarded against being recomputed.

This is expected to be memory/compute neutral (I'll report numbers in a bit). It does not yet add the checkpointing frontend for the jit (including using memory comparable to eager checkpointing); that will be a separate PR.
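
To make the tag mechanism concrete, here is a minimal, framework-agnostic sketch of the idea; the Tag/Node names and helpers below are purely illustrative, not thunder's actual API. The forward trace marks cheap intermediates with a recompute tag, saved-for-backward keeps only the untagged values plus the inputs needed to rebuild the tagged ones, and the backward re-runs the tagged ops before computing gradients. Random ops are deliberately left untagged, since recomputing them would draw different numbers in the backward pass.

# Minimal, framework-agnostic sketch of tag-driven recomputation.
# Tag, Node, and the helper names are illustrative, NOT thunder's actual API.
import math
from enum import Enum, auto


class Tag(Enum):
    RECOMPUTE_IN_BACKWARD = auto()


class Node:
    """One op in a toy forward trace: output name, function, input names, tags."""
    def __init__(self, out, fn, args, tags=()):
        self.out, self.fn, self.args, self.tags = out, fn, args, set(tags)


def run_forward(trace, env):
    for node in trace:
        env[node.out] = node.fn(*(env[a] for a in node.args))
    return env


def saved_for_backward(trace, env, inputs):
    """Save the inputs plus every intermediate NOT tagged for recomputation."""
    saved = {name: env[name] for name in inputs}
    saved.update({n.out: env[n.out] for n in trace if Tag.RECOMPUTE_IN_BACKWARD not in n.tags})
    return saved


def recompute_in_backward(trace, saved):
    """Re-run the tagged ops (in trace order) before computing gradients."""
    env = dict(saved)
    for node in trace:
        if Tag.RECOMPUTE_IN_BACKWARD in node.tags:
            env[node.out] = node.fn(*(env[a] for a in node.args))
    return env


# y = exp(x); z = y * y.  exp is cheap to redo, so y is tagged for recomputation;
# a random op would never be tagged, since re-running it gives different values.
trace = [
    Node("y", math.exp, ("x",), tags=(Tag.RECOMPUTE_IN_BACKWARD,)),
    Node("z", lambda a, b: a * b, ("y", "y")),
]
env = run_forward(trace, {"x": 1.5})
saved = saved_for_backward(trace, env, inputs=("x",))  # keeps x and z, drops y
bw_env = recompute_in_backward(trace, saved)           # rebuilds y from x
assert abs(bw_env["y"] - env["y"]) < 1e-12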

@riccardofelluga could you take a look? (should be no surprises relative to #1560)

@t-vi t-vi requested review from mruberry and lantiga as code owners January 7, 2025 20:43
@t-vi (Collaborator, Author) commented Jan 7, 2025

So, is there a regression in memory use?

python thunder/benchmarks/benchmark_litgpt.py --model_name stablecode-completion-alpha-3b --compile thunder --checkpoint_activations True --low_precision_mode none --micro_batch_size 1 --n_layer 4 --max_iters 3 --warmup_iters 2  --dump_thunder_traces True 

gives

  • Main: Average iter time: 1279.73 ms, Memory used: 13.27 GB
  • This PR: Average iter time: 1146.84 ms, Memory used: 13.29 GB

So ~10% faster and not much more memory (not sure where the extra 0.02 GB comes from...).

@t-vi (Collaborator, Author) commented Jan 7, 2025

So I'm not entirely happy with disabling rematerialize_forward_and_backward, but I wonder if the need arises from proxies that are recomputed and thus appear twice in the joint_extrace. (Hello, name collision.)
I'll try that either here or in a follow-up (likely by renaming all forward tensors to fw_... and all backward tensors to bw_..., or somesuch).
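
A rough sketch of that renaming idea, in case it helps the discussion; the trace layout (a list of (output_names, input_names) pairs) and the helper are made up for illustration and are not thunder's real trace API:

# Prefix every trace-local proxy name so forward and backward traces can no
# longer collide when stitched into one joint trace.  Names listed in
# `external_inputs` (e.g. saved-for-backward tensors shared between the two
# traces) keep their original names so cross-trace references stay valid.
def prefix_proxy_names(symbols, prefix, external_inputs):
    """symbols: list of (output_names, input_names) tuples in trace order."""
    rename, renamed = {}, []
    for outs, ins in symbols:
        # Inputs: use the renamed form if produced earlier in this trace,
        # keep external names as-is, and prefix anything else.
        new_ins = tuple(rename.get(n, n if n in external_inputs else prefix + n) for n in ins)
        new_outs = tuple(prefix + n for n in outs)
        rename.update(zip(outs, new_outs))
        renamed.append((new_outs, new_ins))
    return renamed


# "t2" in the forward trace and "t2" in the backward trace become distinct names:
fw = prefix_proxy_names([(("t2",), ("x",))], "fw_", external_inputs={"x"})
bw = prefix_proxy_names([(("t2",), ("fw_t2",))], "bw_", external_inputs={"fw_t2"})
assert fw == [(("fw_t2",), ("x",))] and bw == [(("bw_t2",), ("fw_t2",))]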

@riccardofelluga riccardofelluga self-requested a review January 8, 2025 09:52
@t-vi t-vi mentioned this pull request Jan 8, 2025
@t-vi (Collaborator, Author) commented Jan 8, 2025

Unfortunately, disabling the forward-backward-rematerialization adversely affects memory for Qwen, but happily we recover that when we do better checkpointing:

python thunder/benchmarks/benchmark_litgpt.py --model_name Qwen2.5-7B --compile thunder --checkpoint_activations True --low_precision_mode none --micro_batch_size 1 --n_layer 4 --max_iters 3 --warmup_iters 2 --block_size 4096 

  • Main: Average iter time: 788.88 ms, Memory used: 18.76 GB
  • This PR: Average iter time: 686.02 ms, Memory used: 20.77 GB
  • PR #1616 ("use tagging checkpointing"): Average iter time: 779.80 ms, Memory used: 18.34 GB
  • Eager instead of thunder: Average iter time: 803.95 ms, Memory used: 17.55 GB

@riccardofelluga (Collaborator) left a review comment:

Overall it looks great! Just a couple of nits and clarifications

Review threads (now resolved) on:
  • thunder/executors/torch_autograd.py
  • thunder/core/trace_interpreter.py
  • thunder/core/transforms.py (two threads)
@t-vi disabled auto-merge January 8, 2025 15:35
@t-vi merged commit e536ddc into main January 8, 2025 (38 of 41 checks passed)
@t-vi deleted the tom/recomputation-refactor branch January 8, 2025 15:35
Comment on lines +1900 to +1901
jfn = thunder.jit(fn, enable_saved_for_backward_recomputation=False)
jfn2 = thunder.jit(fn, enable_saved_for_backward_recomputation=True)
A collaborator commented:

What is the default value now?

@IvanYashchuk (Collaborator) commented:

> intermediates created by autograd decompositions are automatically tagged to be recomputed (this matches the behaviour of rematerialize_forward_backward, I think)

No, it doesn't match the behavior of rematerialize_forward_backward: what to recompute in fusion regions is decided by a min-cut-based algorithm. This PR introduced a regression in peak memory use (checked for Llama 2 7B): #1621.
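
For readers unfamiliar with the min-cut formulation, the rough idea is: each saveable tensor becomes an edge whose capacity is its memory cost, data-flow edges are uncuttable, the source feeds the forward inputs, and everything the backward needs feeds the sink; the minimum s-t cut then selects the cheapest set of tensors to save, and everything downstream of the cut gets recomputed. Below is a toy illustration with networkx on a made-up two-op graph; it is not thunder's actual implementation, which operates on fusion-region traces and is considerably more involved.

# Toy min-cut save/recompute decision (illustration only, not thunder's code).
import networkx as nx

sizes = {"x": 1, "a": 10, "b": 1}       # memory cost of saving each tensor
dataflow = [("x", "a"), ("a", "b")]     # a = f(x); b = g(a)
needed_by_backward = ["a", "b"]

G = nx.DiGraph()
for t, size in sizes.items():
    # Cutting the t_in -> t_out edge means "save t"; its capacity is t's size.
    G.add_edge(f"{t}_in", f"{t}_out", capacity=size)
for src, dst in dataflow:
    G.add_edge(f"{src}_out", f"{dst}_in")   # no capacity attr == infinite: dataflow can't be cut
G.add_edge("source", "x_in")                # forward inputs are always available
for t in needed_by_backward:
    G.add_edge(f"{t}_out", "sink")          # backward must be able to reach these

cut_value, (reachable, _) = nx.minimum_cut(G, "source", "sink")
saved = {n[:-3] for n in reachable if n.endswith("_in") and n[:-3] + "_out" not in reachable}
print(cut_value, saved)  # -> 1 {'x'}: cheapest plan saves only x and recomputes a and b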

> I needed to disable rematerialize_forward_backward because it ran into "infinite capacity".

The rematerialization code makes assumptions about its input traces in order to function; these assumptions were violated, resulting in the "infinite capacity" error. @riccardofelluga was hitting the same problem when working on the recomputation. The problems are supposed to be fixed by #1367. Riccardo, what's the current status of #1367?

@riccardofelluga (Collaborator) commented:

#1367 was parked due to a change of priorities; we could, though, bring it back here, adapting it to the new saved-for-backward logic of #1615. Then again, it could also be that the infinite-capacity error is reached through a different cause.
