
update checkpointing support for jit #1560

Open

t-vi wants to merge 22 commits into main from tom/checkpointing-memory
Conversation

@t-vi (Collaborator) commented Dec 17, 2024

I'll add switches and testing of checkpointing, but here are the material code changes.
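
For context, a minimal sketch of the kind of usage this is meant to support, assuming the standard torch.utils.checkpoint API and thunder.jit; the Block module, sizes, and inputs below are made up for illustration and are not part of the PR:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

import thunder


class Block(nn.Module):
    # Hypothetical block used only to illustrate activation checkpointing.
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Recompute this block's activations during backward instead of keeping them alive.
        return checkpoint(self.ff, x, use_reentrant=False)


model = Block(256)
jmodel = thunder.jit(model)  # the jitted module should trace through the checkpoint call

x = torch.randn(8, 256, requires_grad=True)
jmodel(x).sum().backward()
```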

@t-vi (Collaborator, Author) commented Dec 17, 2024

There are a number of things still to be fixed.
There is a failure with the memory calculation because we don't update the initial collection proxy.
I'll look into these.


@t-vi (Collaborator, Author) commented Dec 18, 2024

To my mind, the remaining missing bit is the handling of the uniform -> uniform_philox conversion (currently handled by rematerialization) and its interaction with recomputation. I'll look into it. In the meantime, I'd be keen to hear complaints and/or success stories about the memory / performance impact.
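
For anyone wanting to report numbers, a generic way to capture the peak memory of one step (plain PyTorch, not part of this PR; the tiny model and batch are stand-ins so the pattern is runnable, substitute your jitted model and real data):

```python
import torch
import torch.nn as nn

# Stand-in model and batch; replace with the jitted model and a real micro-batch.
model = nn.Linear(1024, 1024).cuda()
batch = torch.randn(64, 1024, device="cuda")

torch.cuda.reset_peak_memory_stats()
loss = model(batch).sum()
loss.backward()
torch.cuda.synchronize()
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```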

@t-vi t-vi force-pushed the tom/checkpointing-memory branch from e103484 to 891bf84 Compare December 18, 2024 14:38
@riccardofelluga (Collaborator) commented
I was trying to check memory savings but it looks like the following hangs:

python thunder/benchmarks/benchmark_litgpt.py --model_name stablecode-completion-alpha-3b --compile thunder --checkpoint_activations True --low_precision_mode none --micro_batch_size 1

:(

@t-vi mentioned this pull request Dec 19, 2024
@t-vi (Collaborator, Author) commented Dec 19, 2024

So one needs to enable checkpointing of layers with compiler==thunder for this. Even then, the memory profile of the backward is still terrible.
Running @riccardofelluga's benchmark: eager with checkpointing and 8 layers needs 12.59 GB, while thunder with checkpointing and 8 layers needs 38 GB.
This is not explained by the tensors saved for backward, which for thunder are: saved for backward size: 1761.89 MiB, number of tensors: 103.

I think we need to look more closely at the memory over time.
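
One way to look at memory over time is PyTorch's allocator history snapshot; these are private helpers under torch.cuda.memory and may change between releases, so treat this as a sketch. `train_step` is a placeholder for one benchmark iteration:

```python
import torch

# Record allocator events, run a few steps, then dump a snapshot that can be
# loaded in the viewer at https://pytorch.org/memory_viz.
torch.cuda.memory._record_memory_history(max_entries=100_000)

for _ in range(3):
    train_step()  # placeholder for one forward/backward/optimizer step

torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```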

@t-vi (Collaborator, Author) commented Dec 19, 2024

Doing the following:

  • apply del_last_used to every trace in last_backward_traces
  • run the memory examine tool (so awesome @kiya00 !)

we see that we are doing much better at first than after the transform for execution (this is for 4 layers), so we still have reordering that hurts us:

#the following trace uses ~9.84GB memory
# Constructed by Saved for backward remat trace (took 20.46 milliseconds)
#the following trace uses ~12.39GB memory
# Constructed by Transform for execution (took 886 milliseconds)

Note that the difference of 2.55GB is smaller than the difference between thunder and eager (5.82GB).
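
For reference, a sketch of how to pull those traces out for inspection, assuming thunder's last_traces / last_backward_traces helpers behave as documented; the small Linear model here is only an illustration:

```python
import torch
import torch.nn as nn
import thunder

# After a thunder.jit-ed module has run forward and backward, its traces are
# recorded; the last entry of each list is the trace that actually executed.
model = nn.Linear(64, 64)
jmodel = thunder.jit(model)
jmodel(torch.randn(2, 64, requires_grad=True)).sum().backward()

print(thunder.last_traces(jmodel)[-1])           # final executed forward trace
print(thunder.last_backward_traces(jmodel)[-1])  # final executed backward trace
```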

@t-vi t-vi force-pushed the tom/checkpointing-memory branch from ebf420c to b1604ee Compare December 20, 2024 21:50
@t-vi (Collaborator, Author) commented Dec 21, 2024

With the two latest bits, I have that

python thunder/benchmarks/benchmark_litgpt.py --model_name stablecode-completion-alpha-3b --compile thunder --checkpoint_activations True --low_precision_mode none --micro_batch_size 1 --n_layer 4

is on par with (even a little below) the same command with --compile eager.
I will be looking at the failing tests and splitting out some bits that can be handled independently.
