-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor recomputation to work with tags #1615
Conversation
So there is quite a regression in memory use:
gives
So 10% faster and not that much more memory (not sure what the 0.02GB are...) |
So I'm not entirely happy with disabling the |
Unfortunately, disabling the forward-backward-rematerialization adversely affects memory for Qwen, but happily we recover that when we do better checkpointing:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it looks great! Just a couple of nits and clarifications
jfn = thunder.jit(fn, enable_saved_for_backward_recomputation=False) | ||
jfn2 = thunder.jit(fn, enable_saved_for_backward_recomputation=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the default value now?
No, it doesn't match the behavior of
The rematerialization code has assumptions on input traces to be functioning, these assumptions were violated resulting in the "infinite capacity" error. @riccardofelluga was hitting the same problem when working on the recomputation. The problems are supposed to be fixed with #1367. Riccardo, what's the current status of 1367? |
Another step of #1560
This refactors the recomputation (activation checkpointing):
This is expected to be memory/compute neutral (I'll report numbers in a bit). It does not yet do the checkpointing frontend for the jit (including using memory commparable to eager checkpointing), that will be a separate PR.
@riccardofelluga could you take a look? (should be no surprises relative to #1560)