
Conversation

@mori360 (Contributor) commented on Feb 9, 2026

The previous activation checkpointing implementation had each model define its own _op_sac_save_list in its respective parallelize.py file. Manually maintaining these op lists risks missing standard compute-intensive ops that PyTorch's get_default_op_list() already provides.
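For context, the per-model lists looked roughly like the sketch below. This is an illustrative reconstruction, not the exact contents of any model's parallelize.py; the specific ops shown are assumptions.

```python
import torch

# Hand-maintained per-model save list (illustrative): any compute-intensive op
# that is forgotten here gets recomputed in the backward pass instead of saved.
_op_sac_save_list = {
    torch.ops.aten.mm.default,
    torch.ops.aten._scaled_dot_product_efficient_attention.default,
    torch.ops.aten._scaled_dot_product_flash_attention.default,
}
```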

This PR:

  1. Centralize the ops in activation_checkpoint.py via default_activation_checkpoint_policy() (sketched below)
  2. Use get_default_op_list() from torch._functorch.partitioners as the foundation
  3. Split the op list into compute and communication cases, and extend them with ops currently used in torchtitan but not included in get_default_op_list()
  4. Wrap the policy specially when ac_config.per_op_sac_force_recompute_mm_shapes_by_fqns is set
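A minimal sketch of what the centralized policy could look like, assuming the default_activation_checkpoint_policy() name from this PR. The ac_config parameter, the specific extra ops, and the compute_intensive_ops field of the object returned by get_default_op_list() are assumptions for illustration; the mm-shape wrapper from item 4 is only noted in a comment.

```python
import torch
from torch._functorch.partitioners import get_default_op_list
from torch.utils.checkpoint import CheckpointPolicy


def default_activation_checkpoint_policy(ac_config):
    defaults = get_default_op_list()

    # Start from PyTorch's default compute-intensive ops, then extend with ops
    # torchtitan already saves but the default list does not cover (illustrative).
    compute_ops = set(defaults.compute_intensive_ops) | {
        torch.ops.aten._scaled_dot_product_flash_attention.default,
    }
    # Communication ops whose outputs are worth saving under per-op SAC (illustrative).
    communicate_ops = {
        torch.ops._c10d_functional.reduce_scatter_tensor.default,
    }
    save_list = compute_ops | communicate_ops

    def _policy(ctx, func, *args, **kwargs):
        # Save outputs of compute/communication-heavy ops; recompute everything else.
        # Check both the OpOverload and its packet, since the default list may
        # contain either form.
        if func in save_list or getattr(func, "overloadpacket", None) in save_list:
            return CheckpointPolicy.MUST_SAVE
        return CheckpointPolicy.PREFER_RECOMPUTE

    # If ac_config.per_op_sac_force_recompute_mm_shapes_by_fqns is set, the PR
    # additionally wraps the policy so mm calls whose shapes match the configured
    # module FQNs are forced to recompute; that wrapper is omitted from this sketch.
    return _policy
```

In use, a policy function like this is typically fed to torch.utils.checkpoint.create_selective_checkpoint_contexts to build the context_fn for non-reentrant activation checkpointing.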

@meta-cla bot added the CLA Signed label on Feb 9, 2026

Labels

ciflow/8gpu, CLA Signed
