
Conversation

@mori360 (Contributor) commented on Feb 9, 2026

The previous activation checkpointing implementation had each model define its own _op_sac_save_list in its respective parallelize.py file. Manually maintaining these op lists risks missing standard compute-intensive ops that PyTorch's get_default_op_list() already provides.
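For context, the per-model lists looked roughly like the sketch below. This is an illustrative reconstruction, not the exact contents of any model's parallelize.py; the specific ops shown are assumptions.

```python
import torch

# Hand-maintained per-model save list (illustrative): any compute-intensive op
# that is forgotten here gets recomputed in the backward pass instead of saved.
_op_sac_save_list = {
    torch.ops.aten.mm.default,
    torch.ops.aten._scaled_dot_product_efficient_attention.default,
    torch.ops.aten._scaled_dot_product_flash_attention.default,
}
```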

This PR:

  1. Centralize the ops in activation_checkpoint.py via default_activation_checkpoint_policy() (sketched below)
  2. Use get_default_op_list() from torch._functorch.partitioners as the foundation
  3. Split the op list into compute and communication cases, and extend them with ops currently used in torchtitan but not included in get_default_op_list()
  4. Wrap the policy specially when ac_config.per_op_sac_force_recompute_mm_shapes_by_fqns is set
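A minimal sketch of what the centralized policy could look like, assuming the default_activation_checkpoint_policy() name from this PR. The ac_config parameter, the specific extra ops, and the compute_intensive_ops field of the object returned by get_default_op_list() are assumptions for illustration; the mm-shape wrapper from item 4 is only noted in a comment.

```python
import torch
from torch._functorch.partitioners import get_default_op_list
from torch.utils.checkpoint import CheckpointPolicy


def default_activation_checkpoint_policy(ac_config):
    defaults = get_default_op_list()

    # Start from PyTorch's default compute-intensive ops, then extend with ops
    # torchtitan already saves but the default list does not cover (illustrative).
    compute_ops = set(defaults.compute_intensive_ops) | {
        torch.ops.aten._scaled_dot_product_flash_attention.default,
    }
    # Communication ops whose outputs are worth saving under per-op SAC (illustrative).
    communicate_ops = {
        torch.ops._c10d_functional.reduce_scatter_tensor.default,
    }
    save_list = compute_ops | communicate_ops

    def _policy(ctx, func, *args, **kwargs):
        # Save outputs of compute/communication-heavy ops; recompute everything else.
        # Check both the OpOverload and its packet, since the default list may
        # contain either form.
        if func in save_list or getattr(func, "overloadpacket", None) in save_list:
            return CheckpointPolicy.MUST_SAVE
        return CheckpointPolicy.PREFER_RECOMPUTE

    # If ac_config.per_op_sac_force_recompute_mm_shapes_by_fqns is set, the PR
    # additionally wraps the policy so mm calls whose shapes match the configured
    # module FQNs are forced to recompute; that wrapper is omitted from this sketch.
    return _policy
```

In use, a policy function like this is typically fed to torch.utils.checkpoint.create_selective_checkpoint_contexts to build the context_fn for non-reentrant activation checkpointing.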

@meta-cla bot added the CLA Signed label on Feb 9, 2026

Labels

ciflow/8gpu, CLA Signed
