feat: add checkpoint retention limit to automatically clean up old checkpoints #1798

Open
stevewx wants to merge 3 commits into THUDM:main from stevewx:feature/max-ckpt-to-keep
Conversation

Contributor

@stevewx stevewx commented Apr 2, 2026

Summary

  • Add --max-actor-ckpt-to-keep and --max-critic-ckpt-to-keep to limit checkpoints kept on disk during long training runs
  • Add --checkpoint-storage-type (shared/local) to control which rank performs cleanup
  • Cleanup runs before each new save and retains only the newest N-1 checkpoints (keep-1), so disk usage peaks at N checkpoints once the new save completes
  • Supports both Megatron and HF checkpoint formats
  • Only cleans up checkpoints from the current run — previous run checkpoints are untouched
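
The keep-1 arithmetic above can be illustrated with a minimal sketch. `select_ids_to_delete` is a hypothetical helper, not a function from this PR, and treating a non-positive limit as "cleanup disabled" is an assumption rather than confirmed behavior.

```python
def select_ids_to_delete(saved_rollout_ids, max_to_keep):
    """Return the rollout IDs to delete before the next save.

    Hypothetical helper: retains the newest (max_to_keep - 1) tracked IDs so
    that the upcoming save brings the on-disk total back up to max_to_keep.
    """
    if max_to_keep is None or max_to_keep <= 0:
        # Assumption: an unset or non-positive limit disables cleanup.
        return []
    retain = max_to_keep - 1
    ordered = sorted(saved_rollout_ids)
    if len(ordered) <= retain:
        return []
    # With retain == 0 this slice covers every previously saved checkpoint.
    return ordered[: len(ordered) - retain]


# Simulate 5 saves with a limit of 2 and record the peak number on disk.
saved = []
peak = 0
for rollout_id in range(5):
    for old_id in select_ids_to_delete(saved, 2):
        saved.remove(old_id)   # stands in for deleting the directory
    saved.append(rollout_id)   # stands in for writing the new checkpoint
    peak = max(peak, len(saved))
```

Because deletion happens before the write, the count never exceeds the limit, at the cost of briefly holding only N-1 checkpoints while the new one is being written.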

Design

  • Unified cleanup_old_checkpoints() in slime/utils/checkpoint_utils.py with a path_fn callback to compose paths from rollout IDs
  • In-memory rollout ID tracking (_saved_rollout_ids) shared by both Megatron and HF cleanup
  • should_run_cleanup() pure function for rank selection logic, testable without distributed dependencies
  • _maybe_cleanup_old_checkpoints() helper in actor.py orchestrates cleanup:
    • Megatron: global rank 0 (shared) or local rank 0 per node (local)
    • HF: global rank 0 only (save_hf_pretrained always writes from global rank 0)
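
A rough sketch of how the pieces in this design could fit together. The function names come from the description above, but the signatures, the `path_fn` directory layouts, and the choice to swallow `OSError` are assumptions made for illustration; the actual code in `slime/utils/checkpoint_utils.py` may differ.

```python
import shutil
from pathlib import Path


def should_run_cleanup(storage_type, global_rank, local_rank):
    """Decide whether this rank deletes old checkpoints (assumed signature)."""
    if storage_type == "shared":
        # One filesystem visible to every node: a single cleaner is enough.
        return global_rank == 0
    if storage_type == "local":
        # Each node writes to its own disk: one cleaner per node.
        return local_rank == 0
    raise ValueError(f"unknown storage type: {storage_type!r}")


def cleanup_old_checkpoints(saved_rollout_ids, keep, path_fn):
    """Delete all but the newest `keep` tracked checkpoints.

    `path_fn` maps a rollout ID to its directory, so different checkpoint
    layouts can share this function. Returns the IDs that remain tracked.
    """
    ordered = sorted(saved_rollout_ids)
    stale = ordered[: max(len(ordered) - keep, 0)]
    for rollout_id in stale:
        try:
            shutil.rmtree(path_fn(rollout_id))
        except OSError:
            # Assumption: a failed delete is skipped instead of aborting training.
            pass
    return ordered[len(stale):]


def megatron_path(root):
    # Hypothetical layout, for illustration only.
    return lambda rollout_id: Path(root) / f"iter_{rollout_id:07d}"


def hf_path(root):
    # Hypothetical layout, for illustration only.
    return lambda rollout_id: Path(root) / f"hf_{rollout_id}"
```

Keeping the rank decision in a pure function means it can be tested with plain integers, without initializing a process group. For the HF path, the caller would simply gate on global rank 0, since that is the only rank that writes.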

Test plan

  • 26 unit tests covering core function, Megatron-specific, HF-specific, and rank selection behavior
  • Edge cases: keep=0, keep=1, exact limit, rmtree failure resilience, previous run isolation
  • Rank selection: shared vs local storage type, global vs local rank combinations
  • E2E verified with --save-interval 1 --max-actor-ckpt-to-keep 2 --num-rollout 5 on Qwen3.5-122B

@stevewx stevewx force-pushed the feature/max-ckpt-to-keep branch from 3812101 to 66016b1 on April 2, 2026 23:15
…eckpoints

During long training runs, checkpoints can exhaust disk space. This adds
support for keeping only the N most recent checkpoints on disk, deleting
older ones automatically before each new save.

Supports both Megatron and HF checkpoint formats, shared and local
storage types.
@stevewx stevewx force-pushed the feature/max-ckpt-to-keep branch from 66016b1 to a08eb42 on April 2, 2026 23:48
@stevewx stevewx marked this pull request as ready for review April 3, 2026 01:14