feat: add checkpoint retention limit to automatically clean up old checkpoints by stevewx · Pull Request #1798 · THUDM/slime

stevewx · 2026-04-02T22:25:58Z

Summary

- Add --max-actor-ckpt-to-keep and --max-critic-ckpt-to-keep to limit the number of checkpoints kept on disk during long training runs
- Add --checkpoint-storage-type (shared/local) to control which rank performs cleanup
- Cleanup runs before each new save and retains the newest N-1 checkpoints (keep-1), so disk usage peaks at N checkpoints once the new save completes
- Supports both Megatron and HF checkpoint formats
- Only cleans up checkpoints from the current run; checkpoints from previous runs are untouched
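The cleanup-before-save ordering can be traced with a small simulation. This is a minimal sketch rather than the code in this PR: `simulate_retention` is a hypothetical helper, the rollout IDs 0 to 4 are chosen only for illustration, and the sketch assumes `keep >= 1`.

```python
from typing import List, Tuple


def simulate_retention(rollout_ids: List[int], keep: int) -> Tuple[List[int], int]:
    """Simulate cleanup-before-save; return (final checkpoints, peak count)."""
    if keep < 1:
        raise ValueError("this sketch assumes keep >= 1")
    on_disk: List[int] = []
    peak = 0
    for rollout_id in rollout_ids:
        # Cleanup first: retain only the newest keep-1 checkpoints.
        retained = keep - 1
        on_disk = on_disk[-retained:] if retained > 0 else []
        # Then write the new checkpoint, bringing the count back up to at most keep.
        on_disk.append(rollout_id)
        peak = max(peak, len(on_disk))
    return on_disk, peak


# Five saves with a limit of two checkpoints.
final, peak = simulate_retention([0, 1, 2, 3, 4], keep=2)
print(final, peak)  # [3, 4] 2
```

Because the deletion happens before the write, the count never exceeds the limit, even transiently while the new checkpoint is being saved.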

Design

- Unified cleanup_old_checkpoints() in slime/utils/checkpoint_utils.py with a path_fn callback to compose paths from rollout IDs
- In-memory rollout ID tracking (_saved_rollout_ids) shared by both Megatron and HF cleanup
- should_run_cleanup() pure function for rank selection logic, testable without distributed dependencies
- _maybe_cleanup_old_checkpoints() helper in actor.py orchestrates cleanup:
  - Megatron: global rank 0 (shared) or local rank 0 per node (local)
  - HF: global rank 0 only (save_hf_pretrained always writes from global rank 0)
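A rough sketch of how a `path_fn`-driven cleanup over tracked rollout IDs might look is given below. The signature, the return value, the `rollout_{id}` directory naming, and the handling of `keep < 1` are all assumptions made for illustration; the actual function in slime/utils/checkpoint_utils.py may differ.

```python
import logging
import os
import shutil
import tempfile
from typing import Callable, List

logger = logging.getLogger(__name__)


def cleanup_old_checkpoints(
    saved_rollout_ids: List[int],
    keep: int,
    path_fn: Callable[[int], str],
) -> List[int]:
    """Delete all but the newest keep-1 tracked checkpoints; return the IDs kept."""
    if keep < 1:
        # Assumption of this sketch: a non-positive limit disables cleanup.
        return list(saved_rollout_ids)
    cutoff = max(len(saved_rollout_ids) - (keep - 1), 0)
    to_delete, survivors = saved_rollout_ids[:cutoff], saved_rollout_ids[cutoff:]
    for rollout_id in to_delete:
        path = path_fn(rollout_id)
        try:
            shutil.rmtree(path)
        except OSError as exc:
            # A failed delete is logged and skipped so training can continue.
            logger.warning("could not remove %s: %s", path, exc)
    return survivors


# Demo: only IDs tracked in this run are touched; rollout_99 stands in for a
# checkpoint left by a previous run.
with tempfile.TemporaryDirectory() as root:
    def path_fn(rollout_id: int) -> str:
        return os.path.join(root, f"rollout_{rollout_id}")

    for rid in (0, 1, 2, 99):
        os.makedirs(path_fn(rid))
    tracked = cleanup_old_checkpoints([0, 1, 2], keep=2, path_fn=path_fn)
    print(tracked, sorted(os.listdir(root)))  # [2] ['rollout_2', 'rollout_99']
```

Driving deletion from the in-memory ID list, rather than from a directory scan, is what keeps checkpoints from earlier runs out of reach: a directory that was never tracked is never passed to `path_fn`.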

Test plan

- 26 unit tests covering core function, Megatron-specific, HF-specific, and rank selection behavior
- Edge cases: keep=0, keep=1, exact limit, rmtree failure resilience, previous run isolation
- Rank selection: shared vs local storage type, global vs local rank combinations
- E2E verified with --save-interval 1 --max-actor-ckpt-to-keep 2 --num-rollout 5 on Qwen3.5-122B
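The rank-selection cases lend themselves to a table-driven check, since the decision depends only on three plain values. The sketch below assumes a `should_run_cleanup(storage_type, global_rank, local_rank)` signature, which is a guess based on the description above and not taken from the PR's code; the 8-GPUs-per-node layout in the comments is likewise illustrative.

```python
def should_run_cleanup(storage_type: str, global_rank: int, local_rank: int) -> bool:
    """Decide whether this rank deletes old checkpoints (hypothetical signature)."""
    if storage_type == "shared":
        # One filesystem visible to every node: a single global cleaner suffices.
        return global_rank == 0
    if storage_type == "local":
        # Each node has its own disk: one cleaner per node.
        return local_rank == 0
    raise ValueError(f"unknown storage type: {storage_type!r}")


# (storage_type, global_rank, local_rank, expected)
CASES = [
    ("shared", 0, 0, True),
    ("shared", 8, 0, False),  # first GPU on a second node
    ("shared", 1, 1, False),
    ("local", 0, 0, True),
    ("local", 8, 0, True),    # first GPU on a second node cleans its own disk
    ("local", 1, 1, False),
]

for storage_type, global_rank, local_rank, expected in CASES:
    assert should_run_cleanup(storage_type, global_rank, local_rank) is expected
```

Since the function takes no process-group handle, cases like these run without initializing torch.distributed.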

…eckpoints

During long training runs, checkpoints can exhaust disk space. This adds support for keeping only the N most recent checkpoints on disk, deleting older ones automatically before each new save. It supports both Megatron and HF checkpoint formats, and both shared and local storage types.

stevewx force-pushed the feature/max-ckpt-to-keep branch from 3812101 to 66016b1 Compare April 2, 2026 23:15

stevewx force-pushed the feature/max-ckpt-to-keep branch from 66016b1 to a08eb42 Compare April 2, 2026 23:48

Merge branch 'main' into feature/max-ckpt-to-keep

7a493cf

stevewx marked this pull request as ready for review April 3, 2026 01:14

Merge branch 'main' into feature/max-ckpt-to-keep

a2ff542

stevewx wants to merge 3 commits into THUDM:main from stevewx:feature/max-ckpt-to-keep
