[megatron] fix: add megatron checkpoint patch #5251
Conversation
Code Review
This pull request introduces a monkey patch to address a GPU memory leak in Megatron's checkpointing mechanism when used with Mixture-of-Experts (MoE) models. The patch manually releases memory for input tensors and their gradients within the CheckpointFunction.backward method. The approach is sound for fixing the leak. However, the patch is applied unconditionally, which could pose a risk to non-MoE models. I've suggested making its application conditional based on whether an MoE model is in use.
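As a rough, self-contained illustration of the technique described above (a simplified sketch, not the actual verl or Megatron-core code, which per the summary targets Megatron's CheckpointFunction.backward): an activation-checkpoint autograd Function can drop its references to the recomputed inputs and their gradient buffers at the end of backward so they are not kept alive in GPU memory.

```python
import torch

# Simplified illustration of the technique (not the actual verl/Megatron code):
# recompute the forward pass inside backward, collect the input gradients, then
# explicitly drop references to the recomputed inputs and their .grad buffers so
# they do not linger in GPU memory after backward returns.
class IllustrativeCheckpoint(torch.autograd.Function):
    @staticmethod
    def forward(ctx, run_function, *args):
        ctx.run_function = run_function
        ctx.save_for_backward(*args)
        with torch.no_grad():
            return run_function(*args)

    @staticmethod
    def backward(ctx, *grad_outputs):
        # Recompute the forward pass with gradients enabled.
        inputs = [x.detach().requires_grad_(x.requires_grad) for x in ctx.saved_tensors]
        with torch.enable_grad():
            outputs = ctx.run_function(*inputs)
        if torch.is_tensor(outputs):
            outputs = (outputs,)
        torch.autograd.backward(outputs, grad_outputs)
        grads = tuple(inp.grad for inp in inputs)
        # The cleanup step: release the recomputed inputs and their grad buffers.
        for inp in inputs:
            inp.grad = None
        del inputs
        return (None,) + grads

# Usage: out = IllustrativeCheckpoint.apply(layer_forward_fn, hidden_states)
```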
# Apply checkpoint patch for MoE models
from verl.models.mcore.patch import apply_patch_checkpoint
apply_patch_checkpoint()
The comment "Apply checkpoint patch for MoE models" indicates this patch is specific to Mixture-of-Experts models. However, it's being applied unconditionally. This could introduce risks or unintended side effects for non-MoE models that also use checkpointing. It would be safer to apply this patch conditionally, only when an MoE model is detected. You can check for this using self.engine_config.expert_model_parallel_size > 1.
Suggested change:

# Apply checkpoint patch for MoE models
if self.engine_config.expert_model_parallel_size > 1:
    from verl.models.mcore.patch import apply_patch_checkpoint
    apply_patch_checkpoint()
Megatron PR link: NVIDIA/Megatron-LM#3267
What does this PR do?
GPU memory is not released during RL or SFT training of MoE models (e.g., Qwen3-30B-A3B / Qwen3-VL-30B-A3B) with the image verlai/verl:vllm012.dev3 when activation checkpointing is enabled (recompute_method=uniform, recompute_granularity=full, recompute_num_layers=1). Key observations:
Checklist Before Starting
- Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
  - {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward; if this PR involves multiple modules, separate them with commas, like [megatron, fsdp, doc]
  - {type} is in feat, fix, refactor, chore, test
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title, e.g., [BREAKING][fsdp, megatron] feat: dynamic batching

Test
API and Usage Example
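A minimal sketch of how the patch could be applied, based on the diff discussed above; the MoE guard on engine_config.expert_model_parallel_size follows the reviewer's suggestion rather than the PR's current unconditional call, and the surrounding setup function is hypothetical.

```python
# Sketch (hypothetical call site): apply the checkpoint monkey patch once,
# before the Megatron model is built. The guard follows the reviewer's
# suggestion to limit the patch to MoE configurations; the PR as written
# applies it unconditionally.
from verl.models.mcore.patch import apply_patch_checkpoint


def setup_model(engine_config):  # hypothetical helper for illustration
    if engine_config.expert_model_parallel_size > 1:  # MoE with expert parallelism
        apply_patch_checkpoint()
    # ... build the Megatron model as usual ...
```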
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR touches the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.