[megatron, vllm] feat: NVFP4 (W4A16) QAT training support via ModelOpt #5254
base: main
Conversation
root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have already signed the CLA but the status is still pending? Let us recheck it.
Code Review
This pull request introduces support for NVFP4 Quantization-Aware Training (QAT) by integrating with NVIDIA's ModelOpt. The changes are comprehensive, touching upon configuration, the Megatron training worker, and adding utility modules for QAT processing and vLLM patching. The implementation appears well-thought-out, especially in handling the complexities of distributed training environments (TP, PP, EP). My main feedback is on a piece of duplicated code that should be refactored to improve maintainability. Overall, this is a solid feature addition.
```python
def _create_param_from_subclass_attributes(custom_data, custom_weight):
    param = Parameter(custom_data, requires_grad=False)
    base_param_dir = dir(torch.nn.Parameter)
    custom_weight_dir = dir(custom_weight)
    # Find the attributes that are unique to the custom parameter
    custom_attributes = [
        attr for attr in custom_weight_dir if attr not in base_param_dir and not attr.startswith("__")
    ]
    # Set the custom attributes into the base parameter object
    for attr in custom_attributes:
        setattr(param, attr, getattr(custom_weight, attr))
    return param
```
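The helper above copies subclass-only attributes onto a plain `Parameter`. The same attribute-copying pattern can be illustrated without torch; the class and attribute names below (`Base`, `Custom`, `weight_scale`, `block_size`) are hypothetical stand-ins, not verl code:

```python
# Illustration of the attribute-copying pattern: transfer attributes that
# exist on a subclass instance but not on the base class onto a plain
# base-class object. All names here are hypothetical.
class Base:
    pass

class Custom(Base):
    def __init__(self):
        self.weight_scale = 0.5   # attributes unique to the subclass
        self.block_size = 16

def copy_subclass_attributes(target, source, base_cls):
    base_dir = dir(base_cls)
    for attr in dir(source):
        # Skip dunders and anything the base class already defines
        if attr not in base_dir and not attr.startswith("__"):
            setattr(target, attr, getattr(source, attr))
    return target

plain = copy_subclass_attributes(Base(), Custom(), Base)
# `plain` now carries weight_scale and block_size like the Custom instance
```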
This helper function `_create_param_from_subclass_attributes` is a duplicate of the one defined at the top level of this file (line 107). To avoid code duplication and improve maintainability, please remove this inner function and use the top-level one instead. The other function in this file, `process_weights_after_loading_moe`, already uses the top-level helper.
```yaml
quantization: null

# Whether to enable Quantization-Aware Training (QAT). Default False.
enable_qat: False
```
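As a rough sketch of how the worker might gate on this flag (this is not verl's actual code; the function name and return values are hypothetical):

```python
# Hypothetical sketch: deciding whether to apply QAT from the config flag
# shown above. In the real flow, the "quantize" branch would be where
# ModelOpt quantization is applied to the model before training.
def qat_enabled_action(config: dict) -> str:
    if config.get("enable_qat", False):
        return "quantize"
    return "skip"
```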
Better to use the same config keys as in FSDP to avoid confusing users.
What does this PR do?
This PR adds support for NVFP4 (W4A16) Quantization-Aware Training (QAT) in verl's Megatron training pipeline, with quantized weight transfer to the vLLM rollout engine for inference. The implementation leverages NVIDIA ModelOpt for quantization.
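As background, W4A16 QAT keeps the master weights in high precision while the forward pass sees quantize-dequantize ("fake quant") values. A minimal pure-Python sketch of fake quantization onto the FP4 (E2M1) value grid that NVFP4 builds on — per-tensor scaling here for simplicity, whereas NVFP4 itself uses block scaling; this is an illustration, not verl/ModelOpt code:

```python
# Positive representable magnitudes of FP4 E2M1 (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_fp4(weights):
    """Quantize-dequantize a list of floats onto a scaled FP4 grid.

    Per-tensor scale for simplicity; real NVFP4 scales per block.
    """
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / 6.0  # map the largest magnitude to the FP4 max (6)

    def snap(w):
        mag = abs(w) / scale
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        return (q if w >= 0 else -q) * scale

    return [snap(w) for w in weights]
```

During QAT the backward pass typically treats this quantize-dequantize step as the identity (straight-through estimator), so gradients flow to the full-precision master weights.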
Checklist Before Starting
- The PR title is in the format `[{modules}] {type}: {description}` (this will be checked by the CI).
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`.
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
- If the change is breaking, prepend `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
This feature has been validated through end-to-end QAT training experiments:
API and Usage Example
Enable NVFP4 QAT by adding the following to the actor Megatron config:
The training loop automatically:
- `ModelOptNvFp4LinearMethod`/`ModelOptNvFp4FusedMoE` for inference

No changes to the training script are required beyond the configuration above.
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR modifies the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.