[Qwen3] StateDictAdapter support for MoE model #1766
Conversation
…n Qwen3 hf models
Thanks for working on this again! Can you attach a screenshot of your local run after loading HF weights?
… general protocol, and format files
@shuhuayu nice PR! The loss and grad norms look a bit high though -- any idea why? Another good way to validate the implementation is to run inference and check whether you get the same output tokens as the HF implementation.
@vwxyzjn Thanks for the suggestion! I suspect the large losses and gradient norms are due to the suboptimal training configs we set for debugging. We have verified the KL divergence between a Hugging Face model and a converted torchtitan model on Qwen3 30B-A3B.
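The KL-divergence check mentioned above compares the next-token distributions produced by the two models on the same input. The exact script used in this PR isn't shown; the following is a minimal sketch of the idea in plain Python (the `softmax`/`kl_divergence` helpers and the logit values are illustrative, not from the PR):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_logits, q_logits):
    # KL(P || Q) between the next-token distributions implied by two
    # sets of logits (e.g. HF model vs. converted torchtitan model).
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits imply a KL divergence of (numerically) zero,
# which is what a correct weight conversion should produce.
print(kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

In practice one would run both models over a batch of prompts and average the per-position KL; a value near zero (up to floating-point and dtype differences) indicates the converted weights match.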
LGTM
In the future, we should consider adding unit tests for MoEStateDictAdapter.
Reused the StateDictAdapter support for the DeepSeek V3 model to implement the Qwen3 StateDictAdapter. Updated a checkpoint loading API to support distributed Hugging Face checkpoint loading when unpicklable objects exist.
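The core job of a StateDictAdapter is renaming parameters between the torchtitan layout and the Hugging Face layout; for MoE models this typically also means splitting a stacked per-expert tensor into one HF key per expert (or merging in the other direction). The actual key names and adapter interface live in the PR's code; the sketch below uses hypothetical key patterns purely to show the shape of the mapping:

```python
def split_stacked_experts(prefix, stacked, num_experts):
    """Split a stacked expert weight into per-expert entries.

    `stacked` is indexable along its first (expert) dimension, e.g. a
    tensor of shape [num_experts, ...]. The key pattern "{prefix}.{i}"
    is a hypothetical stand-in for the real HF naming convention.
    """
    return {f"{prefix}.{i}": stacked[i] for i in range(num_experts)}

def to_hf_keys(state_dict, key_map):
    # Rename torchtitan-style parameter names to HF-style names,
    # leaving unmapped keys untouched.
    return {key_map.get(k, k): v for k, v in state_dict.items()}
```

The reverse direction (HF -> torchtitan) would stack the per-expert tensors back into one parameter and invert the key map.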