# feat: DTensorPolicyV2 GPT-OSS support #1470
Base branch: `main`
The diff adds a new recipe config (`@@ -0,0 +1,29 @@`):

```yaml
defaults: ../../sft.yaml
policy:
  model_name: openai/gpt-oss-20b

  train_global_batch_size: 128
  train_micro_batch_size: 8
  max_total_sequence_length: 512
  dequantize_base_checkpoint: true
  automodel_model_kwargs:
    backend:
      _target_: nemo_automodel.components.moe.utils.BackendConfig
      attn: te
      linear: te
      rms_norm: te
      enable_deepep: true
      fake_balanced_gate: false
      enable_hf_state_dict_adapter: true
  dtensor_cfg:
    _v2: true
    expert_parallel_size: 8
    data_parallel_size: 8
  optimizer:
    name: transformer_engine.pytorch.optimizers.fused_adam.FusedAdam
    kwargs:
      store_param_remainders: true
      master_weights: true
      exp_avg_dtype: bfloat16
      exp_avg_sq_dtype: bfloat16
checkpointing:
  checkpoint_dir: results/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel
```

**Contributor** commented on `model_name: openai/gpt-oss-20b`:

> I believe you have some plots for the convergence of gpt-oss; can you paste them into the PR so that others can see this recipe's results? Also, have you tested other models (e.g., llama, qwen) with this PR to make sure it won't affect them? There are a lot of changes in the DTensor v2 worker.
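The `optimizer` block above names the optimizer class by its dotted import path (`transformer_engine.pytorch.optimizers.fused_adam.FusedAdam`) and passes constructor arguments via `kwargs`. A minimal sketch of how such a spec can be resolved, assuming a `build_from_spec` helper that is hypothetical here (NeMo-RL's actual loader may differ):

```python
import importlib


def build_from_spec(name: str, kwargs: dict):
    """Resolve a dotted path like 'pkg.mod.Class' and instantiate it.

    Hypothetical helper illustrating the config pattern; not the
    framework's real loader.
    """
    module_path, _, attr = name.rpartition(".")
    cls = getattr(importlib.import_module(module_path), attr)
    return cls(**kwargs)


# Demo with a stdlib class so the sketch runs anywhere; the YAML above
# would resolve transformer_engine's FusedAdam instead.
counter = build_from_spec("collections.Counter", {"a": 2, "b": 1})
print(counter["a"])  # → 2
```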
**Review comment** on `checkpoint_dir`:

> Can you add a nightly test for this? You can refer to `tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh`.
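A nightly wrapper along the lines of the referenced script might look like the sketch below. The launcher entry point, override syntax, and step count are all assumptions for illustration; the script only echoes the command, so it is safe to run without GPUs or the repo checked out:

```shell
#!/usr/bin/env bash
# Hedged sketch of a nightly-test wrapper, loosely modeled on the idea of
# tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh (actual contents may differ).
set -euo pipefail

# Recipe name matches the checkpoint_dir in the YAML above.
CONFIG_NAME="sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel"
MAX_STEPS="${MAX_STEPS:-10}"  # keep the nightly run short

# Assumed launcher and override syntax -- echoed instead of executed.
CMD="uv run examples/run_sft.py --config ${CONFIG_NAME}.yaml sft.max_num_steps=${MAX_STEPS}"
echo "${CMD}"
```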