34 commits
36db136
automodel on latest main
adil-a Oct 28, 2025
8e562e8
new automodel checkpointing
adil-a Oct 29, 2025
5a1cff1
adding automodel sharding
adil-a Oct 29, 2025
acff747
adding moe init
adil-a Oct 29, 2025
0cbc3ac
fix
adil-a Oct 29, 2025
3336fe9
removing legacy checkpointing utils
adil-a Oct 29, 2025
19d29aa
linting
adil-a Oct 29, 2025
dcb4cb2
adding moe check
adil-a Oct 29, 2025
738338f
automodel
adil-a Oct 29, 2025
62acdfc
latest automodel bump
adil-a Oct 29, 2025
b6a3fdd
changes
adil-a Oct 30, 2025
2b86310
cfg
adil-a Oct 30, 2025
dd634b6
eof fix
adil-a Oct 31, 2025
2f74d79
feat: automodel moe integration
hemildesai Nov 4, 2025
1163407
bump
Nov 5, 2025
d270a5b
adding torch arch list for grouped gemm install
adil-a Nov 5, 2025
d038aca
linting
adil-a Nov 5, 2025
e936ebf
main merge
adil-a Nov 5, 2025
7df0cc5
uv lock
adil-a Nov 5, 2025
b4139f1
fix
adil-a Nov 5, 2025
a55a2f1
wandb yaml fix
adil-a Nov 5, 2025
39bd74c
minimizing yaml
adil-a Nov 5, 2025
4e151cb
clean up
adil-a Nov 5, 2025
4b6ce6d
dtype map
adil-a Nov 5, 2025
ef2f92c
lint
adil-a Nov 5, 2025
1eef903
removing unit test
adil-a Nov 5, 2025
24214e9
adding fixes from unit tests
Nov 6, 2025
2ed872a
merging main
adil-a Nov 25, 2025
5489b21
bumping automodel + v2 fixes
adil-a Nov 25, 2025
ed69abd
pre-commit
adil-a Nov 25, 2025
b754c7c
ckpt fix
adil-a Nov 26, 2025
3877e79
pre commit
adil-a Nov 26, 2025
661b596
Sync Automodel submodule to origin/main
adil-a Nov 26, 2025
d89180c
removing RL specific changes for future PR
adil-a Nov 26, 2025
2 changes: 1 addition & 1 deletion 3rdparty/Automodel-workspace/Automodel
Submodule Automodel updated 433 files
@@ -0,0 +1,29 @@
defaults: ../../sft.yaml
Contributor:

Can you add the nightly test for this? You can refer to tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh.

policy:
model_name: openai/gpt-oss-20b
Contributor:

I believe you have some plots for the convergence of gpt-oss; can you paste them into the PR so that others can see this recipe's results?

Also, have you tested other models (e.g., Llama, Qwen) with this PR to make sure it won't affect them? There are a lot of changes in the dtensor v2 worker.

train_global_batch_size: 128
train_micro_batch_size: 8
max_total_sequence_length: 512
dequantize_base_checkpoint: true
automodel_model_kwargs:
backend:
_target_: nemo_automodel.components.moe.utils.BackendConfig
attn: te
linear: te
rms_norm: te
enable_deepep: true
fake_balanced_gate: false
enable_hf_state_dict_adapter: true
dtensor_cfg:
_v2: true
expert_parallel_size: 8
data_parallel_size: 8
optimizer:
name: transformer_engine.pytorch.optimizers.fused_adam.FusedAdam
kwargs:
store_param_remainders: true
master_weights: true
exp_avg_dtype: bfloat16
exp_avg_sq_dtype: bfloat16
checkpointing:
checkpoint_dir: results/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel
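The `backend:` and `optimizer:` entries above use Hydra/OmegaConf-style `_target_` keys, where a dotted import path names the class to construct and the sibling keys become its keyword arguments. The sketch below illustrates that resolution pattern; the helper name `instantiate_target` and the use of `collections.Counter` as a stand-in target are illustrative assumptions, not part of the actual nemo_automodel or NeMo RL code.

```python
import importlib


def instantiate_target(cfg: dict):
    """Import the dotted path in cfg['_target_'] and call it with the
    remaining keys as keyword arguments (hypothetical helper)."""
    module_path, _, attr_name = cfg["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), attr_name)
    kwargs = {k: v for k, v in cfg.items() if k != "_target_"}
    return cls(**kwargs)


# Stand-in config mirroring the shape of the `backend:` block above,
# but pointing at a stdlib class so the sketch runs anywhere.
cfg = {"_target_": "collections.Counter", "spam": 2, "eggs": 1}
counter = instantiate_target(cfg)
print(counter["spam"])  # → 2
```

In the real config, the same mechanism would resolve `nemo_automodel.components.moe.utils.BackendConfig` and the `transformer_engine` FusedAdam optimizer, passing the listed keys (`attn`, `enable_deepep`, `exp_avg_dtype`, etc.) as constructor arguments.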