feat: Onboard perf recipes in tests #1322
base: main
@@ -0,0 +1,59 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32
  max_num_steps: 500
  val_batch_size: 5
  max_val_samples: 16
loss_fn:
  use_importance_sampling_correction: true
checkpointing:
  checkpoint_dir: results/grpo-deepseek-v3-32n8g
policy:
  model_name: unsloth/DeepSeek-V3-0324-BF16
  tokenizer:
    name: unsloth/DeepSeek-V3-0324-BF16
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 1536
  make_sequence_length_divisible_by: 1
  dtensor_cfg:
    enabled: false
  megatron_cfg:
Review comment: @guyueh1, did you decide to remove TP=8? So, DSV3 is able to fit only with EP16/PP16?
    enabled: true
    empty_unused_memory_level: 1
    converter_type: LlamaForCausalLM
    pipeline_model_parallel_size: 16
    expert_model_parallel_size: 16
    activation_checkpointing: true
    num_layers_in_first_pipeline_stage: 3
    num_layers_in_last_pipeline_stage: 2
    apply_rope_fusion: false
    moe_permute_fusion: true
    defer_fp32_logits: true
    optimizer:
      lr: 5.0e-07
      min_lr: 5.0e-08
      weight_decay: 0.0
      use_precision_aware_optimizer: true
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 5.0e-08
    fp8_cfg:
      enabled: false
  generation:
    stop_token_ids:
      - 128009
    vllm_cfg:
      tensor_parallel_size: 32
      async_engine: true
logger:
  log_dir: logs/grpo-deepseek-v3-32n8g
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-deepseek-v3-32n8g
cluster:
  gpus_per_node: 8
  num_nodes: 32
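On the parallelism question above: a quick back-of-the-envelope check of how this recipe maps onto the cluster. This is a sketch of the arithmetic only, assuming tensor and context parallelism stay at their defaults of 1 (the recipe does not override them); it does not reproduce Megatron-Core's full validation rules.

```python
# Rough layout check for the DeepSeek-V3 32n8g recipe above.
# Assumption: TP and CP default to 1 because the recipe does not set them.
gpus_per_node = 8
num_nodes = 32
world_size = gpus_per_node * num_nodes        # 256 training GPUs

tp, pp, cp, ep = 1, 16, 1, 16                 # values mirrored from megatron_cfg

gpus_per_replica = tp * pp * cp               # 16 GPUs hold one model replica
assert world_size % gpus_per_replica == 0
dp = world_size // gpus_per_replica           # 16 data-parallel replicas

# Each pipeline stage spans world_size // pp ranks; with EP=16 the MoE experts
# are sharded across all 16 ranks of a stage (exact grouping rules depend on
# the Megatron-Core version).
ranks_per_stage = world_size // pp
assert ep <= ranks_per_stage and ranks_per_stage % ep == 0

print(f"world={world_size}, gpus/replica={gpus_per_replica}, DP={dp}, EP={ep}")
```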
@@ -0,0 +1,54 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32
  max_num_steps: 500
loss_fn:
  use_importance_sampling_correction: true
checkpointing:
  checkpoint_dir: results/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
Review comment: clarify the naming inconsistency of "fp8-e2e" in paths while FP8 is disabled. The checkpoint directory, log directory, and wandb run name all reference `megatron-fp8-e2e`, which suggests FP8 end-to-end testing, yet the configuration disables FP8 with `fp8_cfg.enabled: false`. Either enable FP8 or rename the outputs to reflect the actual configuration (e.g., `megatron-nofp8` or similar).
policy:
  model_name: meta-llama/Llama-3.1-8B-Instruct
  tokenizer:
    name: meta-llama/Llama-3.1-8B-Instruct
  train_micro_batch_size: 1
  logprob_batch_size: 2
  max_total_sequence_length: 4096
  make_sequence_length_divisible_by: 1
  dtensor_cfg:
    enabled: false
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    converter_type: LlamaForCausalLM
    pipeline_model_parallel_size: 2
    activation_checkpointing: true
    defer_fp32_logits: true
    optimizer:
      lr: 5.0e-07
      min_lr: 5.0e-08
      weight_decay: 0.0
      use_precision_aware_optimizer: true
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 5.0e-08
    fp8_cfg:
      enabled: false
  generation:
    max_new_tokens: 4096
    stop_token_ids:
      - 128009
    vllm_cfg:
      max_model_len: 4096
data:
  max_input_seq_length: 4096
logger:
  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
cluster:
  gpus_per_node: 8
  num_nodes: 1
@@ -0,0 +1,54 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32
  max_num_steps: 500
loss_fn:
  use_importance_sampling_correction: true
checkpointing:
  checkpoint_dir: results/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
Review comment: update the checkpoint directory, log directory, and wandb name to match the 2n8g configuration. This file is configured for 2 nodes (cluster.num_nodes: 2), yet all output paths reference "1n8g". This is a copy-paste artifact from the 1n8g recipe and leaves the output naming out of sync with the cluster configuration. Update three locations:
- checkpoint_dir: results/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+ checkpoint_dir: results/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e
- log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+ log_dir: logs/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e
- name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+ name: grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e
policy:
  model_name: meta-llama/Llama-3.1-8B-Instruct
  tokenizer:
    name: meta-llama/Llama-3.1-8B-Instruct
  train_micro_batch_size: 1
  logprob_batch_size: 2
  max_total_sequence_length: 4096
  make_sequence_length_divisible_by: 1
  dtensor_cfg:
    enabled: false
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    converter_type: LlamaForCausalLM
    pipeline_model_parallel_size: 2
    activation_checkpointing: true
    defer_fp32_logits: true
    optimizer:
      lr: 5.0e-07
      min_lr: 5.0e-08
      weight_decay: 0.0
      use_precision_aware_optimizer: true
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 5.0e-08
    fp8_cfg:
      enabled: false
  generation:
    max_new_tokens: 4096
    stop_token_ids:
      - 128009
    vllm_cfg:
      max_model_len: 4096
data:
  max_input_seq_length: 4096
logger:
  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
Review comment (on lines +46 to +51): the logging directory and wandb run name also reference 1n8g while the cluster is configured for 2 nodes. Update them together with the checkpoint directory noted above:
- log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+ log_dir: logs/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e
- name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+ name: grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e
cluster:
  gpus_per_node: 8
  num_nodes: 2
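To keep recipes like this from drifting out of sync with their output names, a small check along the lines below could be run over the recipe files. This is a minimal sketch, not part of the PR: it assumes each recipe overrides `checkpointing.checkpoint_dir`, `logger.log_dir`, and `logger.wandb.name` (as the files in this PR do), that PyYAML is available, and the invocation is illustrative only.

```python
# Verify that a recipe's output paths embed the same "<nodes>n<gpus>g" tag as
# its cluster block, catching copy-paste artifacts like the 1n8g/2n8g mismatch.
import sys
import yaml  # PyYAML

def check_recipe(path: str) -> list[str]:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    tag = f"{cfg['cluster']['num_nodes']}n{cfg['cluster']['gpus_per_node']}g"
    names = {
        "checkpointing.checkpoint_dir": cfg["checkpointing"]["checkpoint_dir"],
        "logger.log_dir": cfg["logger"]["log_dir"],
        "logger.wandb.name": cfg["logger"]["wandb"]["name"],
    }
    return [f"{key} = {value!r} does not contain '{tag}'"
            for key, value in names.items() if tag not in value]

if __name__ == "__main__":
    # Example usage: python check_recipe_names.py path/to/recipe.yaml
    for problem in check_recipe(sys.argv[1]):
        print(problem)
```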
@@ -0,0 +1,42 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32
checkpointing:
  enabled: false
  checkpoint_dir: results/grpo-llama3.3-70b-instruct-4n8g-16k
policy:
  model_name: meta-llama/Llama-3.3-70B-Instruct
  train_micro_batch_size: 1
  max_total_sequence_length: 16384
  dtensor_cfg:
    enabled: false
  optimizer: null
  scheduler: null
  make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    activation_checkpointing: true
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 8
    sequence_parallel: true
    optimizer:
      lr: 3.0e-07
      min_lr: 3.0e-08
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 3.0e-08
  generation:
    vllm_cfg:
      tensor_parallel_size: 4
logger:
  log_dir: logs/grpo-llama3.3-70b-instruct-4n8g-16k
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-llama3.3-70b-instruct-4n8g-16k
cluster:
  gpus_per_node: 8
  num_nodes: 4
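The `make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}` line in the two 70B recipes relies on config interpolation, so the divisibility constraint tracks the TP size automatically. The sketch below shows how that kind of reference resolves, assuming an OmegaConf-style interpolation layer as the `${...}` syntax suggests (the config is trimmed to the relevant keys and is not the full recipe).

```python
# Sketch of how the ${...} reference resolves under OmegaConf-style
# interpolation; the surrounding framework may layer more on top of this.
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "policy": {
        "megatron_cfg": {"tensor_model_parallel_size": 4},
        "make_sequence_length_divisible_by":
            "${policy.megatron_cfg.tensor_model_parallel_size}",
    }
})

# Interpolations resolve at access time, so overriding TP later (for example
# via a command-line override) also updates the divisibility constraint.
print(cfg.policy.make_sequence_length_divisible_by)  # -> 4

cfg.policy.megatron_cfg.tensor_model_parallel_size = 8
print(cfg.policy.make_sequence_length_divisible_by)  # -> 8
```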
@@ -0,0 +1,41 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32
checkpointing:
  enabled: false
  checkpoint_dir: results/grpo-llama3.3-70b-instruct-4n8g
policy:
  model_name: meta-llama/Llama-3.3-70B-Instruct
  train_micro_batch_size: 1
  max_total_sequence_length: 4096
  dtensor_cfg:
    enabled: false
  optimizer: null
  scheduler: null
  make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 8
    sequence_parallel: true
    optimizer:
      lr: 3.0e-07
      min_lr: 3.0e-08
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 3.0e-08
  generation:
    vllm_cfg:
      tensor_parallel_size: 4
logger:
  log_dir: logs/grpo-llama3.3-70b-instruct-4n8g
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-llama3.3-70b-instruct-4n8g
cluster:
  gpus_per_node: 8
  num_nodes: 4
@@ -0,0 +1,61 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 16
  num_generations_per_prompt: 32
  max_num_steps: 500
  val_batch_size: 5
  max_val_samples: 16
loss_fn:
  use_importance_sampling_correction: true
checkpointing:
  checkpoint_dir: results/grpo-qwen3-235b-16n8g
policy:
  model_name: Qwen/Qwen3-235B-A22B
  tokenizer:
    name: Qwen/Qwen3-235B-A22B
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 8192
  make_sequence_length_divisible_by: 1
  dtensor_cfg:
    enabled: false
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    converter_type: LlamaForCausalLM
    tensor_model_parallel_size: 2
    sequence_parallel: true
    pipeline_model_parallel_size: 8
    context_parallel_size: 2
    expert_model_parallel_size: 16
    activation_checkpointing: true
    num_layers_in_first_pipeline_stage: 11
    num_layers_in_last_pipeline_stage: 11
    moe_permute_fusion: true
    defer_fp32_logits: true
    optimizer:
      lr: 5.0e-07
      min_lr: 5.0e-08
      weight_decay: 0.0
      use_precision_aware_optimizer: true
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 5.0e-08
    fp8_cfg:
      enabled: false
  generation:
    stop_token_ids:
      - 128009
    vllm_cfg:
      tensor_parallel_size: 16
      async_engine: true
logger:
  log_dir: logs/grpo-qwen3-235b-16n8g
  wandb_enabled: true
  tensorboard_enabled: false # to avoid a bug
  wandb:
    project: nemo-rl
    name: grpo-qwen3-235b-16n8g
cluster:
  gpus_per_node: 8
  num_nodes: 16
@@ -0,0 +1,61 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 16
  num_generations_per_prompt: 32
  max_num_steps: 500
  val_batch_size: 5
  max_val_samples: 16
loss_fn:
  use_importance_sampling_correction: true
checkpointing:
  checkpoint_dir: results/grpo-qwen3-235b-16n8g
Review comment: fix the checkpoint directory naming mismatch. The checkpoint directory references 16n8g, but this configuration targets 32 nodes (cluster.num_nodes: 32). Apply:
- checkpoint_dir: results/grpo-qwen3-235b-16n8g
+ checkpoint_dir: results/grpo-qwen3-235b-32n8g
policy:
  model_name: Qwen/Qwen3-235B-A22B
  tokenizer:
    name: Qwen/Qwen3-235B-A22B
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 8192
  make_sequence_length_divisible_by: 1
  dtensor_cfg:
    enabled: false
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    converter_type: LlamaForCausalLM
    tensor_model_parallel_size: 2
    sequence_parallel: true
    pipeline_model_parallel_size: 8
    context_parallel_size: 2
    expert_model_parallel_size: 16
    activation_checkpointing: true
    num_layers_in_first_pipeline_stage: 11
    num_layers_in_last_pipeline_stage: 11
    moe_permute_fusion: true
    defer_fp32_logits: true
    optimizer:
      lr: 5.0e-07
      min_lr: 5.0e-08
      weight_decay: 0.0
      use_precision_aware_optimizer: true
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 5.0e-08
    fp8_cfg:
      enabled: false
  generation:
    stop_token_ids:
      - 128009
    vllm_cfg:
      tensor_parallel_size: 16
      async_engine: true
logger:
  log_dir: logs/grpo-qwen3-235b-16n8g
  wandb_enabled: true
  tensorboard_enabled: false # to avoid a bug
  wandb:
    project: nemo-rl
    name: grpo-qwen3-235b-16n8g
Review comment (on lines +53 to +58): fix the logging directory and wandb naming mismatch as well; both reference 16n8g while the cluster uses 32 nodes. Apply:
- log_dir: logs/grpo-qwen3-235b-16n8g
+ log_dir: logs/grpo-qwen3-235b-32n8g
- name: grpo-qwen3-235b-16n8g
+ name: grpo-qwen3-235b-32n8g
cluster:
  gpus_per_node: 8
  num_nodes: 32
Review comment: regarding the rollout batch size, it is different from the spreadsheet. Could you please let me know which value to follow?
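For reference when reconciling with the spreadsheet: the rollout size implied by this recipe is the product of prompts per step and generations per prompt. The sketch below only shows that arithmetic and assumes "rollout batch size" refers to the total number of samples generated per GRPO step.

```python
# Rollout size implied by the Qwen3-235B 32n8g recipe above (arithmetic only).
# Assumption: "rollout batch size" means samples generated per GRPO step.
num_prompts_per_step = 16
num_generations_per_prompt = 32

rollout_samples_per_step = num_prompts_per_step * num_generations_per_prompt
print(rollout_samples_per_step)  # 512 prompt/response pairs per step
```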