@@ -0,0 +1,59 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
Regarding the rollout batch size: it differs from the spreadsheet. Could you let me know which one to follow?
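For reference, a minimal sketch of how the effective rollout batch size falls out of these two settings, assuming the usual GRPO convention that each prompt is expanded into `num_generations_per_prompt` rollouts (the 2048 figure below is derived from this config, not taken from the spreadsheet):

```yaml
grpo:
  num_prompts_per_step: 64        # prompts drawn from the dataset each step
  num_generations_per_prompt: 32  # rollouts sampled per prompt
  # assumed effective rollout batch per step: 64 * 32 = 2048 sequences
```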

  num_generations_per_prompt: 32
  max_num_steps: 500
  val_batch_size: 5
  max_val_samples: 16
loss_fn:
  use_importance_sampling_correction: true
checkpointing:
  checkpoint_dir: results/grpo-deepseek-v3-32n8g
policy:
  model_name: unsloth/DeepSeek-V3-0324-BF16
  tokenizer:
    name: unsloth/DeepSeek-V3-0324-BF16
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 1536
  make_sequence_length_divisible_by: 1
  dtensor_cfg:
    enabled: false
  megatron_cfg:
@guyueh1, did you decide to remove TP=8? So DSV3 only fits with EP16/PP16?
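For context, a rough sketch of the GPU budget this recipe implies, assuming a standard Megatron-style layout (TP and CP default to 1 when unset, and expert parallelism shards the MoE experts across the remaining ranks); this is an interpretation of the config, not something stated in the PR:

```yaml
cluster:
  gpus_per_node: 8
  num_nodes: 32                       # 256 GPUs in total
policy:
  megatron_cfg:
    pipeline_model_parallel_size: 16  # with TP=1 and CP=1: 256 / 16 = 16 ranks per stage
    expert_model_parallel_size: 16    # MoE experts sharded 16-way -> the "EP16/PP16" layout
```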

    enabled: true
    empty_unused_memory_level: 1
    converter_type: LlamaForCausalLM
    pipeline_model_parallel_size: 16
    expert_model_parallel_size: 16
    activation_checkpointing: true
    num_layers_in_first_pipeline_stage: 3
    num_layers_in_last_pipeline_stage: 2
    apply_rope_fusion: false
    moe_permute_fusion: true
    defer_fp32_logits: true
    optimizer:
      lr: 5.0e-07
      min_lr: 5.0e-08
      weight_decay: 0.0
      use_precision_aware_optimizer: true
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 5.0e-08
    fp8_cfg:
      enabled: false
  generation:
    stop_token_ids:
      - 128009
    vllm_cfg:
      tensor_parallel_size: 32
      async_engine: true
logger:
  log_dir: logs/grpo-deepseek-v3-32n8g
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-deepseek-v3-32n8g
cluster:
  gpus_per_node: 8
  num_nodes: 32
@@ -0,0 +1,54 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32
  max_num_steps: 500
loss_fn:
  use_importance_sampling_correction: true
checkpointing:
  checkpoint_dir: results/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Clarify naming inconsistency: "fp8-e2e" in paths but FP8 disabled.

The checkpoint and log directories reference "megatron-fp8-e2e" (suggesting FP8 end-to-end testing), but the configuration disables FP8 with fp8_cfg.enabled: false on Line 36. Either enable FP8 or rename the directories to reflect the actual configuration.

Also applies to: 46-46, 51-51


I'll verify the review comment by examining the YAML file and confirming the inconsistency claims.


---

🏁 Script executed:

```shell
#!/bin/bash
# Inspect the YAML file to verify the naming inconsistency
cat examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-1n8g.yaml
```

Length of output: 1362


Verify naming inconsistency: "fp8-e2e" in paths but FP8 disabled.

The review comment is valid. Verification confirms:

  • Lines 9, 46, 51: checkpoint_dir, log_dir, and wandb name all contain "megatron-fp8-e2e"
  • Line 36: fp8_cfg.enabled is false

This creates misleading path names suggesting FP8 end-to-end testing when FP8 is actually disabled. The developer should either enable FP8 or rename the directories to reflect the actual configuration (e.g., "megatron-nofp8" or similar).

🤖 Prompt for AI Agents
In examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-1n8g.yaml
(notably lines 9, 36, 46, 51) the paths and WandB name include
"megatron-fp8-e2e" while fp8_cfg.enabled is false, causing a misleading
configuration; either enable FP8 by setting fp8_cfg.enabled: true if the intent
is FP8 testing, or rename all occurrences of "megatron-fp8-e2e" (checkpoint_dir,
log_dir, wandb name) to a clear non-FP8 label such as "megatron-nofp8" (apply
the change consistently across the file) so names reflect the actual
configuration.
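To make the two resolutions concrete, a hypothetical sketch of each option; the "-nofp8" suffix is taken from the suggestion above, and enabling fp8_cfg may require additional FP8 settings that are out of scope here:

```yaml
# Option A: actually enable FP8 so the existing "-fp8-e2e" names become accurate
policy:
  megatron_cfg:
    fp8_cfg:
      enabled: true

# Option B: keep FP8 disabled and rename all three outputs consistently
checkpointing:
  checkpoint_dir: results/grpo-llama3.1-8b-instruct-1n8g-megatron-nofp8
logger:
  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-nofp8
  wandb:
    name: grpo-llama3.1-8b-instruct-1n8g-megatron-nofp8
```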

policy:
  model_name: meta-llama/Llama-3.1-8B-Instruct
  tokenizer:
    name: meta-llama/Llama-3.1-8B-Instruct
  train_micro_batch_size: 1
  logprob_batch_size: 2
  max_total_sequence_length: 4096
  make_sequence_length_divisible_by: 1
  dtensor_cfg:
    enabled: false
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    converter_type: LlamaForCausalLM
    pipeline_model_parallel_size: 2
    activation_checkpointing: true
    defer_fp32_logits: true
    optimizer:
      lr: 5.0e-07
      min_lr: 5.0e-08
      weight_decay: 0.0
      use_precision_aware_optimizer: true
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 5.0e-08
    fp8_cfg:
      enabled: false
  generation:
    max_new_tokens: 4096
    stop_token_ids:
      - 128009
    vllm_cfg:
      max_model_len: 4096
data:
  max_input_seq_length: 4096
logger:
  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
cluster:
  gpus_per_node: 8
  num_nodes: 1
@@ -0,0 +1,54 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32
  max_num_steps: 500
loss_fn:
  use_importance_sampling_correction: true
checkpointing:
  checkpoint_dir: results/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify checkpoint directory naming.

The checkpoint directory references grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e, but this configuration is for a 2-node setup (2n8g). This naming inconsistency could cause confusion or conflicts.

Consider updating to match the node configuration:

-  checkpoint_dir: results/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+  checkpoint_dir: results/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e


🏁 Script executed:

#!/bin/bash
# Verify the file exists and check its content
echo "=== Checking target file ==="
if [ -f "examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml" ]; then
  echo "File found. Content around line 9:"
  sed -n '5,15p' examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
else
  echo "File not found"
fi

echo -e "\n=== Checking for related config files ==="
fd "grpo-llama3.1-8b-instruct" examples/configs/recipes/llm/performance/ --type f

echo -e "\n=== Checking for other 1n8g configs in the same directory ==="
fd "1n8g" examples/configs/recipes/llm/performance/ --type f

echo -e "\n=== Checking for 2n8g configs in the same directory ==="
fd "2n8g" examples/configs/recipes/llm/performance/ --type f

Length of output: 996


🏁 Script executed:

#!/bin/bash
# Verify the file exists and inspect relevant sections
echo "=== Checking target file ==="
if [ -f "examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml" ]; then
  echo "File found. Full content:"
  cat examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
else
  echo "File not found"
fi

echo -e "\n=== Checking for related grpo-llama3.1 config files ==="
fd "grpo-llama3.1" examples/configs/recipes/llm/performance/ --type f

echo -e "\n=== Checking for other configurations with 1n8g or 2n8g patterns ==="
fd "(1n8g|2n8g)" examples/configs/recipes/llm/ --type f | head -20

Length of output: 3367



🏁 Script executed:

#!/bin/bash
# Check the 1n8g config file to understand the pattern
echo "=== Content of grpo-llama3.1-8b-instruct-1n8g.yaml ==="
cat examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-1n8g.yaml

echo -e "\n=== Comparison: cluster config in both files ==="
echo "1n8g file cluster config:"
grep -A2 "^cluster:" examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-1n8g.yaml
echo ""
echo "2n8g file cluster config:"
grep -A2 "^cluster:" examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml

Length of output: 1605


Update checkpoint directory, log directory, and wandb name to match 2n8g configuration.

The 2n8g.yaml file is configured for 2 nodes (cluster.num_nodes: 2) but all output paths reference "1n8g" naming. This is a copy-paste artifact from the 1n8g file and causes a mismatch between cluster configuration and output naming. Update three locations:

-  checkpoint_dir: results/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+  checkpoint_dir: results/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e
-  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+  log_dir: logs/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e
-    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+    name: grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e


🤖 Prompt for AI Agents
In examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
around line 9, the checkpoint_dir (and similarly the log directory and wandb
project/name) still reference "1n8g" from the 1-node config; update these three
output paths/names to use the "2n8g" naming to match cluster.num_nodes: 2 (i.e.,
change checkpoint_dir, log_dir, and wandb name/project entries to use
results/grpo-llama3.1-8b-instruct-2n8g or equivalent 2n8g identifiers).

policy:
  model_name: meta-llama/Llama-3.1-8B-Instruct
  tokenizer:
    name: meta-llama/Llama-3.1-8B-Instruct
  train_micro_batch_size: 1
  logprob_batch_size: 2
  max_total_sequence_length: 4096
  make_sequence_length_divisible_by: 1
  dtensor_cfg:
    enabled: false
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    converter_type: LlamaForCausalLM
    pipeline_model_parallel_size: 2
    activation_checkpointing: true
    defer_fp32_logits: true
    optimizer:
      lr: 5.0e-07
      min_lr: 5.0e-08
      weight_decay: 0.0
      use_precision_aware_optimizer: true
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 5.0e-08
    fp8_cfg:
      enabled: false
  generation:
    max_new_tokens: 4096
    stop_token_ids:
      - 128009
    vllm_cfg:
      max_model_len: 4096
data:
  max_input_seq_length: 4096
logger:
  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
Comment on lines +46 to +51
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify logging directory and WandB naming.

The log directory and WandB name reference 1n8g, but this configuration is for a 2-node setup (2n8g). This naming inconsistency could cause confusion when analyzing experiments.

Consider updating to match the node configuration:

-  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+  log_dir: logs/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e
   wandb_enabled: true
   tensorboard_enabled: true
   wandb:
     project: nemo-rl
-    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+    name: grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e


---

🏁 Script executed:

```shell
#!/bin/bash
# Extract full YAML configuration to verify cluster setup
cat examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
```

Length of output: 1362


🏁 Script executed:

#!/bin/bash
# Search for similar recipe files to understand naming conventions
fd -e yaml 'grpo.*\.yaml' examples/configs/recipes/llm/performance/ | head -20

Length of output: 797


🏁 Script executed:

#!/bin/bash
# Check for any 1n8g variant of this configuration
fd -e yaml 'grpo-llama3.1.*1n8g' examples/configs/recipes/

Length of output: 248


🏁 Script executed:

#!/bin/bash
# Look for other 2n8g recipes to compare naming patterns
fd -e yaml '2n8g' examples/configs/recipes/ | head -10

Length of output: 833


🏁 Script executed:

#!/bin/bash
# Search for patterns with num_nodes in YAML files to understand cluster config
rg -n 'num_nodes|num_gpus_per_node|cluster' examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml -A 2

Length of output: 51


Update logging directory, checkpoint directory, and WandB naming to match 2-node cluster configuration.

The file is configured for a 2-node setup (cluster.num_nodes: 2) but the logging directory, checkpoint directory, and WandB name all reference 1n8g. Update all three locations to use 2n8g:

-  checkpoint_dir: results/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+  checkpoint_dir: results/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e
   ...
-  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+  log_dir: logs/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e
   wandb_enabled: true
   tensorboard_enabled: true
   wandb:
     project: nemo-rl
-    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
+    name: grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e


🤖 Prompt for AI Agents
In examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g.yaml
around lines 46 to 51, update the logging directory, checkpoint directory, and
WandB name to reflect the 2-node configuration: change log_dir value from the
1n8g path to the equivalent 2n8g path, change the checkpoint directory entry
(wherever checkpoint_dir/checkpoints are defined in this file) from 1n8g to
2n8g, and update wandb.name from grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-e2e
to grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e so all three consistently
reference 2n8g.

cluster:
  gpus_per_node: 8
  num_nodes: 2
@@ -0,0 +1,42 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32
checkpointing:
  enabled: false
  checkpoint_dir: results/grpo-llama3.3-70b-instruct-4n8g-16k
policy:
  model_name: meta-llama/Llama-3.3-70B-Instruct
  train_micro_batch_size: 1
  max_total_sequence_length: 16384
  dtensor_cfg:
    enabled: false
  optimizer: null
  scheduler: null
  make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    activation_checkpointing: true
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 8
    sequence_parallel: true
    optimizer:
      lr: 3.0e-07
      min_lr: 3.0e-08
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 3.0e-08
  generation:
    vllm_cfg:
      tensor_parallel_size: 4
logger:
  log_dir: logs/grpo-llama3.3-70b-instruct-4n8g-16k
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-llama3.3-70b-instruct-4n8g-16k
cluster:
  gpus_per_node: 8
  num_nodes: 4
@@ -0,0 +1,41 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32
checkpointing:
  enabled: false
  checkpoint_dir: results/grpo-llama3.3-70b-instruct-4n8g
policy:
  model_name: meta-llama/Llama-3.3-70B-Instruct
  train_micro_batch_size: 1
  max_total_sequence_length: 4096
  dtensor_cfg:
    enabled: false
  optimizer: null
  scheduler: null
  make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 8
    sequence_parallel: true
    optimizer:
      lr: 3.0e-07
      min_lr: 3.0e-08
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 3.0e-08
  generation:
    vllm_cfg:
      tensor_parallel_size: 4
logger:
  log_dir: logs/grpo-llama3.3-70b-instruct-4n8g
  wandb_enabled: true
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-llama3.3-70b-instruct-4n8g
cluster:
  gpus_per_node: 8
  num_nodes: 4
@@ -0,0 +1,61 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 16
  num_generations_per_prompt: 32
  max_num_steps: 500
  val_batch_size: 5
  max_val_samples: 16
loss_fn:
  use_importance_sampling_correction: true
checkpointing:
  checkpoint_dir: results/grpo-qwen3-235b-16n8g
policy:
  model_name: Qwen/Qwen3-235B-A22B
  tokenizer:
    name: Qwen/Qwen3-235B-A22B
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 8192
  make_sequence_length_divisible_by: 1
  dtensor_cfg:
    enabled: false
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    converter_type: LlamaForCausalLM
    tensor_model_parallel_size: 2
    sequence_parallel: true
    pipeline_model_parallel_size: 8
    context_parallel_size: 2
    expert_model_parallel_size: 16
    activation_checkpointing: true
    num_layers_in_first_pipeline_stage: 11
    num_layers_in_last_pipeline_stage: 11
    moe_permute_fusion: true
    defer_fp32_logits: true
    optimizer:
      lr: 5.0e-07
      min_lr: 5.0e-08
      weight_decay: 0.0
      use_precision_aware_optimizer: true
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 5.0e-08
    fp8_cfg:
      enabled: false
  generation:
    stop_token_ids:
      - 128009
    vllm_cfg:
      tensor_parallel_size: 16
      async_engine: true
logger:
  log_dir: logs/grpo-qwen3-235b-16n8g
  wandb_enabled: true
  tensorboard_enabled: false # to avoid a bug
  wandb:
    project: nemo-rl
    name: grpo-qwen3-235b-16n8g
cluster:
  gpus_per_node: 8
  num_nodes: 16
@@ -0,0 +1,61 @@
defaults: ../../../grpo_math_1B.yaml
grpo:
  num_prompts_per_step: 16
  num_generations_per_prompt: 32
  max_num_steps: 500
  val_batch_size: 5
  max_val_samples: 16
loss_fn:
  use_importance_sampling_correction: true
checkpointing:
  checkpoint_dir: results/grpo-qwen3-235b-16n8g
⚠️ Potential issue | 🟠 Major

Fix checkpoint directory naming mismatch.

The checkpoint directory references grpo-qwen3-235b-16n8g, but this configuration is for a 32-node setup (32n8g). This will cause checkpoint conflicts if both configurations are used.

Apply this diff:

-  checkpoint_dir: results/grpo-qwen3-235b-16n8g
+  checkpoint_dir: results/grpo-qwen3-235b-32n8g
🤖 Prompt for AI Agents
In examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n8g.yaml around
line 11, the checkpoint_dir is incorrectly set to results/grpo-qwen3-235b-16n8g;
update it to results/grpo-qwen3-235b-32n8g so the directory name matches the
32-node configuration and avoids checkpoint conflicts, and verify any other
references or scripts that point to the old path are updated accordingly.

policy:
  model_name: Qwen/Qwen3-235B-A22B
  tokenizer:
    name: Qwen/Qwen3-235B-A22B
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 8192
  make_sequence_length_divisible_by: 1
  dtensor_cfg:
    enabled: false
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 1
    converter_type: LlamaForCausalLM
    tensor_model_parallel_size: 2
    sequence_parallel: true
    pipeline_model_parallel_size: 8
    context_parallel_size: 2
    expert_model_parallel_size: 16
    activation_checkpointing: true
    num_layers_in_first_pipeline_stage: 11
    num_layers_in_last_pipeline_stage: 11
    moe_permute_fusion: true
    defer_fp32_logits: true
    optimizer:
      lr: 5.0e-07
      min_lr: 5.0e-08
      weight_decay: 0.0
      use_precision_aware_optimizer: true
    scheduler:
      lr_warmup_iters: 2
      lr_warmup_init: 5.0e-08
    fp8_cfg:
      enabled: false
  generation:
    stop_token_ids:
      - 128009
    vllm_cfg:
      tensor_parallel_size: 16
      async_engine: true
logger:
  log_dir: logs/grpo-qwen3-235b-16n8g
  wandb_enabled: true
  tensorboard_enabled: false # to avoid a bug
  wandb:
    project: nemo-rl
    name: grpo-qwen3-235b-16n8g
Comment on lines +53 to +58
⚠️ Potential issue | 🟠 Major

Fix logging directory and WandB naming mismatch.

The log directory and WandB name reference 16n8g, but this configuration is for a 32-node setup (32n8g). This will cause confusion and potential data mixing in experiment tracking.

Apply this diff:

-  log_dir: logs/grpo-qwen3-235b-16n8g
+  log_dir: logs/grpo-qwen3-235b-32n8g
   wandb_enabled: true
   tensorboard_enabled: false # to avoid a bug
   wandb:
     project: nemo-rl
-    name: grpo-qwen3-235b-16n8g
+    name: grpo-qwen3-235b-32n8g
🤖 Prompt for AI Agents
In examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n8g.yaml around
lines 53 to 58 the log_dir and wandb.name incorrectly reference "16n8g" while
this config is for the 32-node setup; update log_dir to
logs/grpo-qwen3-235b-32n8g and set wandb.name to grpo-qwen3-235b-32n8g so both
filesystem logging and WandB experiment name match the 32n8g configuration.

cluster:
  gpus_per_node: 8
  num_nodes: 32