First, thanks for releasing this great work and the code! 🙏
I’m trying to reproduce the results with run_efficient_gpt4o_judge.sh on 8 × A800 80GB. After a few minor config changes (listed below), I see:
- Tool-call metrics drift down (the model increasingly prefers direct answers over tool calls).
- Training becomes unstable (exploding grad norm) and responses grow extremely long.
- The model starts to output highly repetitive strings.
I’d appreciate guidance on whether my settings are problematic and how to stabilize this setup.
Environment
- GPUs: 8 × A800 80GB
- CUDA 12.2
- transformers 4.51.0
- torch 2.6.0
- flash_attn 2.7.4.post1
- accelerate 1.10.0
Command & key overrides
I run run_efficient_gpt4o_judge.sh with the following changes relative to the script (a sketch of the resulting launch command is below):
- `data.train_batch_size=32` to log rewards more frequently (per the verl docs, this does not affect the mini-batch size for PPO actor updates)
- `rollout.log_prob_micro_batch_size_per_gpu=16` to speed up log-prob computation
- `max_num_batched_tokens=16384` to save memory
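For reference, this is roughly how those overrides look on the launch command. The entry point and the fully-qualified key names are my best guesses from the verl docs, and every other flag is left exactly as in run_efficient_gpt4o_judge.sh:

```bash
# Minimal sketch, not the full script: entry point and full key paths
# assumed from the verl docs; all other arguments unchanged.
python3 -m verl.trainer.main_ppo \
    data.train_batch_size=32 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.max_num_batched_tokens=16384
```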
What I observe
- Tool-call metrics drift: `tool_call/success_tool_call_rate` decreases over time.
- Instability: gradient explosion during training (`actor/grad_norm` spikes up to ~1e10); `response_length` also starts to explode and the model begins emitting a very long, repetitive sequence. Some logs can be viewed here.
Questions / suspected causes
Does reducing `data.train_batch_size` to 32 harm GRPO stability?
My understanding from the docs is that this value is the global number of prompts used to generate a set of sampled trajectories/rollouts per step. The total number of responses/trajectories is `data.train_batch_size * actor_rollout_ref.rollout.n`, and this set is split into mini-batches of size `ppo_mini_batch_size` for PPO actor updates, so reducing it should not change the batch size used for each gradient update (a toy calculation is sketched below).
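For concreteness, here is that arithmetic with made-up numbers; only `train_batch_size=32` comes from my run, while `rollout_n` and `ppo_mini_batch_size` are hypothetical:

```bash
# Toy batch arithmetic; rollout_n and ppo_mini_batch_size are hypothetical.
train_batch_size=32       # global prompts per rollout step (my override)
rollout_n=8               # responses sampled per prompt (hypothetical)
ppo_mini_batch_size=64    # mini-batch size for actor updates (hypothetical)

trajectories=$((train_batch_size * rollout_n))   # 32 * 8 = 256 trajectories
echo "mini-batches per PPO epoch: $((trajectories / ppo_mini_batch_size))"   # 256 / 64 = 4
```

Under this reading, shrinking `train_batch_size` only reduces the number of mini-batches per step, not the size of each gradient update.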
Thanks again for the great project! I’m happy to run ablations if you suggest specific knobs to tweak.