First, thanks for releasing this great work and the code! 🙏
I’m trying to reproduce the results with run_efficient_gpt4o_judge.sh on 8 × A800 80GB. After a few minor config changes (listed below), I see:
- Tool-call metrics drift down (the model increasingly prefers direct answers over tool calls).
- Training becomes unstable (exploding grad norm) and responses grow extremely long.
- The model starts to output highly repetitive strings.
I’d appreciate guidance on whether my settings are problematic and how to stabilize this setup.
Environment
- GPUs: 8 × A800 80GB
- CUDA 12.2
- transformers 4.51.0
- torch 2.6.0
- flash_attn 2.7.4.post1
- accelerate 1.10.0
Command & key overrides
I run run_efficient_gpt4o_judge.sh with the following changes relative to the script (a sketch of the resulting launch command is below):
- `data.train_batch_size=32` to log rewards more frequently (per the verl docs, this does not affect the mini-batch size for PPO actor updates)
- `rollout.log_prob_micro_batch_size_per_gpu=16` to speed up log-prob computation
- `max_num_batched_tokens=16384` to save memory
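For reference, this is roughly how those overrides look on the launch command. The entry point and the fully-qualified key names are my best guesses from the verl docs, and every other flag is left exactly as in run_efficient_gpt4o_judge.sh:

```bash
# Minimal sketch, not the full script: entry point and full key paths
# assumed from the verl docs; all other arguments unchanged.
python3 -m verl.trainer.main_ppo \
    data.train_batch_size=32 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.max_num_batched_tokens=16384
```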
What I observe
- Tool-call metrics drift: `tool_call/success_tool_call_rate` decreases over time.
- Instability: gradient explosion during training (`actor/grad_norm` spikes up to ~1e10); `response_length` also starts to explode and the model begins emitting a very long, repetitive sequence. Some logs can be viewed here.
Questions / suspected causes
Does reducing `data.train_batch_size` to 32 harm GRPO stability?
My understanding from the docs is that this value is the global number of prompts used to generate a set of sampled trajectories/rollouts per step. The total number of responses/trajectories is `data.train_batch_size * actor_rollout_ref.rollout.n`, and this set is split into mini-batches of size `ppo_mini_batch_size` for PPO actor updates, so reducing it should not change the batch size used for each gradient update (a toy calculation is sketched below).
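For concreteness, here is that arithmetic with made-up numbers; only `train_batch_size=32` comes from my run, while `rollout_n` and `ppo_mini_batch_size` are hypothetical:

```bash
# Toy batch arithmetic; rollout_n and ppo_mini_batch_size are hypothetical.
train_batch_size=32       # global prompts per rollout step (my override)
rollout_n=8               # responses sampled per prompt (hypothetical)
ppo_mini_batch_size=64    # mini-batch size for actor updates (hypothetical)

trajectories=$((train_batch_size * rollout_n))   # 32 * 8 = 256 trajectories
echo "mini-batches per PPO epoch: $((trajectories / ppo_mini_batch_size))"   # 256 / 64 = 4
```

Under this reading, shrinking `train_batch_size` only reduces the number of mini-batches per step, not the size of each gradient update.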
Thanks again for the great project! I’m happy to run ablations if you suggest specific knobs to tweak.