Bug Description
I'm trying to reproduce the reward curve of examples/geo3k_multi_turn, but I can't: the rewards do not increase during training. Even at the first step, the raw reward is around 0.1, whereas it is expected to be around 0.3.
Steps to Reproduce
- Use the official Docker container.
- Run the examples/geo3k_multi_turn training script.
Expected Behavior
The raw reward at the first step is around 0.3.
Actual Behavior
The raw reward at the first step is around 0.1.
Environment
- slime version:
- Python version:
- PyTorch version:
- CUDA/ROCm version:
- GPU type and count:
- OS:
- SGLang version (if relevant):
- Megatron-LM version (if relevant):
Logs
Additional Context
No response
Pre-submission Checklist