
Different results among many runs #9

Description

@LinZichuan

Hi @Yangsenqiao ,

I encountered inconsistent results when running the script run_efficient_gpt4o_judge.sh twice independently:

  1. First run: The success tool call ratio increased rapidly, reaching ~1.0 by the 10th step. This suggests the model may have collapsed to using tools for nearly all samples. This behavior does not match the green curve in Fig. 3(a) of the paper.
[Screenshot: success tool call ratio curve from the first run]
  2. Second run: The ratio increased more slowly, reaching only ~0.65 by the 5th step.
[Screenshot: success tool call ratio curve from the second run]
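
In case it helps narrow down where the variance comes from, here is a minimal sketch of how I would pin the random seeds before launching the script, assuming a standard PyTorch setup (`seed_everything` is just an illustrative helper, not something from this repo; any GPU kernels or the GPT-4o judge itself could still be non-deterministic):

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Pin the common sources of randomness for a (mostly) reproducible run.

    Note: non-deterministic GPU kernels and an external judge (e.g. GPT-4o)
    can still introduce run-to-run variance; this only removes seed-related
    randomness.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Optional: trade speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(42)
```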

Did you observe similar behavior in your experiments?

Looking forward to your reply. Thanks!
