Experiments on the confidence calibration of reasoning models against human expert confidence.
This repository contains the code for the experiments in *Don't Think Twice! Over-Reasoning Impairs Confidence Calibration*, published at the ICML 2025 Workshop on Reliable and Responsible Foundation Models.
- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

The main script `main.py` runs experiments with different LLM models:
```bash
python main.py --input dataset/ipcc_statements.tsv --output xp/xp-20250710/results-gemini_2_5.tsv --models gemini_2_5_pro --rpm 60 --template original --max-tokens 500

python main.py --input dataset/iarc.tsv --models gemini_2_5_flash --rpm 60 --template iarc --output xp/xp-20250710/results-iarc-gemini_2_5_flash-reasoning-64.tsv --reasoning-budget 64
```

Arguments:
- `--input`: Input TSV file with columns `statement_idx`, `statement`, `confidence` (see the example after this list)
- `--output`: Output TSV file path for results
- `--models`: Comma-separated list of model names, or `all` (see `models.yaml`)
- `--rpm`: Rate limit in requests per minute (default: 60)
- `--template`: Name of the prompt template to use (default: `original`)
- `--max-tokens`: Maximum tokens per response (optional; no limit if not specified)
- `--reasoning-budget`: Reasoning budget for Gemini models (optional)
- `--use-search`: Enable web search for Gemini models (optional)
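For reference, the input file is a tab-separated table with a header row. The rows below are fabricated for illustration, and the numeric encoding of the `confidence` column is an assumption; see the files in `dataset/` for the real data:

```tsv
statement_idx	statement	confidence
0	Human influence has warmed the atmosphere, ocean and land.	3
1	Global surface temperature will decline under all emissions scenarios.	0
```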
Use `eval.py` to compute accuracy metrics from experiment results:

```bash
python eval.py results.tsv
```

This will output:
- Overall accuracy
- Confidence score (0.0: lowest, 3.0: highest)
- Number of correct predictions
- Error statistics
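Beyond these summaries, you can inspect a results file directly. Below is a minimal pandas sketch, assuming hypothetical column names `label` and `ground_truth`; check the headers of your output TSV for the actual fields:

```python
import pandas as pd

# Load an experiment results file produced by main.py.
results = pd.read_csv("results.tsv", sep="\t")

# Column names are illustrative assumptions; inspect results.columns
# to find the actual fields in your output file.
accuracy = (results["label"] == results["ground_truth"]).mean()
print(f"Accuracy: {accuracy:.3f}")
```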
Repository structure:

- `main.py`: Main experiment runner
- `eval.py`: Results evaluation script
- `router_adapter.py`: API router for different LLM models
- `label_parser.py`: Parser for model responses
- `models.yaml`: Model configurations
- `prompts/`: Prompt templates
- `utils/`: Utility functions and helper modules
- `logs/`: API request logs and error tracking
- `xp/`: Experiment data, including input statements and model outputs
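For orientation, a `models.yaml` entry might look like the sketch below; the field names are illustrative assumptions, not the repository's actual schema:

```yaml
# Hypothetical entry -- field names are assumptions; see the real
# models.yaml in this repository for the actual schema.
gemini_2_5_flash:
  provider: google
  model_id: gemini-2.5-flash
```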
If you found our work helpful, please cite us:
```bibtex
@inproceedings{lacombe2025think,
  title={Don't Think Twice! Over-Reasoning Impairs Confidence Calibration},
  author={Romain Lacombe and Kerrie Wu and Eddie Dilworth},
  booktitle={ICML 2025 Workshop on Reliable and Responsible Foundation Models},
  year={2025},
  url={https://openreview.net/forum?id=e7G5aeMOUP}
}
```
