This paper investigates an under-explored challenge in large language models (LLMs): chain-of-thought prompting with noisy rationales, i.e., prompts whose in-context examples contain irrelevant or inaccurate reasoning thoughts. We construct the NoRa dataset, tailored to evaluate the robustness of reasoning in the presence of noisy rationales. Our findings on NoRa reveal a prevalent vulnerability to such noise among current LLMs, with existing robust methods such as self-correction and self-consistency showing limited efficacy. Notably, compared to prompting with clean rationales, GPT-3.5 drops by 1.4%-19.8% in accuracy with irrelevant thoughts and more drastically by 2.2%-40.4% with inaccurate thoughts.
Addressing this challenge necessitates external supervision that should be accessible in practice. Here, we propose the method of contrastive denoising with noisy chain-of-thought (CD-CoT). It enhances LLMs’ denoising-reasoning capabilities by contrasting noisy rationales with only one clean rationale, which can be the minimal requirement for denoising-purpose prompting. This method follows a principle of exploration and exploitation: (1) rephrasing and selecting rationales in the input space to achieve explicit denoising and (2) exploring diverse reasoning paths and voting on answers in the output space. Empirically, CD-CoT demonstrates an average improvement of 17.8% in accuracy over the base model and shows significantly stronger denoising capabilities than baseline methods.
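The explore-and-exploit loop described above can be summarized in a short sketch. This is an illustrative rendering of the abstract's description only: the function names (`cd_cot_sketch`, `rephrase`, `answer`) and the selection heuristic are hypothetical placeholders, not identifiers from the `method/` package, and the real implementation adds prompt templates and model-specific handling.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A demonstration is (question, rationale, gold_answer); all names here are
# illustrative placeholders rather than the repository's actual API.
Example = Tuple[str, str, str]

def cd_cot_sketch(rephrase: Callable[[Example, Example], str],
                  answer: Callable[[str, List[Example]], str],
                  question: str,
                  noisy_examples: List[Example],
                  clean_example: Example,
                  num_rephrasings: int = 3,
                  num_paths: int = 5) -> str:
    """Hedged sketch of CD-CoT's two-stage denoise-then-vote procedure.

    `rephrase` and `answer` stand in for LLM calls; the actual prompt
    templates and selection heuristics live in method/.
    """
    denoised: List[Example] = []
    for q, noisy_rationale, gold in noisy_examples:
        # (1) Input space: rephrase each noisy rationale several times while
        #     contrasting it against the single clean rationale.
        candidates = [rephrase((q, noisy_rationale, gold), clean_example)
                      for _ in range(num_rephrasings)]
        # Select a rephrasing that still reaches the demonstration's known
        # answer (explicit denoising via selection); otherwise keep the first.
        kept = next((c for c in candidates
                     if c and gold in c.splitlines()[-1]),
                    candidates[0])
        denoised.append((q, kept, gold))

    # (2) Output space: sample several diverse reasoning paths with the
    #     denoised demonstrations and take a majority vote on the answer.
    votes = [answer(question, denoised) for _ in range(num_paths)]
    return Counter(votes).most_common(1)[0][0]
```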
Choose one of the following installation commands based on your OpenAI API version:
## Use the new OpenAI API:
pip install openai requests pandas nltk pyyaml scikit-learn tiktoken python-dotenv
## Use the old OpenAI API:
pip install openai==0.28 requests pandas nltk pyyaml scikit-learn tiktoken python-dotenv
You can set up your OpenAI API credentials using either environment variables or a configuration file:
export OPENAI_API_KEY=[YOUR_API_KEY_HERE]
# if you have multiple API keys
export OPENAI_API_KEY=[key1]:[key2]:[key3]...
# if you require a specific API base
export OPENAI_API_BASE=[YOUR_API_BASE_HERE]
Or create a .env file in the project root directory:
OPENAI_API_KEY=[YOUR_API_KEY_HERE]
# if you have multiple keys
OPENAI_API_KEY=[key1]:[key2]:[key3]...
# if you require a specific API base
OPENAI_API_BASE=[YOUR_API_BASE_HERE]
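As a reference for how the colon-separated multi-key format can be consumed, here is a minimal Python sketch using the python-dotenv dependency. It only illustrates the parsing; it is not the repository's actual loading code in `llm_model/`.

```python
import os
from dotenv import load_dotenv  # provided by the python-dotenv dependency

load_dotenv()  # read a .env file from the working directory, if present

# Multiple keys are colon-separated, e.g. "key1:key2:key3".
api_keys = [k for k in os.environ.get("OPENAI_API_KEY", "").split(":") if k]
api_base = os.environ.get("OPENAI_API_BASE")  # optional custom endpoint

if not api_keys:
    raise RuntimeError("OPENAI_API_KEY is not set")
```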
The repository is organized as follows:
- `data/`: Contains raw datasets and preprocessed datasets
  - Original datasets used for generation
  - Pre-processed datasets ready for experiments
- `data_process/`: Libraries and utilities for dataset processing and manipulation
- `method/`: Implementation of different noise handling methods
  - Various approaches for handling noisy rationales in chain-of-thought prompting
- `llm_model/`: Interfaces for different large language models
  - Wrappers and utilities for interacting with various LLMs
- `noise_test.py`: Main experiment script for testing noise rationale handling
- `config.yml`: Configuration file for experiment settings
  - Model parameters
  - Dataset options
  - Testing configurations
NoRa supports three task categories with corresponding subtasks:
- Math: base-9, base-11
- Symbolic: equal, longer
- Commonsense: (no subtasks)
Configure experiment settings in config.yml and run:
python noise_test.py
For detailed configuration parameters, please refer to config/introduction.md
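Before launching an experiment, you can sanity-check the configuration with the pyyaml dependency. The snippet below only loads and echoes the file and makes no assumption about its key names; the actual schema is documented in config/introduction.md.

```python
import yaml  # provided by the pyyaml dependency

# Load config.yml from the repository root and echo it back, so the model
# parameters, dataset options, and testing configurations can be reviewed
# before running noise_test.py.
with open("config.yml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(yaml.safe_dump(config, sort_keys=False))
```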
# Zero-shot testing
python noise_test.py -task [task]_[subtask]_zeroshot -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
# Clean ICL testing
python noise_test.py -task [task]_[subtask]_clean -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
# Noisy ICL testing
python noise_test.py -task [task]_[subtask]_[irrelevant|inaccurate]_[easy|medium|hard] -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
Available arguments:
- `task_subtask`: math_base-9, math_base-11, symbolic_equal, symbolic_longer, commonsense
- `method`: basemodel, CD-CoT, smoothllm, selfdenoise, selfpolish, contrastivecot, ISC, SCO, BT
- `noise-type`: irrelevant, inaccurate
- `difficulty`: easy, medium, hard
- `model`: e.g., gpt-3.5-turbo-0125
- `test_num`: e.g., 100
Examples:
# Clean data testing
python noise_test.py -task math_base-9_clean -method basemodel
# Noisy data testing with different configurations
python noise_test.py -task math_base-9_inaccurate_easy -method basemodel
python noise_test.py -task symbolic_longer_irrelevant_easy -method CD-CoT
python noise_test.py -task commonsense_inaccurate_hard -method contrastivecot
Results will appear in ./results/{task}/{subtask}/{model}/{method}/
The log file name follows the pattern log_[ICL_|][n_clean_shots]clean_[noise_[n_noisy_shots][inaccurate|irrelevant]_[fixed|random]_ratio[ratio]|origin]_case[cases_num]_temp[temperature]_n[reasoning_times].json
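To gather results across runs, a simple glob over the results directory suffices. The sketch below assumes only the directory layout and JSON extension shown above, not any particular content schema of the log files.

```python
import glob
import json
import os

RESULTS_ROOT = "./results"

# Collect every JSON log produced by noise_test.py, keyed by its path.
logs = {}
for path in glob.glob(os.path.join(RESULTS_ROOT, "**", "*.json"), recursive=True):
    with open(path, "r", encoding="utf-8") as f:
        logs[path] = json.load(f)

print(f"Loaded {len(logs)} result files from {RESULTS_ROOT}")
```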
If you find our work helpful, please kindly cite our paper:
@inproceedings{zhou2024can,
title={Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?},
author={Zhou, Zhanke and Tao, Rong and Zhu, Jianing and Luo, Yiwen and Wang, Zengmao and Han, Bo},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)},
year={2024},
url={https://openreview.net/pdf?id=FbuODM02ra}
}