This repository contains code for fine-tuning Large Language Models (LLMs) to learn the boundaries of refusal, specifically for cases where the provided context lacks sufficient information.
For a detailed explanation of the experiment and results, check out the blog post: Fine-Tuning LLMs for Refusal.
The goal of this project is to calibrate LLMs to refuse answering questions when:
- The answer is not present in the provided context.
- The context is missing entirely.
- The question might lead to hallucinations.
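For illustration, the first two cases can be turned into training pairs where the target is either the grounded answer or a refusal. This is a minimal sketch; the field names, refusal wording, and prompt layout below are assumptions, not the repository's actual schema:

```python
# Illustrative only: field names, prompt layout, and the refusal string are
# assumptions, not this repository's actual data schema.
REFUSAL = "I don't have enough information in the provided context to answer that."

def build_example(question, context, answer):
    """Build one training pair; the target is a refusal when no answer exists."""
    prompt = (
        f"Context: {context if context else '(no context provided)'}\n"
        f"Question: {question}"
    )
    completion = answer if answer is not None else REFUSAL
    return {"prompt": prompt, "completion": completion}

# Answerable: the context contains the answer.
ex_answerable = build_example(
    "Who wrote Hamlet?",
    "Hamlet is a tragedy by William Shakespeare.",
    "William Shakespeare",
)
# Unanswerable: context is present but does not contain the answer.
ex_unanswerable = build_example(
    "When was Hamlet first performed?",
    "Hamlet is a tragedy by William Shakespeare.",
    None,
)
# Missing context: no context at all, so a refusal is the target.
ex_no_context = build_example("When was Hamlet first performed?", None, None)
```

Pairing the same context with both an answerable and an unanswerable question, as above, is what teaches the boundary rather than a blanket habit of refusing.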
## Project Structure

```
.
├── src/
│   ├── dataset_prep.py   # Data preparation and calibration logic
│   ├── train.py          # QLoRA fine-tuning script
│   └── model_eval.py     # Evaluation and judging logic
├── data/                 # (Optional) Directory for raw/processed datasets
├── outputs/              # Model checkpoints and metrics
├── requirements.txt      # Project dependencies
├── LICENSE               # MIT License
└── README.md             # Project documentation
```
## Installation

```bash
# Clone the repository
git clone https://github.com/your-username/refusal-training.git
cd refusal-training

# Install dependencies
pip install -r requirements.txt
```

## Usage

Generate calibrated training, evaluation, and test datasets from various sources (SQuAD, Natural Questions, etc.):
```bash
python src/dataset_prep.py
```

Train the model using QLoRA. The script is configured for Qwen-based models but can be adapted to other architectures.
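For orientation, a QLoRA setup of this kind typically loads the base model with 4-bit quantization and attaches trainable LoRA adapters. The configuration sketch below uses `transformers` and `peft`; the model name, target modules, and every hyperparameter shown are illustrative assumptions, not the values used by `src/train.py`:

```python
# Configuration sketch of a QLoRA setup (illustrative only: model name,
# target modules, and hyperparameters are assumptions, not this repo's values).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",             # hypothetical Qwen base model
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only adapter weights train
```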
```bash
python src/train.py
```

Evaluate the fine-tuned model's refusal accuracy and hallucination rate using an LLM judge (via OpenRouter).
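As a rough sketch of the judging step: OpenRouter exposes an OpenAI-compatible chat-completions API, so the judge can be asked to emit a single label that is then parsed into a verdict. The prompt wording and label set below are assumptions, not the repository's actual evaluation protocol:

```python
# Hedged sketch of an LLM-judge step. The prompt template and label set are
# assumptions; only the verdict-parsing logic is meant to be load-bearing.
JUDGE_PROMPT = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Model answer: {answer}\n\n"
    "Classify the model answer with exactly one word: "
    "CORRECT, REFUSAL, or HALLUCINATION."
)

VALID_LABELS = {"CORRECT", "REFUSAL", "HALLUCINATION"}

def parse_verdict(reply: str) -> str:
    """Normalize a judge reply ('Refusal.', ' hallucination', ...) to a label."""
    words = reply.strip().split()
    if not words:
        return "UNPARSEABLE"
    label = words[0].strip(".,!").upper()
    return label if label in VALID_LABELS else "UNPARSEABLE"
```

Constraining the judge to a closed label set and normalizing its reply keeps the metric computation (refusal accuracy, hallucination rate) mechanical rather than dependent on free-form judge prose.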
```bash
python src/model_eval.py
```

## Training Data

The training workflow involves creating a balanced dataset with:
- Answerable Questions: Standard RAG examples.
- Unanswerable Questions: Questions where the context is provided but doesn't contain the answer.
- Missing Context: Questions with no context at all, where the model is expected to refuse if it does not know the answer.
- Hallucination Prevention: Ensuring the model doesn't fabricate facts.
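One common way to balance categories like these is to downsample each to the size of the smallest, so no single case dominates training. The sketch below assumes a per-example `category` field, which is an illustrative choice rather than the repository's actual layout:

```python
# Illustrative balancing step: the "category" field and category names are
# assumptions, not this repository's actual dataset layout.
import random

def balance_dataset(examples, seed=0):
    """Downsample every category to the smallest category's size, then shuffle."""
    by_category = {}
    for ex in examples:
        by_category.setdefault(ex["category"], []).append(ex)
    n = min(len(group) for group in by_category.values())
    balanced = [ex for group in by_category.values() for ex in group[:n]]
    random.Random(seed).shuffle(balanced)  # deterministic shuffle for repeatability
    return balanced

examples = (
    [{"category": "answerable", "id": i} for i in range(5)]
    + [{"category": "unanswerable", "id": i} for i in range(3)]
    + [{"category": "missing_context", "id": i} for i in range(4)]
)
balanced = balance_dataset(examples)  # 3 examples per category -> 9 total
```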
## License

This project is licensed under the MIT License. See the LICENSE file for details.