This is the codebase to reproduce the experiments in the paper *From Demonstrations to Rewards: Alignment Without Explicit Human Preferences*.
Install the dependencies:

```shell
pip install -r requirements.txt
pip install poetry
```
Demonstrations are generated by a well-trained policy model, [vwxyzjn/EleutherAI_pythia-6.9b-deduped__reward__tldr](https://huggingface.co/vwxyzjn/EleutherAI_pythia-6.9b-deduped__reward__tldr/tree/reward__44413__1706651113):

```shell
./bash/sft_data_generation.sh
```
After running this, the demonstrations are written to the `generated_data/` folder. We currently use a 2.8B model to generate demonstrations on an A40 GPU to avoid OOM issues; you can configure the demonstration-generation model in `bash/sft_data_generation.sh`.
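At a high level, demonstration generation samples one completion per TL;DR prompt from the policy model and saves the pairs to disk. The sketch below illustrates that shape only; the function name, the JSONL record layout, and the prompt format are assumptions for illustration, not code from this repository, and `generate_fn` stands in for the actual model's generate call:

```python
import json
from pathlib import Path


def generate_demonstrations(prompts, generate_fn, out_path):
    """Sample one completion per prompt and write (prompt, completion)
    records as JSON lines. `generate_fn` is a stand-in for the policy
    model's generation call; the JSONL layout here is an assumption."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w") as f:
        for prompt in prompts:
            record = {"prompt": prompt, "completion": generate_fn(prompt)}
            f.write(json.dumps(record) + "\n")


# Usage with a stub in place of the real 2.8B model:
generate_demonstrations(
    ["SUBREDDIT: r/test\nPOST: ...\nTL;DR:"],
    lambda p: " a placeholder summary",
    "generated_data/demo.jsonl",
)
```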
Then we run the IRL pipeline:

```shell
./bash/IRL_Pipeline.sh
```
Our IRL pipeline consists of four steps; see the bash script for details.
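For intuition only: one common way to estimate a reward from demonstrations (without explicit preference labels) is a Bradley-Terry-style objective that treats demonstration outputs as implicitly preferred over the current policy's samples. This is a generic sketch of that idea, not necessarily the exact objective used in this pipeline:

```python
import math


def reward_loss(demo_scores, sample_scores):
    """Bradley-Terry-style loss treating each demonstration as preferred
    over a paired policy sample: mean of -log sigmoid(r_demo - r_sample).
    Driving this loss down pushes demonstration rewards above sample
    rewards, which is the core of this (illustrative) IRL objective."""
    losses = [
        -math.log(1.0 / (1.0 + math.exp(-(d - s))))
        for d, s in zip(demo_scores, sample_scores)
    ]
    return sum(losses) / len(losses)
```

With equal scores the loss is log 2, and it shrinks as the reward model learns to score demonstrations higher than policy samples.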
We evaluate the proposed IRL method based on the quality of both the estimated reward models and the resulting policy models:
- For the reward model, we evaluate reward accuracy on a held-out TL;DR preference dataset.
- For the policy model, we evaluate performance via the reward score from a held-out 6.9B reward model, as well as the ChatGPT-evaluated win rate against a high-quality reference dataset generated by a public 6.9B PPO model (`vwxyzjn/EleutherAI_pythia-6.9b-deduped__ppo_left_padding_new_nowhiten_reward__tldr`).
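The two metrics above can be sketched as follows; the function names and the tie-handling convention are assumptions for illustration, not the repository's evaluation code:

```python
def preference_accuracy(reward_fn, pairs):
    """Fraction of (chosen, rejected) preference pairs where the reward
    model scores the human-preferred summary strictly higher."""
    correct = sum(
        reward_fn(chosen) > reward_fn(rejected) for chosen, rejected in pairs
    )
    return correct / len(pairs)


def win_rate(judgments):
    """Win rate of the policy against the reference generations, given
    per-example judge verdicts ("policy", "reference", or "tie").
    Counting a tie as half a win is a common convention, assumed here."""
    wins = sum(j == "policy" for j in judgments)
    ties = sum(j == "tie" for j in judgments)
    return (wins + 0.5 * ties) / len(judgments)
```

For example, with `len` as a toy reward function, `preference_accuracy` simply checks how often the chosen summary is longer than the rejected one.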