# Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

This repository contains code for the paper "[Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking](https://arxiv.org/abs/2403.03185)".

ORPO is currently supported for RLHF and four non-RLHF environments. The code for non-RLHF experiments is in this repository, while the RLHF experiment code is at https://github.com/cassidylaidlaw/llm_optimization.


All Python code is under the `occupancy_measures` package. Run

to install dependencies.
## Training the ORPO policies
Checkpoints for the behavioral cloning (BC) trained safe policies are stored in the `data/safe_policy_checkpoints` directory. For now, these checkpoints were generated with Python 3.9; in the future, we will provide checkpoints that work with all Python versions. You can use these checkpoints to train your own ORPO policies with the following commands:

- state-action occupancy measure regularization:
```
python -m occupancy_measures.experiments.orpo_experiments with env_to_run=$ENV reward_fun=proxy exp_algo=ORPO 'om_divergence_coeffs=['$COEFF']' 'checkpoint_to_load_policies=["'$BC_CHECKPOINT'"]' checkpoint_to_load_current_policy=$BC_CHECKPOINT seed=$SEED experiment_tag=state-action 'om_divergence_type=["'$TYPE'"]'
```
- state occupancy measure regularization:
```
python -m occupancy_measures.experiments.orpo_experiments with env_to_run=$ENV reward_fun=proxy exp_algo=ORPO 'om_divergence_coeffs=['$COEFF']' use_action_for_disc 'checkpoint_to_load_policies=["'$BC_CHECKPOINT'"]' checkpoint_to_load_current_policy=$BC_CHECKPOINT seed=$SEED experiment_tag=state 'om_divergence_type=["'$TYPE'"]'
```
- action distribution regularization:
```
python -m occupancy_measures.experiments.orpo_experiments with env_to_run=$ENV reward_fun=proxy exp_algo=ORPO action_dist_kl_coeff=$COEFF seed=$SEED 'checkpoint_to_load_policies=["'$BC_CHECKPOINT'"]' checkpoint_to_load_current_policy=$BC_CHECKPOINT experiment_tag=AD 'om_divergence_type=["'$TYPE'"]'
```
- true reward:
```
python -m occupancy_measures.experiments.orpo_experiments with env_to_run=$ENV reward_fun=true exp_algo=ORPO 'om_divergence_coeffs=['$COEFF']' 'checkpoint_to_load_policies=["'$BC_CHECKPOINT'"]' checkpoint_to_load_current_policy=$BC_CHECKPOINT seed=$SEED experiment_tag=state-action
```
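
For concreteness, here is a minimal sketch of how the shell variables in the commands above might be set before launching a run. The checkpoint path and hyperparameter values below are placeholders rather than values from the paper; the available environments and divergence types are listed below.

```
# Placeholder values -- substitute your own checkpoint path and hyperparameters.
ENV=traffic                          # any environment from the list below
TYPE=kl                              # any divergence type from the list below
COEFF=0.0005                         # regularization coefficient (see scaling notes below)
SEED=0
BC_CHECKPOINT=path/to/bc_checkpoint  # e.g. a checkpoint under data/safe_policy_checkpoints/

# State-action occupancy measure regularization with the settings above.
python -m occupancy_measures.experiments.orpo_experiments with env_to_run=$ENV reward_fun=proxy exp_algo=ORPO 'om_divergence_coeffs=['$COEFF']' 'checkpoint_to_load_policies=["'$BC_CHECKPOINT'"]' checkpoint_to_load_current_policy=$BC_CHECKPOINT seed=$SEED experiment_tag=state-action 'om_divergence_type=["'$TYPE'"]'
```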

You can set `ENV` to any of the following options:
- traffic ([repo](https://github.com/shivamsinghal001/flow_reward_misspecification))
- pandemic ([repo](https://github.com/shivamsinghal001/pandemic))
- glucose ([repo](https://github.com/shivamsinghal001/glucose))
- tomato level=4 (defined within `occupancy_measures/envs/tomato_environment.py`)

You can set `TYPE` to any of the following divergences:
- $\sqrt{\chi^2}$ divergence: "sqrt_chi2"
- $\chi^2$ divergence: "chi2"
- Kullback-Leibler divergence: "kl"
- Total variation: "tv"
- Wasserstein: "wasserstein"

For experiments using the $\sqrt{\chi^2}$ divergence, we used the following range of scale-independent coefficients for each regularization technique: 1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01. These must be multiplied by the per-timestep proxy reward under the safe policy for each environment:
- traffic: 2e-4
- pandemic: 0.08
- glucose: 0.05
- tomato: 0.05
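
For example, in the traffic environment a scale-independent coefficient of 0.1 corresponds to passing 0.1 × 2e-4 = 2e-5 as the regularization coefficient. A small sketch of this conversion (the variable names are illustrative):

```
# Illustrative: convert a scale-independent coefficient into the value passed as COEFF.
# Traffic environment with the sqrt_chi2 divergence: per-timestep proxy reward is 2e-4.
SCALE_INDEPENDENT_COEFF=0.1
COEFF=$(python -c "print($SCALE_INDEPENDENT_COEFF * 2e-4)")  # approximately 2e-5
```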

For experiments using the KL divergence, we used the following range of scale-independent coefficients for each regularization technique: 0.0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0. These must be multiplied by the per-timestep rewards in each environment:
- traffic: 0.0005
- pandemic: 0.06
- glucose: 0.03
