Project Website · Paper · Platform · Datasets · Clean Offline RLHF
This is the official PyTorch implementation of the paper "Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback". Clean-Offline-RLHF is an Offline Reinforcement Learning with Human Feedback codebase that provides implementations of offline RL algorithms trained from high-quality, realistic human feedback.
- [03-26-2024] 🔥 Update Mini-Uni-RLHF, a minimal out-of-the-box annotation tool for researchers, powered by streamlit.
- [03-24-2024] Release of the SMARTS environment training dataset, scripts, and labels. You can find them in the smarts branch.
- [03-20-2024] Updated the detailed setup bash files.
- [02-22-2024] Initial code release.
- Clone the repo
  ```bash
  git clone https://github.com/thomas475/Clean-Offline-RLHF.git
  cd Clean-Offline-RLHF
  ```
- Setup Anaconda environment
  ```bash
  conda create -n rlhf python==3.9
  conda activate rlhf
  ```
- Install Dependencies
  ```bash
  pip install -r requirements.txt
  ```
- Install hdf5
  ```bash
  conda install anaconda::hdf5
  ```
- Install Torch
  - For GPU Support (CUDA):
    ```bash
    pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
    ```
  - For CPU Only:
    ```bash
    pip install torch==1.13.1+cpu torchvision==0.14.1+cpu torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cpu
    ```
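To quickly confirm that the intended Torch build is active (and, for the CUDA variant, that a GPU is visible), you can run a short check like this:

```python
import torch

print(torch.__version__)          # expect 1.13.1+cu117 or 1.13.1+cpu
print(torch.cuda.is_available())  # True only for the CUDA build with a usable GPU
```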
Many of the datasets use MuJoCo as the environment, so it should be installed, too. See this for further details.
- Download the MuJoCo library:
  ```bash
  wget https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz
  ```
- Create the MuJoCo folder:
  ```bash
  mkdir ~/.mujoco
  ```
- Extract the library to the MuJoCo folder:
  ```bash
  tar -xvf mujoco210-linux-x86_64.tar.gz -C ~/.mujoco/
  ```
- Add environment variables (run `nano ~/.bashrc`):
  ```bash
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin
  export MUJOCO_GL=egl
  ```
- Reload the .bashrc file to register the changes.
  ```bash
  source ~/.bashrc
  ```
- Install dependencies:
  ```bash
  conda install -c conda-forge patchelf fasteners cython==0.29.37 cffi pyglfw libllvm11 imageio glew glfw mesalib
  sudo apt-get install libglew-dev
  ```
- Test that the library is installed.
  ```bash
  cd ~/.mujoco/mujoco210/bin
  ./simulate ../model/humanoid.xml
  ```
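As an additional sanity check from Python (assuming mujoco-py, gym, and d4rl are pulled in by requirements.txt; if not, install them first), you can try building one of the MuJoCo-backed D4RL environments:

```python
import gym
import d4rl  # registers the D4RL offline environments (assumed to be installed)

# Creating a MuJoCo-backed environment verifies that mujoco210 and its bindings are found.
env = gym.make("hopper-medium-v2")
obs = env.reset()
print(env.observation_space, env.action_space)
```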
Before using the offline RLHF algorithms, you should annotate your dataset using human feedback. If you wish to collect labeled datasets for new tasks, we refer to the platform part for crowdsourced annotation. The already collected crowdsourced annotation datasets (~15M steps) are available here.
The processed crowdsourced (CS) and scripted teacher (ST) labels can be found in the crowdsource_human_labels and generated_fake_labels folders, respectively.
Note: for comparison and validation purposes, we provide a fast track for scripted teacher (ST) label generation in fast_track/generate_d4rl_fake_labels.py.
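For intuition, a scripted teacher label for comparative feedback usually just prefers the trajectory segment with the higher ground-truth return. The snippet below is a minimal sketch of that idea, not the exact logic in fast_track/generate_d4rl_fake_labels.py:

```python
import numpy as np

def scripted_teacher_label(rewards_a, rewards_b):
    """Prefer the segment with the higher summed ground-truth reward.

    Returns 0 if segment A is preferred, 1 if segment B is preferred,
    and 0.5 when the two segments tie.
    """
    return_a, return_b = np.sum(rewards_a), np.sum(rewards_b)
    if return_a == return_b:
        return 0.5
    return 0 if return_a > return_b else 1
```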
The exported labels from the Uni-RLHF-Platform have to be transformed into an appropriate format first. To do this you can use the following script (replace [dir_path] with the location of the raw labels):
```bash
cd scripts
python3 transform_raw_labels.py --data_dir [dir_path]
```
You can configure the training of the auxiliary models (reward model, attribute mapping model, keypoint prediction model) by creating a custom config.yaml file; the available parameters can be seen in the TrainConfig object in rlhf/train_model.py, and an illustrative sketch of such a file is shown below.
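The field names below are hypothetical and only meant to convey the shape of such a file; consult the TrainConfig object in rlhf/train_model.py for the options that actually exist:

```yaml
# config.yaml -- illustrative sketch only; the real parameter names are
# defined by the TrainConfig object in rlhf/train_model.py.
env: hopper-medium-v2        # task whose feedback labels are used
feedback_type: comparative   # comparative / attribute / evaluative / keypoint
label_type: CS               # CS (crowdsourced) or ST (scripted teacher)
model_type: MLP              # MLP / TFM / CNN auxiliary model architecture
n_epochs: 100
batch_size: 256
lr: 3.0e-4
seed: 0
```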
To then train these models, run the following command:
```bash
cd rlhf
python3 train_model.py --config config.yaml
```
Following the Uni-RLHF codebase implementation, we modified the IQL, CQL, and TD3BC algorithms. Furthermore, we added the DiffusionQL algorithm. You can adjust the details of the experiments in the TrainConfig objects in the algorithm implementations found in /algorithms/offline, as well as in the files in the /config directory.
Example: train with Implicit Q-Learning (IQL). The log will be uploaded to wandb.
```bash
python3 algorithms/offline/iql_p.py
```
These are the possible variations of algorithms, feedback types, label types, and auxiliary model types:
| Algorithm | Feedback Type | Label Type | Auxiliary Model Type |
|---|---|---|---|
| IQL | COMPARATIVE | CS | MLP |
| CQL | ATTRIBUTE | ST | TFM |
| TD3BC | EVALUATIVE | | CNN |
| DIFFUSIONQL | KEYPOINT | | |
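For context on how these pieces fit together: with COMPARATIVE feedback, the auxiliary reward model is typically fit with a Bradley-Terry style cross-entropy loss over pairs of segments, so that the preferred segment receives the higher predicted return. The sketch below illustrates that idea with a hypothetical MLP reward model; it is not the repository's exact implementation (see rlhf/train_model.py for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardMLP(nn.Module):
    """Illustrative MLP reward model r(s, a) -> scalar (hypothetical, not the repo's class)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim) -> per-step rewards (batch, T, 1)
        return self.net(torch.cat([obs, act], dim=-1))

def comparative_loss(model, seg_a, seg_b, labels):
    """Bradley-Terry cross-entropy over predicted segment returns.

    seg_a / seg_b: dicts with 'obs' and 'act' tensors of shape (batch, T, dim).
    labels: (batch,) tensor, 0 if segment A is preferred and 1 if segment B is preferred.
    """
    return_a = model(seg_a["obs"], seg_a["act"]).sum(dim=1).squeeze(-1)  # (batch,)
    return_b = model(seg_b["obs"], seg_b["act"]).sum(dim=1).squeeze(-1)
    logits = torch.stack([return_a, return_b], dim=-1)                   # (batch, 2)
    return F.cross_entropy(logits, labels.long())
```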
Distributed under the MIT License. See LICENSE.txt for more information.
For any questions, please feel free to email yuanyf@tju.edu.cn.
If you find our work useful, please consider citing:
```bibtex
@inproceedings{anonymous2023unirlhf,
  title={Uni-{RLHF}: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback},
  author={Yuan, Yifu and Hao, Jianye and Ma, Yi and Dong, Zibin and Liang, Hebin and Liu, Jinyi and Feng, Zhixin and Zhao, Kai and Zheng, Yan},
  booktitle={The Twelfth International Conference on Learning Representations, ICLR},
  year={2024},
  url={https://openreview.net/forum?id=WesY0H9ghM},
}
```
