This is the code for our paper:
- Jaekyeom Kim*, Seohong Park* and Gunhee Kim (*equal contribution). Unsupervised Skill Discovery with Bottleneck Option Learning. In ICML, 2021. [paper] [talk] [slides]
It includes the implementation of IBOL: the linearizer, the skill discovery method trained on top of it, and the downstream tasks used for their evaluation.
If you find our work or this code useful in your research, please cite:

    @inproceedings{kim2021_ibol,
      title={Unsupervised Skill Discovery with Bottleneck Option Learning},
      author={Kim, Jaekyeom and Park, Seohong and Kim, Gunhee},
      booktitle={International Conference on Machine Learning (ICML)},
      year={2021}
    }
We show some example skills discovered by IBOL in four MuJoCo environments without rewards.
This code was tested in an environment with the following conditions:
- Ubuntu 16.04 machine
- CUDA-compatible GPUs
- Python 3.7.10
- Install the MuJoCo version 2.0 binaries, following the instructions. Note that multiple licensing options are offered, including a 30-day free trial.
- At the top-level directory of this repository, run the following command to set up the environment:

      pip install --no-deps -r requirements.txt
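Before launching any training, the MuJoCo setup can be sanity-checked with a short script such as the one below. This is only a minimal sketch, not part of this repository; it assumes `mujoco_py` and `gym` were installed by `requirements.txt` and that the MuJoCo 2.0 binaries and license key are in place.

    # Minimal sanity check for the MuJoCo 2.0 setup (not part of this repository).
    import mujoco_py  # fails here if the MuJoCo binaries or license key are missing
    import gym

    env = gym.make("Ant-v3")  # any MuJoCo-backed Gym environment works for this check
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())
    print("MuJoCo is working; observation shape:", obs.shape)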
To train the linearizer in each environment, run one of the following commands:

Command | Environment |
---|---|
`python tests/main.py --train_type linearizer --env ant` | Ant |
`python tests/main.py --train_type linearizer --env half_cheetah` | HalfCheetah |
`python tests/main.py --train_type linearizer --env hopper` | Hopper |
`python tests/main.py --train_type linearizer --env humanoid` | Humanoid |
`python tests/main.py --train_type linearizer --env dkitty_randomized` | D'Kitty Randomized |
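If you want to queue linearizer runs for several environments, a thin wrapper such as the one below can be used. It is a hypothetical convenience script (not part of this repository) that only invokes the commands from the table above; the same pattern applies to the skill discovery and downstream commands that follow.

    # Hypothetical convenience wrapper: runs linearizer training sequentially
    # for several environments using the commands from the table above.
    import subprocess

    ENVS = ["ant", "half_cheetah", "hopper", "humanoid", "dkitty_randomized"]

    for env_name in ENVS:
        subprocess.run(
            ["python", "tests/main.py",
             "--train_type", "linearizer",
             "--env", env_name],
            check=True,  # stop immediately if one of the runs fails
        )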
To perform skill discovery on top of a trained linearizer, run one of the following commands, where `--cp_path` points to the linearizer checkpoint (`sampling_policy.pt`) obtained in the previous step:

Command | Environment |
---|---|
`python tests/main.py --train_type skill_discovery --env ant --cp_path "exp/L_ANT/sampling_policy.pt"` | Ant |
`python tests/main.py --train_type skill_discovery --env half_cheetah --cp_path "exp/L_HC/sampling_policy.pt"` | HalfCheetah |
`python tests/main.py --train_type skill_discovery --env hopper --cp_path "exp/L_HP/sampling_policy.pt"` | Hopper |
`python tests/main.py --train_type skill_discovery --env humanoid --cp_path "exp/L_HUM/sampling_policy.pt"` | Humanoid |
`python tests/main.py --train_type skill_discovery --env dkitty_randomized --cp_path "exp/L_DK/sampling_policy.pt"` | D'Kitty Randomized |
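The `--cp_path` argument points to a serialized linearizer policy (`sampling_policy.pt`). To confirm that a checkpoint file is readable before starting skill discovery, it can be opened with `torch.load`; the snippet below is only a sketch, since the exact type of the stored object depends on this repository's policy classes, and the path shown is an example.

    # Sketch: check that a linearizer checkpoint can be deserialized.
    # The concrete object type depends on the repository's policy classes.
    import torch

    ckpt = torch.load("exp/L_ANT/sampling_policy.pt", map_location="cpu")  # example path
    print("Loaded object of type:", type(ckpt))
    if isinstance(ckpt, dict):
        print("Stored keys:", list(ckpt.keys()))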
To evaluate on the downstream tasks, run one of the following commands, where `--cp_path` points to the linearizer checkpoint and `--dcp_path` to the skill discovery checkpoint (`option_policy.pt`):

Command | Environment |
---|---|
`python tests/main.py --train_type downstream --env ant_goal --cp_path "exp/L_ANT/sampling_policy.pt" --dcp_path "exp/S_ANT/option_policy.pt"` | AntGoal |
`python tests/main.py --train_type downstream --env ant_multi_goals --cp_path "exp/L_ANT/sampling_policy.pt" --dcp_path "exp/S_ANT/option_policy.pt"` | AntMultiGoals |
`python tests/main.py --train_type downstream --env half_cheetah_goal --cp_path "exp/L_CH/sampling_policy.pt" --dcp_path "exp/S_CH/option_policy.pt"` | CheetahGoal |
`python tests/main.py --train_type downstream --env half_cheetah_imi --cp_path "exp/L_CH/sampling_policy.pt" --dcp_path "exp/S_CH/option_policy.pt"` | CheetahImitation |
- Each training command stores its results in an experiment directory under `exp/`.
- In each experiment directory, `plots/` (files) or `tb_plot/` (TensorBoard) contains qualitative visualizations.
- For downstream tasks, the column named `TrainSp/IOD/SmoothedReward500` in `progress.csv` can be examined.
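As a concrete example of the last point, the smoothed downstream reward can be read from `progress.csv` with pandas. The sketch below assumes pandas is available and uses a placeholder experiment directory that should be replaced with the one created by your own run.

    # Sketch: inspect the downstream-task reward column in progress.csv.
    # Replace the experiment directory with the one created by your run.
    import pandas as pd

    df = pd.read_csv("exp/YOUR_DOWNSTREAM_RUN/progress.csv")
    reward = df["TrainSp/IOD/SmoothedReward500"]
    print(reward.describe())
    print("Final smoothed reward:", reward.iloc[-1])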
This code is based on garage.