Code for several RL algorithms used in the following papers:

"Improving Policy Gradient by Exploring Under-appreciated Rewards" by Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans.
"Bridging the Gap Between Value and Policy Based Reinforcement Learning" by Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans.
"Trust-PCL: An Off-Policy Trust Region Method for Continuous Control" by Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans.

Available algorithms:

Actor Critic
TRPO
PCL
Unified PCL
Trust-PCL
PCL + Constraint Trust Region (unpublished)
REINFORCE
UREX

Requirements:

TensorFlow (see http://www.tensorflow.org for how to install/upgrade)
OpenAI Gym (see http://gym.openai.com/docs)
NumPy (see http://www.numpy.org/)
SciPy (see http://www.scipy.org/)

Quick Start:

Run UREX on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.1 --clip_norm=50 \
  --num_samples=10 --objective=urex

Run REINFORCE on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.01 --clip_norm=50 \
  --num_samples=10 --objective=reinforce

Run PCL on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.025 --rollout=10 --critic_weight=1.0 \
  --gamma=0.9 --clip_norm=10 --replay_buffer_freq=1 --objective=pcl

Run PCL with expert trajectories on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.025 --rollout=10 --critic_weight=1.0 \
  --gamma=0.9 --clip_norm=10 --replay_buffer_freq=1 --objective=pcl \
  --num_expert_paths=10

Run a MuJoCo task with TRPO:

python trainer.py --logtostderr --batch_size=25 --env=HalfCheetah-v1 \
  --validation_frequency=5 --rollout=10 --gamma=0.995 \
  --max_step=1000 --cutoff_agent=1000 \
  --objective=trpo --norecurrent --internal_dim=64 --trust_region_p \
  --max_divergence=0.05 --value_opt=best_fit --critic_weight=0.0

To run a MuJoCo task using Trust-PCL (off-policy), use the command given after the list below. It should work well across all environments, provided you search sufficiently over:

(1) max_divergence (0.001, 0.0005, 0.002 are good values),

(2) rollout (1, 5, 10 are good values),

(3) tf_seed (need to average over enough random seeds).

python trainer.py --logtostderr --batch_size=1 --env=HalfCheetah-v1 \
  --validation_frequency=250 --rollout=1 --critic_weight=1.0 --gamma=0.995 \
  --clip_norm=40 --learning_rate=0.0001 --replay_buffer_freq=1 \
  --replay_buffer_size=5000 --replay_buffer_alpha=0.001 --norecurrent \
  --objective=pcl --max_step=10 --cutoff_agent=1000 --tau=0.0 --eviction=fifo \
  --max_divergence=0.001 --internal_dim=256 --replay_batch_size=64 \
  --nouse_online_batch --batch_by_steps --value_hidden_layers=2 \
  --update_eps_lambda --nounify_episodes --target_network_lag=0.99 \
  --sample_from=online --clip_adv=1 --prioritize_by=step --num_steps=1000000 \
  --noinput_prev_actions --use_target_values --tf_seed=57
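
The search over max_divergence, rollout, and tf_seed described above can be scripted. The following is a minimal sketch, not part of this repository: it only builds and prints one trainer.py command per combination, reusing the flags from the Trust-PCL command above. The helper name build_commands and the extra seeds 58 and 59 are assumptions made for illustration; only seed 57 appears in this README.

```python
import itertools

# Flags copied from the Trust-PCL command above, minus the three searched flags
# (--max_divergence, --rollout, --tf_seed), which are filled in per run.
BASE_FLAGS = [
    "--logtostderr", "--batch_size=1", "--env=HalfCheetah-v1",
    "--validation_frequency=250", "--critic_weight=1.0", "--gamma=0.995",
    "--clip_norm=40", "--learning_rate=0.0001", "--replay_buffer_freq=1",
    "--replay_buffer_size=5000", "--replay_buffer_alpha=0.001", "--norecurrent",
    "--objective=pcl", "--max_step=10", "--cutoff_agent=1000", "--tau=0.0",
    "--eviction=fifo", "--internal_dim=256", "--replay_batch_size=64",
    "--nouse_online_batch", "--batch_by_steps", "--value_hidden_layers=2",
    "--update_eps_lambda", "--nounify_episodes", "--target_network_lag=0.99",
    "--sample_from=online", "--clip_adv=1", "--prioritize_by=step",
    "--num_steps=1000000", "--noinput_prev_actions", "--use_target_values",
]

# Values suggested in the README for the first two flags.
MAX_DIVERGENCES = ["0.001", "0.0005", "0.002"]
ROLLOUTS = ["1", "5", "10"]
# Assumption: 57 is the seed used in the README; 58 and 59 are arbitrary extras.
TF_SEEDS = ["57", "58", "59"]


def build_commands(divergences=MAX_DIVERGENCES, rollouts=ROLLOUTS, seeds=TF_SEEDS):
    """Return one argument list per (max_divergence, rollout, tf_seed) combination."""
    commands = []
    for div, rollout, seed in itertools.product(divergences, rollouts, seeds):
        searched = [
            "--max_divergence=" + div,
            "--rollout=" + rollout,
            "--tf_seed=" + seed,
        ]
        commands.append(["python", "trainer.py"] + BASE_FLAGS + searched)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of launching training.
    for cmd in build_commands():
        print(" ".join(cmd))
```

Each printed line can be pasted into a shell or passed to a job scheduler. Results for a given (max_divergence, rollout) pair should then be averaged over the seeds before comparing settings.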

Run a MuJoCo task with PCL constraint trust region:

python trainer.py --logtostderr --batch_size=25 --env=HalfCheetah-v1 \
  --validation_frequency=5 --tau=0.001 --rollout=50 --gamma=0.99 \
  --max_step=1000 --cutoff_agent=1000 \
  --objective=pcl --norecurrent --internal_dim=64 --trust_region_p \
  --max_divergence=0.01 --value_opt=best_fit --critic_weight=0.0 \
  --tau_decay=0.1 --tau_start=0.1

Maintained by Ofir Nachum (ofirnachum).