This is my implementation in TensorFlow of the Advantage Actor Critic (A2C) algorithm, a reinforcement learning algorithm that can learn to control high-dimensional continuous action spaces to maximize long-term reward in the agent's environment. Scroll down for a video of the results.
The agent is controlled by a deep convolutional policy network μ(s), which maps each state to a specific action. The policy is updated to maximize the expected return predicted by the action-value network Q(s,a). In each state s_t the agent takes action a_t and receives a scalar reward r_t whilst transitioning to state s_{t+1}. The state-action value can then be learned from the recursive relationship:
Q^μ(s_t, a_t) = E_{s_{t+1} ∼ E}[ r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1})) ]
Here γ is the discount factor and the expectation is taken over next states drawn from the environment E.
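The recursive relationship above can be turned into a regression target for the critic. The following is a minimal NumPy sketch of that one-step target, not the code in rltf.py: the function name `td_targets`, the `dones` mask that stops bootstrapping at episode ends, and the value of `gamma` are all assumptions made for illustration.

```python
import numpy as np

def td_targets(rewards, next_q, dones, gamma=0.99):
    """One-step targets r + gamma * Q(s', mu(s')).

    `dones` is an assumed extra: terminal transitions do not bootstrap.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q = np.asarray(next_q, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + gamma * not_done * next_q

# Toy batch of three transitions; the last one ends the episode.
rewards = [1.0, 0.0, -1.0]
next_q = [2.0, 4.0, 10.0]     # stand-in for the critic's value of the next state
dones = [False, False, True]
targets = td_targets(rewards, next_q, dones, gamma=0.5)
print(targets)
```

The critic would then be trained to regress its prediction for the current state and action towards `targets`, and the difference between the two is the TD error.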
- Each action is repeated for 3 simulation timesteps, which allows the agent to infer velocities from frame differences.
- 3 convolutional layers shared by both critic and actor networks.
- 2 fully-connected layers of 500 nodes per critic and actor network.
- Batch normalization before the ReLU activation of every hidden layer.
- Linear output layer for critic, and tanh/softmax activations for actor output.
- Target networks are used to give a stable target Q-value, which prevents the value network from diverging.
- Minibatches are sampled from experience replay memory which is stored in a circular buffer of recorded sequences.
- Experience replay sampling is prioritized according to TD error, so that learning is focused on samples with the most unexpected return value.
- Sequences can either be created by following the policy, or supplied by human expert demonstration to bootstrap the learning process.
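To make the replay-memory items in the list above more concrete, here is a small sketch of a circular buffer whose sampling probability is proportional to the absolute TD error. It is an illustration under assumptions rather than the implementation in this repository: the class name `PrioritizedReplay`, the `eps` offset, and the choice to store arbitrary Python objects (where the real buffer stores recorded sequences) are all hypothetical.

```python
import numpy as np

class PrioritizedReplay:
    """Fixed-size circular buffer that samples in proportion to |TD error|."""

    def __init__(self, capacity, eps=1e-3, seed=0):
        self.capacity = capacity
        self.items = [None] * capacity
        self.priorities = np.zeros(capacity)
        self.eps = eps          # assumed offset so no stored item has zero probability
        self.next_slot = 0      # write position, wraps around when the buffer is full
        self.size = 0
        self.rng = np.random.default_rng(seed)

    def add(self, item, td_error):
        # Overwrites the oldest entry once the buffer has wrapped around.
        self.items[self.next_slot] = item
        self.priorities[self.next_slot] = abs(td_error) + self.eps
        self.next_slot = (self.next_slot + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        p = self.priorities[:self.size]
        idx = self.rng.choice(self.size, size=batch_size, p=p / p.sum())
        return idx, [self.items[i] for i in idx]

    def update(self, idx, td_errors):
        # Refresh priorities after the critic has been re-evaluated on a batch.
        self.priorities[idx] = np.abs(td_errors) + self.eps

buf = PrioritizedReplay(capacity=3)
for step, err in enumerate([0.1, 0.1, 0.1, 5.0]):
    buf.add({"step": step}, err)    # the 4th add overwrites slot 0
idx, batch = buf.sample(batch_size=1000)
```

In this toy run the entry with the large TD error dominates the sampled batch. A full prioritized replay scheme usually also corrects the resulting sampling bias with importance weights, which this sketch leaves out.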
On a GTX 1070, the following policy is learnt in the OpenAI Gym environment CarRacing-v0:
I am experimenting with adding self-attention modules after the convolutional layers. Self-attention should allow the network to learn global relationships between entities in the scene, whereas purely convolutional networks learn only local coincidence detectors and provide translation invariance. This could help the RL agent generalize better to changes in the environment.
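For readers unfamiliar with the idea, the sketch below shows generic single-head scaled dot-product self-attention applied to the spatial positions of a convolutional feature map, written in NumPy. It is not the module used in this project: the function names, the projection sizes, and the absence of positional encodings are all assumptions chosen to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(features, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over spatial positions.

    features: (H, W, C) feature map; w_q, w_k, w_v: (C, D) projection matrices.
    Returns the attended (H, W, D) map and the (H*W, H*W) attention weights.
    """
    h, w, c = features.shape
    x = features.reshape(h * w, c)             # one row per spatial position
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return (weights @ v).reshape(h, w, -1), weights

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))          # stand-in for conv layer output
w_q, w_k, w_v = (rng.normal(size=(8, 6)) * 0.1 for _ in range(3))
out, weights = self_attention(features, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Every output position is a weighted mix of all input positions, which is what lets the layer relate distant parts of the frame, in contrast to a convolution's fixed local window.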
pip install gym # includes CarRacing-v0
pip install tensorflow-gpu
Learning is performed on the server, which accepts a client connection for each agent. First we start the server:
export ENV=CarRacing-v0
python rltf.py
tensorboard --logdir=/tmp/tf
Then we start one client instance that is on-policy, and one that takes an exploratory policy:
python rltf.py --inst 1 & python rltf.py --inst 2 --sample_action 0.1
# FlappyBird requires PLE and gym-ple:
export ENV=FlappyBird-v0
git clone https://github.com/ntasfi/PyGame-Learning-Environment.git
cd PyGame-Learning-Environment/
pip install -e .
git clone https://github.com/lusob/gym-ple.git
cd gym-ple/
pip install -e .
# AntBulletEnv requires pybullet:
export ENV=AntBulletEnv-v0
pip install pybullet