Using Double Deep Q-Networks (Double DQN) with experience replay to solve CartPole-v0 in just 184 episodes. Implemented using TensorFlow 2.
- Double DQN, which applies Double Q-learning to deep Q-networks, was introduced in this paper by DeepMind.
- The main idea is to use two estimators instead of just one, thereby decoupling action selection from target estimation.
- Experience replay is used to reduce correlation between training samples for the agent.
- This also improves data efficiency, since previous experiences are reused to train the agent.
- Instead of hard-updating the target network (copying the primary network's parameters into it), Polyak averaging is used to "blend" the target network with the primary network.
- CartPole-v0 defines "solving" as getting an average reward of at least 195.0 over 100 consecutive trials.
- Only the primary network is trained; the target network is just "soft" updated from the primary network once per episode.
- I find that a batch size of 16 is faster than the usual 32.
- Learning rate decay is key to ensuring convergence. Without it, the agent reaches the maximum reward in individual episodes but never solves the task:
- The initial value of epsilon is set to 0.5. It decays as the agent is trained, down to a minimum value of 0.01.
- The `Agent` class contains the implementation; training is done in cartpole.ipynb (on Google Colab).
- CartPole-v0 is solved after 184 training episodes.
- Graphs of the reward and of the mean reward over 100 consecutive trials:
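
The core pieces described in the list above (replay buffer, Double Q target, Polyak update, epsilon decay) can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the `Agent` class: the names `ReplayBuffer`, `double_q_target`, `polyak_update` and `decay_epsilon` are hypothetical, and the values of `GAMMA`, `TAU` and `EPS_DECAY` are assumptions. Only the batch size of 16 and the epsilon range of 0.5 to 0.01 come from the notes above.

```python
import random
from collections import deque

BATCH_SIZE = 16                 # from the notes above
EPS_START, EPS_MIN = 0.5, 0.01  # from the notes above
GAMMA = 0.99                    # discount factor (assumed value)
TAU = 0.01                      # Polyak blending coefficient (assumed value)
EPS_DECAY = 0.99                # per-episode epsilon decay rate (assumed value)


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest entries are dropped first

    def add(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size=BATCH_SIZE):
        # Uniform random sampling reduces correlation between training samples.
        return random.sample(list(self.memory), batch_size)


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def double_q_target(reward, done, q_primary_next, q_target_next, gamma=GAMMA):
    """Double Q target: the primary network selects the next action,
    the target network supplies the value of that action."""
    if done:
        return reward
    best_action = argmax(q_primary_next)
    return reward + gamma * q_target_next[best_action]


def polyak_update(target_weights, primary_weights, tau=TAU):
    """Soft update: target <- tau * primary + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * t
            for p, t in zip(primary_weights, target_weights)]


def decay_epsilon(epsilon, rate=EPS_DECAY, minimum=EPS_MIN):
    """Multiplicative decay, clipped at the minimum exploration rate."""
    return max(minimum, epsilon * rate)
```

In a Keras-based agent, the same blend could presumably be applied to the arrays returned by each model's `get_weights()` and written back with `set_weights()`. With `tau=1.0` the soft update reduces to the hard copy it replaces.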