In this repository we implement an agent that is trained to play the Lunar Lander game using i) an actor-critic algorithm, and ii) a (double) deep Q-learning algorithm. Here is a video of a trained agent playing the game:
video.mp4
We use the Lunar Lander implementation from gymnasium. For the actor-critic algorithm we loosely follow Ref. [1]; for deep Q-learning we follow Ref. [2], and for double deep Q-learning we follow Ref. [3].
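To make the difference between Refs. [2] and [3] concrete, here is a minimal sketch of the two target computations. It is an illustration only, not the code from this repository: it assumes PyTorch, networks `q_net` (online) and `target_net` that map a batch of states to per-action values, and batched tensors `rewards`, `next_states`, `dones`.

```python
# Minimal sketch of the DQN [2] and double DQN [3] targets.
# Assumptions (not taken from this repository): PyTorch networks mapping a
# batch of states to per-action values, and tensors rewards, next_states,
# dones (0./1. flags) of matching batch size.
import torch


def dqn_target(target_net, rewards, next_states, dones, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the greedy action.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * max_next_q


def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    # Double DQN: the online network selects the action, the target network
    # evaluates it, which reduces the overestimation bias of standard DQN.
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```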
In the following, we first list the files contained in this repository and explain their usage. We then compare the training speed and post-training performance of agents trained using the actor-critic algorithm and deep Q-learning.
- agent_class.py: In this python file we implement the agent class, which we use both for training an agent and for acting with the trained agent
- train and visualize agent.ipynb: In this Jupyter notebook we train an agent and subsequently create a gameplay video. The video at the beginning of this README was created with this notebook
- train_agent.py: In this python script we train an agent and save both the trained agent's parameters and its training statistics to disk
- run_agent.py: In this python script we run episodes for an already trained agent and save statistics (duration and return of each episode) to disk; a minimal sketch of such an episode loop is shown after this list
- trained_agents/batch_train_and_run.sh: With this bash script we train 500 agents (via train_agent.py) and subsequently run 1000 episodes for each trained agent (via run_agent.py). The script by default runs 8 processes in parallel.
- trained_agents/plot_results.ipynb: In this Jupyter notebook we analyze the training statistics and the performance of the trained agents from batch_train_and_run.sh; the results are summarized in the remainder of this README
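As an illustration of what running a trained agent amounts to, here is a minimal episode loop with gymnasium. The random action stands in for the trained agent's policy, since the actual interface of agent_class.py is not reproduced in this README; depending on your gymnasium version the environment id may be "LunarLander-v2" or "LunarLander-v3", and the box2d extra of gymnasium must be installed.

```python
# Minimal sketch of one Lunar Lander episode with gymnasium.
# The random action below stands in for a trained agent's policy; the actual
# method names in agent_class.py are not assumed here.
import gymnasium as gym

env = gym.make("LunarLander-v2")  # or "LunarLander-v3", depending on the gymnasium version
state, info = env.reset(seed=0)
episode_return, steps, done = 0.0, 0, False
while not done:
    action = env.action_space.sample()  # a trained agent would choose the action here
    state, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    steps += 1
    done = terminated or truncated
print(f"episode length: {steps} steps, return: {episode_return:.1f}")
env.close()
```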
With the script batch_train_and_run.sh we first train 500 agents and then run 1000 episodes for each agent (a rough Python sketch of such a parallel batch follows the list below), using
- the actor-critic algorithm, and
- the deep Q-learning (DQN) algorithm.
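The batch script itself is written in bash; a rough Python equivalent of launching several trainings in parallel could look as follows. The bare invocation of train_agent.py is an assumption, since its command-line interface is not shown in this README; in practice one would likely pass arguments such as an output path per run.

```python
# Rough sketch of running trainings in parallel (the repository uses the bash
# script batch_train_and_run.sh for this). The exact command-line interface of
# train_agent.py is assumed, not taken from the script.
import subprocess
import sys
from concurrent.futures import ProcessPoolExecutor

N_AGENTS = 500   # number of agents to train
N_WORKERS = 8    # number of trainings run in parallel

def train_one(run_index: int) -> int:
    # Launch one training run as a separate process; in practice one would
    # pass run_index to train_agent.py so each run writes to its own files.
    completed = subprocess.run([sys.executable, "train_agent.py"], check=False)
    return completed.returncode

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=N_WORKERS) as pool:
        codes = list(pool.map(train_one, range(N_AGENTS)))
    print(f"{codes.count(0)} of {N_AGENTS} trainings exited successfully")
```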
Here is a plot showing the distribution of the number of episodes needed for training in each scenario, along with the mean:
We observe that the distribution of the number of training episodes is more spread out for the actor-critic method. Furthermore, the actor-critic algorithm on average needed 28% more episodes to complete training than the DQN algorithm.
Here is a plot showing the wall-clock runtime distribution of the respective 500 trainings:
On average, the actor-critic algorithm takes 67% longer to train than deep Q-learning. Note that the actor-critic algorithm has twice as many parameters as the Q-learning algorithm: all neural networks we use are of equal size, and in the actor-critic algorithm we train two of them (namely the actor and the critic).
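As a rough illustration of this parameter-count argument (the layer sizes below are assumptions, not the architecture used in agent_class.py; the critic here differs from the actor only in its output layer), the actor-critic setup carries two networks of essentially the same layout while DQN trains one:

```python
# Illustration of the parameter-count argument; the layer sizes are assumed,
# not taken from agent_class.py. Lunar Lander has an 8-dimensional state and
# 4 discrete actions.
import torch.nn as nn

def make_net(n_in=8, n_hidden=64, n_out=4):
    return nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(),
                         nn.Linear(n_hidden, n_hidden), nn.ReLU(),
                         nn.Linear(n_hidden, n_out))

def n_params(net):
    return sum(p.numel() for p in net.parameters())

q_net = make_net()                             # DQN trains one network
actor, critic = make_net(), make_net(n_out=1)  # actor-critic trains two
print("DQN:         ", n_params(q_net))
print("actor-critic:", n_params(actor) + n_params(critic))
```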
For each trained agent, we run 1000 episodes. Here is a plot of the resulting distribution of all episode returns for each scenario:
We observe that the distributions are rather similar. While the DQN agents perform slightly better on average (mean return of 227.4 for DQN vs. 211.6 for actor-critic), the support of the actor-critic return distribution extends a bit further to the right (towards higher returns).
To investigate this last point further, we select for each algorithm the agent that yielded the highest average return over its 1000 episodes, and plot the corresponding return distributions:
We see that the best actor-critic agent performs slightly better than the best DQN agent (mean return 261.9 vs. 238.0).
We summarize:
- Compared to the actor-critic algorithm, the DQN algorithm yielded both a smaller mean and a smaller variance for the number of episodes needed for successful training. On average, the DQN algorithm also needed less wall-clock time for training.
- The mean return of the trained DQN agents was slightly larger than that of the actor-critic agents.
- However, the single best DQN agent performed slightly worse than the single best actor-critic agent.
So overall, while the DQN algorithm on average trains faster and yields better mean performance, in our sample the single best agent was obtained from the actor-critic algorithm.
In a more thorough study one might want to increase the number of trained agents to see whether our best agent being an actor-critic agent was just a random fluctuation. Furthermore, one might want to vary the training hyperparameters so as to optimize the number of training episodes for each algorithm.
[1] Reinforcement Learning: An Introduction. Richard S. Sutton, Andrew G. Barto. http://incompleteideas.net/book/the-book.html.
[2] Playing Atari with Deep Reinforcement Learning. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller. arXiv:1312.5602.
[3] Deep Reinforcement Learning with Double Q-learning. Hado van Hasselt, Arthur Guez, David Silver. arXiv:1509.06461.