This repository contains an implementation of a Proximal Policy Optimization (PPO) agent that controls a humanoid in the Gymnasium MuJoCo environment. The agent is trained to master complex humanoid locomotion using deep reinforcement learning.
The clip above showcases the performance of the PPO agent in the `Humanoid-v5` environment after about 1000 epochs of training.
To get started with this project, follow these steps:
- Clone the Repository:

  ```bash
  git clone https://github.com/ProfessorNova/PPO-Humanoid.git
  cd PPO-Humanoid
  ```
- Set Up Python Environment: Make sure you have Python installed (tested with Python 3.10.11). It's recommended to create a virtual environment to avoid dependency conflicts. You can use `venv` or `conda` for this purpose (see the example after these steps).
- Install Dependencies: Run the following command to install the required packages:

  ```bash
  pip install -r req.txt
  ```

  For a proper PyTorch installation, visit pytorch.org and follow the instructions for your system configuration.
- Install Gymnasium MuJoCo: You need to install the MuJoCo environment to simulate the humanoid:

  ```bash
  pip install gymnasium[mujoco]
  ```
- Train the Model: To start training the model, run:

  ```bash
  python train_ppo.py
  ```

  This creates the folders `checkpoints`, `logs`, and `videos` in the root of the repository. The `checkpoints` folder will contain the model checkpoints, the `logs` folder will contain the TensorBoard logs, and the `videos` folder will contain recorded videos of the agent's performance.
- Monitor Training Progress: You can monitor the training progress by viewing the videos in the `videos` folder or by looking at the graphs in TensorBoard:

  ```bash
  tensorboard --logdir "logs"
  ```
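If you are unsure about the virtual-environment step above, a minimal `venv` workflow looks like this (standard Python tooling, shown here for convenience):

```bash
# Create and activate a virtual environment (Unix-like shells;
# on Windows, run venv\Scripts\activate instead of the source line)
python -m venv venv
source venv/bin/activate
```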
To run the pre-trained PPO model, execute the following command (make sure you followed the installation steps above):

```bash
python test_ppo.py
```

This will load the pre-trained model from the root of the repository (`model.pt`) and run it in the `Humanoid-v5` environment.
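For context, the script presumably restores the weights with PyTorch's standard loading utilities. A minimal sketch of inspecting such a checkpoint (assuming `model.pt` was written with `torch.save`; the actual format depends on how `train_ppo.py` saves it):

```python
import torch

# Load the checkpoint from the repository root; map_location makes this
# work on CPU-only machines. Whether model.pt holds a full module or a
# state dict depends on how it was saved.
checkpoint = torch.load("model.pt", map_location="cpu")
print(type(checkpoint))
```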
You can customize the training by modifying the command-line arguments:

```bash
python train_ppo.py --n-envs <number_of_envs> --n-epochs <number_of_epochs> ...
```

All hyperparameters can be viewed either with `python train_ppo.py --help` or by looking at the `parse_args_ppo` function in `lib/utils.py`.
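For illustration, `parse_args_ppo` is presumably a standard `argparse` setup along these lines. This is a sketch only: apart from `--n-envs` and `--n-epochs` from the usage example above, the defaults and help texts here are assumptions, so check `--help` for the authoritative list:

```python
import argparse

def parse_args_ppo():
    # Sketch of a typical argparse-based hyperparameter parser; only the
    # flag names --n-envs and --n-epochs come from the usage example above,
    # and the defaults here are illustrative.
    parser = argparse.ArgumentParser(description="Train a PPO agent on Humanoid-v5")
    parser.add_argument("--n-envs", type=int, default=16,
                        help="number of parallel environments")
    parser.add_argument("--n-epochs", type=int, default=1000,
                        help="number of training epochs")
    return parser.parse_args()
```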
The training process mainly involves the following components:
- `lib/agent_ppo.py`: Contains the PPO agent implementation, including the policy and value networks and the methods for sampling actions and obtaining log probabilities, entropies, and value estimates (see the agent sketch after this list).
- `lib/buffer_ppo.py`: Implements the rollout buffer that stores experiences and samples batches for training. It also computes advantages via Generalized Advantage Estimation (GAE) (see the GAE sketch after this list).
- `lib/utils.py`: Contains utility functions for parsing command-line arguments, setting up the environment, and creating recordings of the agent's performance.
- `train_ppo.py`: The main script for training the PPO agent. It initializes the environment, agent, and buffer, and handles the training loop (see the loss sketch after this list).
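To make the agent component concrete, here is a minimal actor-critic sketch of the kind of module `lib/agent_ppo.py` implements. This is an illustration only, not the repository's actual code; the layer sizes and the Gaussian policy parameterization are assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class AgentSketch(nn.Module):
    """Illustrative actor-critic module (not the repository's exact code)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        # Policy head outputs the mean of a Gaussian over actions; the
        # (state-independent) log standard deviation is a free parameter.
        self.policy = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        # Value head estimates V(s) for advantage computation.
        self.value = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def act(self, obs: torch.Tensor):
        # Sample an action and return its log probability, the policy
        # entropy, and the value estimate for the given observation.
        dist = Normal(self.policy(obs), self.log_std.exp())
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)
        entropy = dist.entropy().sum(-1)
        return action, log_prob, entropy, self.value(obs).squeeze(-1)
```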
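The advantage computation in `lib/buffer_ppo.py` likely follows the standard backward GAE recursion. A self-contained sketch (gamma = 0.99 and lam = 0.95 are the usual defaults, not necessarily the repository's values):

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (illustrative).

    rewards, values, dones: 1-D tensors of length T; last_value: V(s_T).
    """
    advantages = torch.zeros_like(rewards)
    next_value, next_advantage = last_value, 0.0
    for t in reversed(range(rewards.shape[0])):
        not_done = 1.0 - dones[t]
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        next_advantage = delta + gamma * lam * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    returns = advantages + values  # regression targets for the value network
    return advantages, returns
```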
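Finally, the core of the update in `train_ppo.py` is presumably the standard PPO clipped surrogate objective, which reduces to a few lines of PyTorch (again a sketch; the clipping coefficient 0.2 is the common default and not confirmed from the repository):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the current and the data-collecting policy.
    ratio = (new_log_probs - old_log_probs).exp()
    # Clipped surrogate objective: take the pessimistic (minimum) estimate.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```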
The following charts provide insights into the performance during training with the current default hyperparameters:
The average reward per step essentially indicates how fast the humanoid is moving.
The graph starts with a quick increase in reward, which is expected as the agent learns not to fall over immediately. After that, the reward stays relatively stable, with some fluctuations. After about 500 epochs it starts to increase significantly, indicating that the agent has learned to walk and keeps trying to move faster.
A temporary drop in reward does not necessarily mean that the agent is performing worse. It can also be due to the agent learning to stabilize and therefore not moving as fast per step.
The knowledge to implement this project was mainly acquired from the following book:
- *Deep Reinforcement Learning Hands-On* by Maxim Lapan