This project focuses on training an agent using Proximal Policy Optimization (PPO) within the Bipedal Walker environment. The environment simulates a bipedal robot with 4 joints and 2 legs, challenging the agent to traverse rough terrain. Both normal and hardcore modes are implemented.
- About Bipedal Walker
- Project Structure
- Training Process
  - Normal and Hardcore modes with PPO
- Environment Setup
  - make_env()
  - observe_model()
- Model Evaluation
- Training Logs and Analysis
- Improvements
- Installation Requirements
- Credits
The Bipedal Walker environment, based on the Box2D physics engine, simulates a bipedal robot navigating various terrains. The challenge for the agent is to maintain balance, coordination, and locomotion in the face of obstacles.
- Observation Space: 24 continuous values, including the hull angle and angular velocity, horizontal and vertical velocities, joint angles and speeds, leg-ground contact flags, and 10 LIDAR rangefinder readings (see the snippet after this list).
- Action Space: 4 continuous values controlling the torque applied to the hip and knee joints.
- Rewards: Positive for forward movement, negative for excessive joint torque and falling.
- Termination: When the agent falls or exceeds the step limit (1600 steps for normal mode, 2000 for hardcore).
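For reference, these spaces can be inspected directly with Gymnasium (assuming `gymnasium` is installed with Box2D support):

```python
import gymnasium as gym

env = gym.make("BipedalWalker-v3")   # or "BipedalWalkerHardcore-v3"
print(env.observation_space)          # 24-dimensional continuous Box
print(env.action_space)               # Box(-1.0, 1.0, (4,), float32)

obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```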
- main.py: Contains the training loop for both normal and hardcore modes.
- env_utils.py: A utility script that configures the Bipedal Walker environment with optional features such as frame stacking, video recording, and reward normalization.
- logs/: Directory for storing training logs.
- models/: Directory for saving trained PPO models.
- videos/: If enabled, recorded video episodes will be saved here.
- Timesteps: 1 million
- Environment: Standard Bipedal Walker (`BipedalWalker-v3`)
- Techniques: Vectorized environments, reward normalization, frame stacking, video recording.
- Model: PPO with a Multi-Layer Perceptron (MLP) policy.
- Timesteps: 5 million
- Environment: Hardcore Bipedal Walker (`BipedalWalkerHardcore-v3`)
- Techniques: Same as normal mode, with more challenging terrain and increased training duration.
The training uses Stable Baselines3's PPO algorithm and runs with vectorized environments for parallel training.
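The following is a condensed sketch of that setup using Stable Baselines3 utilities directly; `main.py` builds the environment through the `make_env()` helper described below, and the exact hyperparameters may differ.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack, VecNormalize

# Use BipedalWalkerHardcore-v3 and ~5,000,000 timesteps for hardcore mode.
venv = make_vec_env("BipedalWalker-v3", n_envs=4)
vec_normalize = VecNormalize(venv, norm_obs=True, norm_reward=True, clip_obs=10.0)
venv = VecFrameStack(vec_normalize, n_stack=4)

model = PPO("MlpPolicy", venv, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("models/ppo_bipedalwalker_1M")
# Save the normalization statistics so evaluation can reuse them (file name is illustrative).
vec_normalize.save("models/vecnormalize_1M.pkl")
```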
The environment is set up using two key functions from the env_utils.py script:
The `make_env()` function prepares the environment for training and evaluation with several configurable options (a rough sketch of the function appears after the usage example below):

- Environment Creation: By default, the environment created is `BipedalWalker-v3`. However, you can enable the hardcore mode by passing `hardcore=True` to switch to `BipedalWalkerHardcore-v3`.
- Render Mode: The environment can be rendered in different modes, such as `"human"` for real-time visualization or `"rgb_array"` for video recording.
- Video Recording: If `record_video=True` is set, the environment records a video every 1000 steps and saves the recordings in the specified folder.
- Monitor: The environment can be wrapped with a monitor to log performance metrics such as rewards and episode lengths. These logs are useful for analyzing the training process later.
- Vectorized Operations: To speed up training, `DummyVecEnv` is used to run multiple environment instances behind a single vectorized interface.
- Observation & Reward Normalization: The environment is wrapped with `VecNormalize` to stabilize training by normalizing both observations and rewards, which helps the agent learn more effectively.
- Frame Stacking: The last `n` frames (by default, 4) can be stacked using `VecFrameStack`, giving the agent temporal context, which is crucial for environments like Bipedal Walker that require an understanding of movement dynamics over time.
- Clip Observations: You can clip observations to avoid outliers during training by setting `clip_obs` to a certain value (default: 10.0).
env = make_env(env_name="BipedalWalker-v3", hardcore=True, record_video=True, use_monitor=True)
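The actual implementation lives in `env_utils.py`; the sketch below only illustrates how such a helper could compose Gymnasium and Stable Baselines3 wrappers, with the signature and defaults assumed from the description above.

```python
import gymnasium as gym
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack, VecNormalize

def make_env(env_name="BipedalWalker-v3", hardcore=False, render_mode=None,
             record_video=False, video_folder="videos/", use_monitor=False,
             n_stack=4, clip_obs=10.0):
    """Hypothetical reconstruction of env_utils.make_env()."""
    def _init():
        env = gym.make("BipedalWalkerHardcore-v3" if hardcore else env_name,
                       render_mode="rgb_array" if record_video else render_mode)
        if record_video:
            # Start a recording every 1000 environment steps.
            env = gym.wrappers.RecordVideo(env, video_folder,
                                           step_trigger=lambda step: step % 1000 == 0)
        if use_monitor:
            env = Monitor(env)  # logs episode rewards and lengths
        return env

    venv = DummyVecEnv([_init])
    venv = VecNormalize(venv, norm_obs=True, norm_reward=True, clip_obs=clip_obs)
    venv = VecFrameStack(venv, n_stack=n_stack)
    return venv
```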
The `observe_model()` function loads a trained PPO model and evaluates it in the specified environment. It automatically checks whether `VecNormalize` and `VecFrameStack` were used during training and applies them accordingly (a sketch appears after the usage example below).

- Model Loading: The trained model is loaded from the specified file path.
- Environment Setup: Depending on whether hardcore mode is enabled, either `BipedalWalker-v3` or `BipedalWalkerHardcore-v3` is selected.
- VecNormalize & VecFrameStack: If these wrappers were used during training, they are applied to the evaluation environment to ensure consistent behavior.
- Evaluation: The model is evaluated over a specified number of episodes, and the mean and standard deviation of the rewards are returned.
mean_reward, std_reward = observe_model(model_path='models/ppo_bipedalwalker_1M', n_eval_episodes=5, hardcore=False)
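A rough sketch of how such an evaluation helper could look with Stable Baselines3; names such as the `vecnormalize_path` argument are assumptions, not part of the actual `env_utils.py` API.

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack, VecNormalize

def observe_model(model_path, n_eval_episodes=5, hardcore=False,
                  vecnormalize_path=None, n_stack=4):
    """Hypothetical reconstruction of env_utils.observe_model()."""
    env_id = "BipedalWalkerHardcore-v3" if hardcore else "BipedalWalker-v3"
    env = DummyVecEnv([lambda: gym.make(env_id, render_mode="human")])

    # Re-apply the wrappers used during training so observations match.
    if vecnormalize_path is not None:
        env = VecNormalize.load(vecnormalize_path, env)
        env.training = False      # freeze the running statistics
        env.norm_reward = False   # report raw rewards during evaluation
    env = VecFrameStack(env, n_stack=n_stack)

    model = PPO.load(model_path)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=n_eval_episodes)
    env.close()
    return mean_reward, std_reward
```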
This setup ensures that the environment is optimized for both training and evaluation, providing flexibility with advanced features like video recording, reward normalization, and frame stacking.
Model evaluation is performed across multiple episodes using the `observe_model()` function, which loads the trained model and runs it in human-render mode for visualization.
- Normal Mode: average reward 248.39 ± 112.10
- Hardcore Mode (3M): average reward -28.23 ± 24.82
- Hardcore Mode (5M): average reward -10.66 ± 3.91
- Hardcore Mode (7M): average reward -5.45 ± 2.10
These results show that the agent performs relatively well in the normal environment but struggles in the hardcore version, where further training or parameter tuning may be needed.
Training logs from the 5-million-timestep hardcore run are analyzed for insights into agent performance:
- Reward Trend: The reward shows fluctuations but tends to stabilize over time.
- Episode Length Trend: The agent consistently learns to survive longer as training progresses, though there are occasional dips.
- Correlation: A strong positive correlation (0.89) between reward and episode length, indicating that the longer the agent survives, the more reward it earns.
Visualizations such as reward trends and episode-length moving averages are generated using `pandas` and `matplotlib`.
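For illustration, an analysis of this kind could be done on a Stable Baselines3 Monitor log roughly as follows (the `logs/monitor.csv` path and the window size are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# SB3 Monitor files start with a JSON comment line, hence skiprows=1;
# the remaining columns are r (episode reward), l (episode length), t (wall time).
df = pd.read_csv("logs/monitor.csv", skiprows=1)

df["reward_ma"] = df["r"].rolling(window=50).mean()
df["length_ma"] = df["l"].rolling(window=50).mean()
print("Reward / episode-length correlation:", df["r"].corr(df["l"]).round(2))

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(df["reward_ma"]); ax1.set_ylabel("Reward (moving avg)")
ax2.plot(df["length_ma"]); ax2.set_ylabel("Episode length (moving avg)")
ax2.set_xlabel("Episode")
plt.tight_layout()
plt.show()
```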
Recommendations for improving the agent's performance:
- Adjust Learning Rate: A smaller learning rate may lead to more stable improvements (see the sketch after this list).
- Reward Restructuring: Incentivize the agent to prioritize survival and balance over forward movement.
- Increased Exploration: A larger entropy bonus or curiosity-driven exploration can help the agent learn more diverse strategies.
- Extended Training: Additional timesteps can provide the agent with more experience and lead to better policies.
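As a sketch of the first suggestion, the learning rate and entropy coefficient can be passed directly to the PPO constructor; the values below are illustrative, not tuned.

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    venv,                 # the vectorized environment from the training sketch above
    learning_rate=1e-4,   # SB3 default is 3e-4; smaller tends to stabilize late training
    ent_coef=0.01,        # default is 0.0; a small entropy bonus encourages exploration
    verbose=1,
)
model.learn(total_timesteps=5_000_000)
```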
To install the necessary dependencies, use the provided `requirements.txt`:
pip install -r requirements.txt
Dependencies include:

- Python 3.8+
- `gymnasium` (with Box2D support) for the environment
- `stable-baselines3` for the PPO implementation
- `pandas` and `matplotlib` for log analysis and visualizations
This project is based on the work of Oleg Klimov, adapted for PPO training using Stable Baselines3.