Using PyBullet to simulate quadrupedal robotic movement and PyTorch to implement an RL DDPG agent in the simulation.
Using ROS2 Humble to port actions to the real robot over UART.
STATUS: I have been working on this project for 3 months now; I am taking a break and will hopefully revisit it with further improvements in the near future!
The motivation for this project is robotics exploration and control-system development. Robotics is fascinating, and since this is the first legitimate robot I have created and implemented by myself, it brought engineering design challenges I had not encountered in my previous experiences.
This project was a collaboration with my roommate; the majority of the CAD components and the assembly were done by him. Using RL for this task (walking in a straight line on a flat surface) is excessive, but the point of using RL is the potential to improve it in the future. The model could be extended to navigate different terrains and different degrees of motion, similar to the Solo12 or other quadruped implementations.
The reward function is based on the position of the chassis. Unlike the Solo12 or other high-end quadrupeds that use joint-independent reward functions (hierarchical policy gradients), this quadruped uses the chassis' RPY plus its velocity.
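As a rough illustration (not the exact reward used in training; the weights and target speed are placeholders), such a reward can be computed from the chassis state that PyBullet exposes:

```python
import pybullet as p

def chassis_reward(robot_id):
    """Illustrative chassis-based reward: forward progress, stay upright, don't drift."""
    _, orn = p.getBasePositionAndOrientation(robot_id)
    roll, pitch, yaw = p.getEulerFromQuaternion(orn)   # chassis RPY
    lin_vel, _ = p.getBaseVelocity(robot_id)           # chassis linear velocity (x, y, z)

    forward = lin_vel[0]                               # reward forward velocity
    upright = -(abs(roll) + abs(pitch))                # penalize tipping
    drift = -(abs(lin_vel[1]) + abs(yaw))              # penalize sideways drift / turning
    return 1.0 * forward + 0.5 * upright + 0.2 * drift  # placeholder weights
```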
DDPG (Deep Deterministic Policy Gradient) is the approach for this problem. It is an actor-critic algorithm: the actor is a policy network trained with policy gradients and the critic is a Q-network (a DQN-style network). The actor trains off the critic's Q-value estimates and the rewards of its actions. DDPG supports multiple continuous actions, which is perfect for a robotics application such as making a quadruped walk.
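For concreteness, a minimal PyTorch sketch of such an actor-critic pair might look like the following; the layer sizes and the 15-dimensional observation (3 RPY values + 12 joint positions) are assumptions, not the exact networks used here.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 15, 12  # assumed: chassis RPY + 12 joint positions -> 12 joint targets

class Actor(nn.Module):
    """Deterministic policy: maps an observation to one continuous action per joint."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM), nn.Tanh(),  # bounded output, rescaled to joint limits later
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Q-network: scores a (state, action) pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))
```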
The agent is provided the chassis RPY and the joint positions. It generates an action consisting of one position-control value per DOF every 60 timesteps, i.e. every 0.25 seconds. The simulation resets every 10 seconds (each episode is 10 seconds).
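That cadence looks roughly like the loop below, assuming PyBullet's default 1/240 s timestep (so 60 steps = 0.25 s); `get_observation` and `apply_action` are hypothetical placeholders, not functions from this repo.

```python
import pybullet as p

STEPS_PER_ACTION = 60      # 60 * (1/240 s) = 0.25 s between agent actions
EPISODE_STEPS = 10 * 240   # 10-second episode at the default timestep

def run_episode(robot_id, agent):
    for step in range(EPISODE_STEPS):
        if step % STEPS_PER_ACTION == 0:
            obs = get_observation(robot_id)  # placeholder: chassis RPY + joint positions
            action = agent.act(obs)          # 12 position targets
            apply_action(robot_id, action)   # placeholder: position control per joint
        p.stepSimulation()
    p.resetSimulation()                      # episode over; robot must be reloaded afterwards
```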
First step:
Current model:
** The centre of balance is off since the battery is not installed; my fingers lift it up but apply no forward force.
-
The current physical robot would need to be rebuilt and redesigned, this time with PyBullet and the training model in mind, especially with respect to friction and the ability to "drag" the feet.
-
In my opinion, DDPG works best for "ideal" quadrupeds. The models I have seen online use DDPG, but their simulations are not realistic (the feet are "sticky" and the centre of mass is directly in the middle of the robot), which makes random exploration much more successful. Instead of a random exploration process (Ornstein-Uhlenbeck noise), an epsilon-greedy policy could be implemented (see the sketch below). Instead of DDPG, A2C or A3C (still Q-learning with a policy gradient) or a multidimensional PPO could be tried. There are plenty of RL algorithms which could be effective, but more research would be required.
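For what it's worth, epsilon-greedy exploration over a continuous action space could look roughly like this sketch (not something implemented here; the joint limits are placeholders):

```python
import numpy as np

def epsilon_greedy_action(actor_action, epsilon, low=-1.0, high=1.0):
    """With probability epsilon, take a uniform random action within (placeholder) joint
    limits; otherwise use the actor's deterministic action."""
    if np.random.rand() < epsilon:
        return np.random.uniform(low, high, size=actor_action.shape)
    return actor_action
```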
Using Python 3.10.12 and ROS2 Humble.
Setup for Training and Simulation:
git clone <repo>
sudo apt-get install -y python3-venv
mkdir venv && cd venv
python3 -m venv .
source bin/activate
cd ../sim/src/moving_model
python moving_model.py
If you are having issues setting up a venv with ROS2, it is a known issue with no fix yet.
ROS:
source <ros_distro>/install/setup.bash
colcon build
source ros/install/setup.bash
ros2 run model talker
** I ran into a lot of issues running a venv with ROS2; it is better to use rosdep instead. If I were to redo this project, I would use rosdep exclusively to keep dependencies in one place.
-
During research, my first problem was that most PGs are essentially Markov chains that produce one probabilistic action for each given observation. Vanilla policy gradients and the REINFORCE algorithm are the simple on-policy algorithms I started with. The agent generates a probability for each action and an action is sampled from these probabilities (or the most probable one is selected). The negative log-probabilities are multiplied by standardized rewards and summed to form the loss, which SGD then optimizes. This is insufficient for the robot, which requires deterministic, continuous position values.
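For reference, the REINFORCE-style loss described above looks roughly like this in PyTorch (a sketch, not this project's code):

```python
import torch

def reinforce_loss(log_probs, returns):
    """log_probs: list of log pi(a_t|s_t) tensors for the sampled actions.
    returns: per-step returns for the episode."""
    returns = torch.as_tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # standardize rewards
    return -(torch.stack(log_probs) * returns).sum()               # minimized by SGD
```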
Multiple approaches found online can produce a 12-dimensional continuous action space. DDPG and hierarchical policy gradients are popular ways to solve this problem. DeepMind's StarCraft II agent experiments use A3C, which can also handle large action spaces and has an incredibly complex agent structure.
It would be possible to map a stochastic policy to the min and max of each joint and take one action at a time. However, it is a wiser use of time to implement DDPG, since the policy would be deterministic and could output multiple actions at once. DDPG supports continuous action spaces with a deterministic policy, perfect for this application.
-
The curse of dimensionality applies to this problem as well. From DQNs to PGs the problem remains: deterministic or stochastic policies have trouble mapping the network output to a discrete action space. Each joint of the quadruped can move roughly 1.07 rad, which could be discretized into 0.2 rad increments, since both DQNs and PGs produce one discrete action (move 0.2 rad or don't). Would 0.2 rad be too much or too little? Doing this for every joint makes the number of discrete actions grow exponentially (a handful of bins per joint, raised to the power of 12 joints), so a continuous action space is required. (Actually, PGs can bypass this problem by interpreting the network outputs as the mean $\mu$ and log standard deviation $\log(\sigma)$ of a distribution and sampling from it, rather than treating the output directly as the action; see the sketch below.)
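A rough sketch of that Gaussian policy head (the sizes are illustrative, and this is not the network used in this project):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Stochastic continuous policy: outputs mu and log(sigma) per joint and samples."""
    def __init__(self, obs_dim=15, act_dim=12):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, act_dim)
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))  # learned log std per joint

    def forward(self, obs):
        h = self.body(obs)
        dist = torch.distributions.Normal(self.mu(h), self.log_sigma.exp())
        action = dist.sample()                 # any real value, no discretization needed
        return action, dist.log_prob(action).sum(-1)
```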
During the initial phases of training, I found many errors in the simulation that prevented the agent from learning an effective policy. At this point the errors could have come from both the PyBullet sim and the DDPG implementation. By playing with the simulation in moving_robot.py, I found that the friction and the normal force on each joint were not enough to push the robot forward, and the robot's center of gravity caused it to tip. Edits to the CAD files and simulation tweaking were required.
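While debugging, friction can be adjusted directly on the simulated links, roughly like this (the link indices and values are placeholders, not the ones I ended up using):

```python
import pybullet as p

# robot_id: body id returned by p.loadURDF(...)
FOOT_LINK_INDICES = [3, 7, 11, 15]  # placeholder indices for the four feet

# Increase foot friction so the robot can push against the ground instead of sliding.
for link in FOOT_LINK_INDICES:
    p.changeDynamics(robot_id, link,
                     lateralFriction=1.2,    # example values only
                     spinningFriction=0.05,
                     rollingFriction=0.01)
```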
-
DDPG convergence issues. This is a foreseeable problem with DDPG: the actions tend to saturate at the extreme values (1 or 0) if a policy is not found through the random Ornstein-Uhlenbeck (OU) exploration noise. I found that if the agent did not stumble onto a suitable policy through the random exploration within the first 400 iterations, it would not learn a non-trivial policy. Tuning the OU parameters so that exploration was effective, and changing the simulation (materials, center of balance, and contact points), eventually led to a decent policy.
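For reference, the OU process is a simple mean-reverting noise generator; a typical implementation looks like this (the theta/sigma/dt values are common defaults, not necessarily my tuned ones):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, size=12, mu=0.0, theta=0.15, sigma=0.2, dt=0.25):
        self.size, self.mu, self.theta, self.sigma, self.dt = size, mu, theta, sigma, dt
        self.reset()

    def reset(self):
        self.state = np.full(self.size, self.mu)

    def sample(self):
        # Drift back towards the mean, plus scaled Gaussian noise.
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(self.size))
        self.state = self.state + dx
        return self.state
```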
-
PyBullet provides three ways to control the joints: position, velocity, and torque control. I chose position control at the beginning because it is easy to map to the agent's action space. However, with the nature of stepping the environment, the resulting motion is choppy and not entirely smooth. The "optimal" policy learned under this control scheme moves only 2 of the 4 feet so the robot keeps its balance during the simulation steps between actions. I started exploring velocity and torque control since they can produce more continuous movements.
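All three modes go through the same PyBullet call; a rough sketch of the differences (the joint index, forces, and targets are illustrative, not this robot's values):

```python
import pybullet as p

joint = 0  # example joint index; robot_id from p.loadURDF(...)

# Position control: the target only changes every 60 steps, so motion looks choppy.
p.setJointMotorControl2(robot_id, joint, p.POSITION_CONTROL,
                        targetPosition=0.5, force=2.0)

# Velocity control: smoother motion between agent actions.
p.setJointMotorControl2(robot_id, joint, p.VELOCITY_CONTROL,
                        targetVelocity=1.0, force=2.0)

# Torque control: direct effort command; disable the default velocity motor first.
p.setJointMotorControl2(robot_id, joint, p.VELOCITY_CONTROL, force=0)
p.setJointMotorControl2(robot_id, joint, p.TORQUE_CONTROL, force=0.3)
```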
Inspiration articles from OpenAI and Andrej Karpathy.