This repository contains implementations of various deep reinforcement learning algorithms, focusing on fundamental concepts and practical applications.
It is recommended to follow the material in the given order.
Implementation of Monte Carlo (MC) algorithms using the Blackjack environment as an example:
- MC Prediction
  - First-visit MC prediction for estimating the action-value function
  - Policy evaluation with a stochastic policy
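A rough illustration of the first-visit idea described above (not the repository's exact code): average the returns observed the first time each state-action pair appears in an episode.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate Q(s, a) by averaging first-visit returns.

    `episodes` is assumed to be a list of [(state, action, reward), ...] lists.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    Q = defaultdict(float)
    for episode in episodes:
        # Index of the first occurrence of each (state, action) pair.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        # Walk backwards, accumulating the discounted return G.
        G, returns = 0.0, {}
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:      # only the first visit contributes
                returns[(s, a)] = G
        for key, g in returns.items():
            returns_sum[key] += g
            returns_cnt[key] += 1
            Q[key] = returns_sum[key] / returns_cnt[key]
    return Q
```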
- MC Control with Incremental Mean
  - GLIE (Greedy in the Limit with Infinite Exploration)
  - Epsilon-greedy policy implementation
  - Incremental mean updates
- MC Control with Constant-alpha
  - Fixed learning rate approach
  - Enhanced control over the update process
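The constant-alpha update at the heart of this variant, as a minimal sketch (placeholder names, assuming `Q` maps each state to an array of action values):

```python
def constant_alpha_update(Q, s, a, G, alpha=0.02):
    """Move Q(s, a) a fixed fraction alpha toward the sampled return G.

    With incremental-mean MC control, alpha would instead be 1 / N(s, a);
    a constant alpha keeps adapting even late in training.
    """
    Q[s][a] += alpha * (G - Q[s][a])
    return Q
```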
Implementation of TD algorithms on the Blackjack and CliffWalking environments, with extensions to continuous-state tasks (MountainCar, Acrobot) via discretization and tile coding:
- SARSA (On-Policy TD Control)
  - State-Action-Reward-State-Action
  - On-policy learning with epsilon-greedy exploration
  - Episode-based updates with TD(0)
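A sketch of the one-step SARSA (TD(0)) update, with placeholder names rather than the exact code in this repo (`Q` maps each state to an array of action values):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """On-policy TD(0): bootstrap from the action the policy actually selects next."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```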
- Q-Learning (Off-Policy TD Control)
  - Also known as SARSA-Max
  - Off-policy learning using maximum action values
  - Optimal action-value function approximation
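For comparison, an illustrative Q-learning (SARSA-Max) update; the only change from SARSA is the max over next-state action values (placeholder names, same Q-table layout as above):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0, done=False):
    """Off-policy TD(0): bootstrap from the greedy next action, regardless of
    which action the behaviour policy actually takes."""
    best_next = 0.0 if done else np.max(Q[s_next])   # Q[s] is an array over actions
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q
```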
- Expected SARSA
  - Extension of SARSA using expected values over next actions
  - More stable learning through action-probability weighting
  - Combines benefits of SARSA and Q-Learning
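Expected SARSA replaces the sampled (or maximal) next-action value with an expectation under the epsilon-greedy policy; a sketch with assumed names:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0, eps=0.1):
    """Weight each next action by its epsilon-greedy selection probability."""
    n_actions = len(Q[s_next])
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - eps
    expected_value = np.dot(probs, Q[s_next])
    Q[s][a] += alpha * (r + gamma * expected_value - Q[s][a])
    return Q
```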
- Q-Learning (Off-Policy TD Control) with Uniform Discretization
  - Q-Learning applied to the MountainCar environment using a discretized state space
  - State space discretization through a uniform grid representation for continuous variables
  - Exploration of the impact of discretization granularity on learning performance
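A minimal sketch of uniform grid discretization for MountainCar-style continuous observations (the bin counts and bounds below are illustrative assumptions):

```python
import numpy as np

def create_uniform_grid(low, high, bins=(10, 10)):
    """Split each state dimension into equally sized bins; return the interior
    split points per dimension."""
    return [np.linspace(low[d], high[d], bins[d] + 1)[1:-1] for d in range(len(bins))]

def discretize(sample, grid):
    """Map a continuous observation to a tuple of bin indices."""
    return tuple(int(np.digitize(s, g)) for s, g in zip(sample, grid))

# MountainCar-like bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07]
grid = create_uniform_grid(low=[-1.2, -0.07], high=[0.6, 0.07], bins=(10, 10))
print(discretize([-0.5, 0.01], grid))   # -> (3, 5)
```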
- Q-Learning (Off-Policy TD Control) with Tile Coding
  - Q-Learning applied to the Acrobot environment using tile coding for state space representation
  - Tile coding as a method to efficiently represent continuous state spaces with overlapping feature grids
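A rough sketch of tile coding as several offset uniform grids; the particular offsets and bin counts here are assumptions for illustration, not necessarily the scheme used in the notebook:

```python
import numpy as np

def create_tilings(low, high, bins=(10, 10),
                   offsets=((0.0, 0.0), (0.33, 0.33), (0.66, 0.66))):
    """Build several uniform grids, each shifted by a fraction of a bin width."""
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    width = (high - low) / np.asarray(bins)
    tilings = []
    for off in offsets:
        shift = width * np.asarray(off)
        tilings.append([np.linspace(low[d], high[d], bins[d] + 1)[1:-1] + shift[d]
                        for d in range(len(bins))])
    return tilings

def tile_encode(sample, tilings):
    """Encode a continuous sample as one bin-index tuple per tiling."""
    return [tuple(int(np.digitize(s, g)) for s, g in zip(sample, grid))
            for grid in tilings]

tilings = create_tilings(low=[-1.2, -0.07], high=[0.6, 0.07])
print(tile_encode([-0.5, 0.01], tilings))   # one coarse coordinate per tiling
```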
- Deep Q Network with Experience Replay (DQN)
  - A neural network is used to approximate the Q-value function $Q(s, a)$.
  - Breaks the temporal correlation of samples by randomly sampling from a replay buffer.
  - Periodically updates the target network's parameters to reduce instability in target value estimation.
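A condensed PyTorch sketch of the two ingredients above, a replay buffer and a periodically synced target network; network sizes and hyperparameters are illustrative, not the repository's settings:

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, x):
        return self.net(x)

gamma = 0.99
buffer = deque(maxlen=100_000)                    # experience replay memory
online, target = QNetwork(8, 4), QNetwork(8, 4)   # e.g. LunarLander: 8 inputs, 4 actions
target.load_state_dict(online.state_dict())
optimizer = torch.optim.Adam(online.parameters(), lr=1e-3)

def learn_step(batch_size=64):
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)     # random sampling breaks temporal correlation
    states, actions, rewards, next_states, dones = zip(*batch)
    s  = torch.as_tensor(np.array(states), dtype=torch.float32)
    a  = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    r  = torch.as_tensor(rewards, dtype=torch.float32)
    s2 = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    d  = torch.as_tensor(dones, dtype=torch.float32)
    q = online(s).gather(1, a).squeeze(1)
    with torch.no_grad():                          # fixed target network stabilizes the target
        y = r + gamma * target(s2).max(dim=1).values * (1 - d)
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every C learn steps: target.load_state_dict(online.state_dict())
```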
- Double Deep Q Network with Experience Replay (DDQN)
  - Addresses the overestimation bias in vanilla DQN by decoupling action selection and evaluation.
  - This decoupling helps stabilize training and improves the accuracy of Q-value estimates.
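The only change from DQN is how the target is formed: the online network selects the next action and the target network evaluates it. A sketch, reusing the assumed `online`/`target` networks from the DQN example above:

```python
import torch

def double_dqn_targets(online, target, rewards, next_states, dones, gamma=0.99):
    """Decouple selection (online net) from evaluation (target net)."""
    with torch.no_grad():
        next_actions = online(next_states).argmax(dim=1, keepdim=True)        # select
        next_values = target(next_states).gather(1, next_actions).squeeze(1)  # evaluate
        return rewards + gamma * next_values * (1 - dones)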
- Prioritized Double Deep Q Network (Prioritized DDQN)
  - Enhances the efficiency of experience replay by prioritizing transitions with higher temporal-difference (TD) errors.
  - Combines the stability of Double DQN with prioritized sampling to focus on more informative experiences.
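The core of prioritized replay is sampling transitions in proportion to a power of their TD error and correcting the resulting bias with importance-sampling weights. A compact list-based sketch (real implementations typically use a sum-tree for efficiency):

```python
import numpy as np

class PrioritizedBuffer:
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                                   # P(i) ∝ |TD error|^alpha
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)    # importance-sampling correction
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```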
- Dueling Double Deep Q Network (Dueling DDQN)
  - Introduces a new architecture that separates the estimation of the state value $V(s)$ and the advantage function $A(s, a)$.
  - Improves learning efficiency by explicitly modeling the state value $V(s)$, which captures the overall "desirability" of a state independent of the chosen action.
  - Works particularly well in environments where some actions are redundant or where the state value $V(s)$ plays a dominant role in decision-making.
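A minimal PyTorch sketch of the dueling head: separate $V(s)$ and $A(s, a)$ streams recombined with the mean-advantage trick so the decomposition is identifiable (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, x):
        h = self.feature(x)
        v, a = self.value(h), self.advantage(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)

q = DuelingQNetwork(state_dim=8, n_actions=4)
print(q(torch.randn(2, 8)).shape)   # torch.Size([2, 4])
```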
- Noisy Dueling Prioritized Double Deep Q-Network (Noisy DDQN)
  - Combines Noisy Networks, Dueling Architecture, Prioritized Experience Replay, and Double Q-Learning into a single framework.
  - Noisy Networks replace ε-greedy exploration with parameterized noise, enabling more efficient exploration by learning stochastic policies.
  - Dueling Architecture separates the estimation of the state value $V(s)$ and the advantage function $A(s, a)$, improving learning efficiency.
  - Prioritized Experience Replay focuses on transitions with higher temporal-difference (TD) errors, enhancing sample efficiency.
  - Double Q-Learning reduces overestimation bias by decoupling action selection from evaluation.
  - This combination significantly improves convergence speed and stability, particularly in environments with sparse or noisy rewards.
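An illustrative NoisyNet-style linear layer with learnable noise scales; this sketch uses independent Gaussian noise rather than the factorized variant commonly used in practice, and is not necessarily the repository's implementation:

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)"""
    def __init__(self, in_features, out_features, sigma0=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma0))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.mu_w, -bound, bound)

    def forward(self, x):
        if self.training:   # sample fresh noise each forward pass -> stochastic exploration
            eps_w = torch.randn_like(self.sigma_w)
            eps_b = torch.randn_like(self.sigma_b)
            return nn.functional.linear(x, self.mu_w + self.sigma_w * eps_w,
                                        self.mu_b + self.sigma_b * eps_b)
        return nn.functional.linear(x, self.mu_w, self.mu_b)
```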
- Asynchronous One Step Deep Q Network without Experience Replay (AsyncDQN)
  - Eliminates the dependency on experience replay by using asynchronous parallel processes to interact with the environment and update the shared Q-network.
  - Achieves significant speedup by leveraging multiple CPU cores, making it highly efficient even without GPU acceleration.
  - Compared to Dueling DDQN (22 minutes), AsyncDQN completes training in just 4.29 minutes on CPU, achieving a roughly 5x speedup.
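A highly simplified, Hogwild-style sketch of the asynchronous setup with `torch.multiprocessing`: a shared network is updated directly by several worker processes, each running its own environment. Environment interaction and the TD loss are stubbed out here as assumptions:

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

def worker(shared_net, rank, steps=1000):
    """Each process would interact with its own environment copy and apply
    gradient updates directly to the shared parameters."""
    opt = torch.optim.SGD(shared_net.parameters(), lr=1e-3)
    for _ in range(steps):
        # ... collect (s, a, r, s') from this worker's environment (omitted) ...
        # loss = one_step_td_loss(shared_net, s, a, r, s2)   # hypothetical helper
        loss = shared_net(torch.randn(1, 8)).pow(2).mean()   # stand-in loss for illustration
        opt.zero_grad(); loss.backward(); opt.step()

if __name__ == "__main__":
    net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
    net.share_memory()                  # parameters live in shared memory
    procs = [mp.Process(target=worker, args=(net, r)) for r in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
```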
- Asynchronous One Step Deep SARSA without Experience Replay (AsyncDSARSA)
  - Uses the same asynchronous parallel processes to update a shared Q-network without the need for experience replay.
  - Employs a one-step, on-policy SARSA update rule that uses the next selected action to enhance stability and reduce overestimation (essentially the same framework as AsyncDQN).
- Asynchronous N-Step Deep Q Network without Experience Replay (AsyncNDQN)
  - Extends AsyncDQN by incorporating N-step returns, which balance the trade-off between bias (shorter N) and variance (longer N).
  - N-step returns accelerate the propagation of rewards across states, enabling faster convergence compared to one-step updates.
  - Like AsyncDQN, it eliminates the dependency on experience replay, using asynchronous parallel processes to update the shared Q-network.
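A small helper illustrating the N-step return used as the bootstrapped target (placeholder names; `bootstrap_value` would come from the target network's estimate of the state reached after N steps):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """G_t = r_t + gamma*r_{t+1} + ... + gamma^{N-1}*r_{t+N-1} + gamma^N * V(s_{t+N}).

    `rewards` holds the N rewards collected starting from state s_t.
    """
    G = bootstrap_value
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(n_step_return([1.0, 0.0, 1.0], bootstrap_value=0.5))   # 3-step example
```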
- Hill Climbing
  - A simple optimization technique that iteratively improves the policy by making small adjustments to its parameters.
  - Relies on evaluating the performance of the policy after each adjustment and keeping the changes that improve performance.
  - Works well in low-dimensional problems but can struggle with local optima and high-dimensional spaces.
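A bare-bones sketch of hill climbing with adaptive noise scaling; the `evaluate` function, which would run one episode with the given parameters and return its total reward, is an assumed stand-in:

```python
import numpy as np

def hill_climbing(evaluate, dim, iterations=1000, noise=0.1):
    """Perturb the best-so-far parameters; keep the perturbation only if it
    scores at least as well as the current best."""
    best_w = np.zeros(dim)
    best_score = evaluate(best_w)
    for _ in range(iterations):
        candidate = best_w + noise * np.random.randn(dim)
        score = evaluate(candidate)
        if score >= best_score:
            best_w, best_score = candidate, score
            noise = max(noise / 2, 1e-3)   # shrink the search radius on success
        else:
            noise = min(noise * 2, 2.0)    # widen it after a failure
    return best_w, best_score

# Toy usage: maximize -||w - 3||^2 (optimum at w = [3, 3])
w, s = hill_climbing(lambda w: -np.sum((w - 3.0) ** 2), dim=2)
```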
- Cross-Entropy Method
  - A probabilistic optimization algorithm that searches for the best policy by iteratively sampling and updating a distribution over policy parameters.
  - Particularly effective in high-dimensional or continuous action spaces due to its ability to focus on promising regions of the parameter space.
  - Often used as a baseline for policy optimization in reinforcement learning.
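A compact cross-entropy method loop over policy parameters; as above, `evaluate` is an assumed episode-return function:

```python
import numpy as np

def cross_entropy_method(evaluate, dim, pop_size=50, elite_frac=0.2,
                         iterations=100, sigma=0.5):
    """Sample a population from a Gaussian over parameters, keep the elite
    fraction, and refit the Gaussian to the elites."""
    mean, std = np.zeros(dim), sigma * np.ones(dim)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iterations):
        samples = mean + std * np.random.randn(pop_size, dim)
        scores = np.array([evaluate(w) for w in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]   # top performers
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean

best = cross_entropy_method(lambda w: -np.sum((w - 3.0) ** 2), dim=2)
```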
- REINFORCE
  - A foundational policy gradient algorithm that directly optimizes the policy by maximizing the expected cumulative reward.
  - Uses Monte Carlo sampling to estimate the policy gradient.
  - Updates the policy parameters based on the gradient of the expected reward with respect to the policy.
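The REINFORCE gradient estimate in its simplest single-episode form: maximize the log-probabilities of the taken actions weighted by the Monte Carlo return, i.e. minimize the negated sum (a sketch with assumed per-episode tensors, not the repository's exact code):

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=1.0):
    """log_probs: list of log pi(a_t | s_t) tensors collected during one episode.
    rewards:     list of scalar rewards r_t from the same episode."""
    G = sum(gamma ** t * r for t, r in enumerate(rewards))   # Monte Carlo return
    return -G * torch.stack(log_probs).sum()                 # negate for gradient ascent

# loss = reinforce_loss(log_probs, rewards); loss.backward(); optimizer.step()
```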
- Improvements over REINFORCE
  - Parallel collection of multiple trajectories allows the policy gradient to be estimated by averaging across trajectories, leading to more stable updates.
  - Rewards are normalized to stabilize learning and ensure consistent gradient step sizes.
  - Credit assignment is improved by considering only the future rewards for each action, which reduces gradient noise without changing the expected gradient, leading to faster and more stable training.
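A sketch of the future-rewards ("reward-to-go") credit assignment plus reward normalization described above, applied to a batch of parallel trajectories (NumPy; the `(T, n_envs)` shape is an assumption for illustration):

```python
import numpy as np

def discounted_future_rewards(rewards, gamma=0.99):
    """rewards: array of shape (T, n_envs) from parallel trajectory collection.
    Returns, for each time step, the discounted sum of rewards that come afterwards."""
    future = np.zeros_like(rewards, dtype=float)
    running = np.zeros(rewards.shape[1])
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        future[t] = running
    # Normalize across the parallel trajectories so gradient step sizes stay consistent.
    mean = future.mean(axis=1, keepdims=True)
    std = future.std(axis=1, keepdims=True) + 1e-8
    return (future - mean) / std
```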
- Proximal Policy Optimization (PPO)
  - Introduces a clipped surrogate objective to ensure stable updates by preventing large changes in the policy.
  - Balances exploration and exploitation by limiting the policy ratio deviation within a trust region.
  - Combines the simplicity of REINFORCE with the stability of Trust Region Policy Optimization (TRPO), making it efficient and robust for large-scale problems.
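The clipped surrogate objective at the core of PPO, in PyTorch form (a sketch; `advantages` and `old_log_probs` are assumed to come from the rollout phase):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Limit how far the new policy can move from the one that collected the data."""
    ratio = torch.exp(new_log_probs - old_log_probs)               # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # maximize surrogate => minimize negative
```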
- Blackjack: Classic card game environment for policy learning
- CliffWalking: Grid-world navigation task with negative rewards and cliff hazards
- Taxi-v3: Grid-world transportation task where an agent learns to efficiently navigate, pick up and deliver passengers to designated locations while optimizing rewards.
- MountainCar: Continuous control task where an underpowered car must learn to build momentum by moving back and forth to overcome a steep hill and reach the goal position.
- Acrobot: A two-link robotic arm environment where the goal is to swing the end of the second link above a target height by applying torque at the actuated joint. It challenges agents to solve nonlinear dynamics and coordinate the motion of linked components efficiently.
- LunarLander: A physics-based environment where an agent controls a lunar lander to safely land on a designated pad. The task involves managing fuel consumption, balancing thrust, and handling the dynamics of gravity and inertia.
- PongDeterministic-v4: A classic Atari environment where the agent learns to play Pong, a two-player game where the objective is to hit the ball past the opponent's paddle. The Deterministic-v4 variant ensures fixed frame-skipping, making the environment faster and more predictable for training. This environment is commonly used to benchmark reinforcement learning algorithms, especially for discrete action spaces.
Create (and activate) a new environment with Python 3.10 and install PyTorch 2.5.1:
conda create -n DRL python=3.10
conda activate DRL
- Clone the repository:
git clone https://github.com/deepbiolab/drl.git
cd drl
- Install dependencies:
pip install -r requirements.txt
Run the Monte Carlo implementation:
cd monte-carlo-methods
python monte_carlo.py
Or explore the detailed notebook:
- Comprehensive implementations of fundamental RL algorithms
- MC Control (Monte-Carlo Control)
- MC Control with Incremental Mean
- MC Control with Constant-alpha
- SARSA
- SARSA Max (Q-Learning)
- Expected SARSA
- Q-learning with Uniform Discretization
- Q-learning with Tile Coding Discretization
- DQN
- DDQN
- Prioritized DDQN
- Dueling DDQN
- Async One Step DQN
- Async N Step DQN
- Async One Step SARSA
- Distributional DQN
- Noisy DQN
- Rainbow
- Hill Climbing
- Cross Entropy Method
- REINFORCE
- PPO
- A3C
- A2C
- DDPG
- MCTS
- AlphaZero