Commit
Merge pull request #1518 from alo7lika/TEST
Proximal Policy Optimization (PPO) Algorithm in Machine Learning
Showing 2 changed files with 178 additions and 0 deletions.
Machine_Learning_Algorithms/Proximal Policy Optimization (PPO)Algorithm /Program.c
81 changes: 81 additions & 0 deletions
// Minimal PPO-style demonstration in plain C (link the math library when compiling, e.g. cc Program.c -lm)
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define TRAJECTORY_LENGTH 100
#define NUM_TRAJECTORIES 10
#define CLIP_EPSILON 0.2
#define LEARNING_RATE 0.001
#define GAMMA 0.99
#define LAMBDA 0.95

// Placeholder functions for the neural network
double policy(double state) {
    // Simple placeholder function for the policy
    return state * 0.1;
}

double value_function(double state) {
    // Simple placeholder function for the value function
    return state * 0.5;
}

// Calculate the advantage using Generalized Advantage Estimation (GAE).
// values[] must hold TRAJECTORY_LENGTH + 1 entries; the last entry is the
// bootstrap value for the state after the final step.
double calculate_advantage(const double rewards[], const double values[], int t) {
    double advantage = 0.0;
    double discount = 1.0;
    for (int k = t; k < TRAJECTORY_LENGTH; ++k) {
        advantage += discount * (rewards[k] + GAMMA * values[k + 1] - values[k]);
        discount *= GAMMA * LAMBDA;
    }
    return advantage;
}

// Clipped surrogate objective for a single sample
double clipped_objective(double ratio, double advantage) {
    double clip_value = fmax(1 - CLIP_EPSILON, fmin(1 + CLIP_EPSILON, ratio));
    return fmin(ratio * advantage, clip_value * advantage);
}

// Main PPO loop
void PPO(void) {
    double states[TRAJECTORY_LENGTH];
    double actions[TRAJECTORY_LENGTH];
    double rewards[TRAJECTORY_LENGTH];
    double values[TRAJECTORY_LENGTH + 1];   // extra slot for the bootstrap value
    double advantages[TRAJECTORY_LENGTH];
    double returns[TRAJECTORY_LENGTH];

    for (int episode = 0; episode < NUM_TRAJECTORIES; ++episode) {
        // Simulate data collection
        for (int t = 0; t < TRAJECTORY_LENGTH; ++t) {
            states[t] = (double)t;             // Placeholder state
            actions[t] = policy(states[t]);    // Take action according to policy
            rewards[t] = -fabs(actions[t]);    // Placeholder reward function
            values[t] = value_function(states[t]);
        }
        values[TRAJECTORY_LENGTH] = 0.0;       // Bootstrap value after the final step

        // Calculate returns and advantages
        for (int t = 0; t < TRAJECTORY_LENGTH; ++t) {
            returns[t] = rewards[t] + GAMMA * values[t + 1];
            advantages[t] = calculate_advantage(rewards, values, t);
        }

        // Update policy using the clipped objective
        for (int t = 0; t < TRAJECTORY_LENGTH; ++t) {
            double old_policy = policy(states[t]);
            double new_policy = policy(states[t]);
            // Placeholder probability ratio; guard against division by zero
            double ratio = (old_policy != 0.0) ? new_policy / old_policy : 1.0;
            double objective = clipped_objective(ratio, advantages[t]);

            // Simple gradient step (mock update, as there is no neural network here);
            // in practice, the policy parameters would be updated via network gradients
            double policy_update = LEARNING_RATE * objective;
            printf("Policy updated for state %f with value %f\n", states[t], policy_update);
        }
    }
}

int main(void) {
    PPO();
    return 0;
}
Machine_Learning_Algorithms/Proximal Policy Optimization (PPO)Algorithm /README.md
97 changes: 97 additions & 0 deletions
# Proximal Policy Optimization (PPO) Algorithm in Machine Learning

---

## Description

Proximal Policy Optimization (PPO) is an advanced reinforcement learning (RL) algorithm designed to help agents learn optimal policies in complex environments. Developed by OpenAI, PPO strikes a balance between implementation complexity and performance, making it popular in areas such as game playing, robotics, and autonomous control systems. PPO is a policy-gradient method that improves the stability and efficiency of training by using a clipped objective, allowing it to find near-optimal policies while preventing the overly large updates that can destabilize learning.

---

## Key Features

1. **Policy Optimization with Clipping**: PPO restricts large policy updates by applying a clipping mechanism to the objective function, ensuring stable learning without drastic changes that could harm performance.
2. **Surrogate Objective Function**: PPO optimizes a surrogate objective that penalizes large deviations from the old policy, reducing the risk of unstable updates.
3. **On-Policy Learning**: PPO is primarily an on-policy algorithm, meaning it learns from data generated by the current policy; this keeps updates consistent with the behavior being evaluated and improves stability, though it limits how much each sample can be reused.
4. **Trust Region-Free**: Unlike Trust Region Policy Optimization (TRPO), PPO avoids complex trust-region constraints and relies on a simpler clipping rule for policy updates, making it computationally efficient.
5. **Entropy Bonus**: The algorithm incorporates an entropy bonus to encourage exploration, helping the agent avoid local optima (see the sketch after this list).
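As a rough illustration of feature 5, here is a minimal C sketch of an entropy bonus for a discrete action distribution. The helper `policy_entropy`, the coefficient `ENTROPY_COEF`, and the example probabilities are illustrative assumptions, not part of the repository's Program.c.

```c
#include <math.h>
#include <stdio.h>

#define ENTROPY_COEF 0.01  /* assumed exploration coefficient, not from Program.c */

/* Entropy of a discrete action distribution: H = -sum_a p(a) * log p(a). */
double policy_entropy(const double probs[], int num_actions) {
    double entropy = 0.0;
    for (int a = 0; a < num_actions; ++a) {
        if (probs[a] > 0.0) {
            entropy -= probs[a] * log(probs[a]);
        }
    }
    return entropy;
}

int main(void) {
    /* A near-uniform distribution has high entropy; a peaked one has low entropy. */
    double uniform[4] = {0.25, 0.25, 0.25, 0.25};
    double peaked[4]  = {0.97, 0.01, 0.01, 0.01};
    printf("uniform entropy: %f\n", policy_entropy(uniform, 4));  /* ~1.386 */
    printf("peaked entropy:  %f\n", policy_entropy(peaked, 4));   /* ~0.168 */
    /* The bonus ENTROPY_COEF * entropy is added to the surrogate objective
       so the optimizer is rewarded for keeping the policy exploratory. */
    return 0;
}
```

In full PPO implementations this bonus is weighted into the total loss alongside the clipped surrogate and the value-function error.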
---

## Problem Definition

In reinforcement learning, an agent aims to learn an optimal policy, \( \pi(a|s) \), that maximizes expected cumulative rewards over time. The main challenges in policy optimization include:

1. **Stability**: Large updates to policies can lead to drastic performance drops.
2. **Sample Efficiency**: Efficient use of data is crucial, especially in complex environments with high-dimensional state and action spaces.
3. **Exploration vs. Exploitation**: The agent needs to balance exploring new actions with exploiting known, rewarding actions.

PPO addresses these challenges by refining the policy-gradient update approach through a clipped objective function, which stabilizes learning by controlling the impact of each update.
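For reference, the cumulative-reward objective mentioned above is commonly written as the expected discounted return collected by the policy (the discount factor \( \gamma \) plays the same role as the GAMMA constant in the accompanying Program.c):

\[
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \gamma^{t} r_t \right]
\]

where \( \tau = (s_0, a_0, r_0, \dots) \) is a trajectory generated by \( \pi_{\theta} \), \( r_t \) is the reward at step \( t \), and \( \gamma \in [0, 1) \) discounts future rewards.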
---

## Algorithm Review

### Steps of the PPO Algorithm

1. **Initialize** the policy network \( \pi_{\theta} \) and the value network \( V_{\phi} \) with random weights \( \theta \) and \( \phi \).
2. **Generate Trajectories**: Using the current policy \( \pi_{\theta} \), generate multiple trajectories (i.e., sequences of states, actions, and rewards) by interacting with the environment.
3. **Compute Rewards-to-Go**: For each state in a trajectory, compute the cumulative rewards-to-go (the return), which approximates the true value function.
4. **Compute Advantages**: Calculate the advantage function, which estimates how much better an action is than the average action in a given state. PPO often uses Generalized Advantage Estimation (GAE) for a more stable advantage computation.
5. **Update the Policy with Clipping**: Use the surrogate objective function with a clipping factor to update the policy (a C sketch of this objective follows the list). The objective is given by:

   \[
   L^{\text{CLIP}}(\theta) = \mathbb{E} \left[ \min \left( r_t(\theta) \hat{A}_t, \, \text{clip}\left( r_t(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right]
   \]

   where:

   - \( r_t(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \): the probability ratio between the new policy and the old policy.
   - \( \epsilon \): the clipping threshold.
   - \( \hat{A}_t \): the advantage estimate, i.e., the relative benefit of taking action \( a \) in state \( s \) compared to the average action in that state.

6. **Update the Value Network**: Minimize the difference between the estimated values \( V_{\phi}(s) \) and the computed returns for more accurate value predictions.
7. **Repeat**: Iterate steps 2-6 until convergence or for a predefined number of episodes.
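To connect the formula in step 5 with code, here is a minimal, self-contained C sketch of the per-sample clipped surrogate term. The helper names (`prob_ratio`, `clipped_surrogate`) and the example probabilities are illustrative assumptions rather than part of the repository's Program.c, though the clipping logic mirrors its `clipped_objective` function.

```c
#include <math.h>
#include <stdio.h>

#define CLIP_EPSILON 0.2   /* same clipping threshold as in Program.c */

/* r_t(theta): ratio of new to old probability of the sampled action. */
double prob_ratio(double new_prob, double old_prob) {
    return new_prob / old_prob;   /* assumes old_prob > 0 */
}

/* Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A). */
double clipped_surrogate(double ratio, double advantage) {
    double clipped = fmax(1.0 - CLIP_EPSILON, fmin(1.0 + CLIP_EPSILON, ratio));
    return fmin(ratio * advantage, clipped * advantage);
}

int main(void) {
    /* Illustrative numbers: the new policy makes the sampled action 50% more
       likely, and the advantage estimate is positive. */
    double ratio = prob_ratio(0.3, 0.2);   /* r = 1.5 */
    double advantage = 2.0;
    /* Clipping caps the ratio at 1 + eps = 1.2, so the term is 1.2 * 2.0 = 2.4
       instead of 1.5 * 2.0 = 3.0, limiting the size of the update. */
    printf("clipped surrogate = %f\n", clipped_surrogate(ratio, advantage));
    return 0;
}
```

In practice this term is averaged over a minibatch of timesteps and maximized by gradient ascent on the policy parameters \( \theta \).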
---

## Time Complexity

The time complexity of PPO mainly depends on:

1. **Policy Network Forward Pass**: Collecting the trajectories has complexity

   \[
   O(N \cdot T \cdot P)
   \]

   where:

   - \( N \): number of trajectories.
   - \( T \): trajectory length.
   - \( P \): cost of a single forward pass through the policy network.

2. **Gradient Update**: PPO typically performs several policy updates per batch of trajectories, leading to an overall training complexity of

   \[
   O(E \cdot N \cdot T \cdot P)
   \]

   where \( E \) is the number of training episodes (a worked example follows this list).
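As a rough illustration (the episode count here is an assumed value, while \( N \) and \( T \) match the constants in Program.c): with \( N = 10 \) trajectories of length \( T = 100 \) and \( E = 500 \) training episodes, the number of policy-network forward passes grows as

\[
E \cdot N \cdot T = 500 \cdot 10 \cdot 100 = 5 \times 10^{5},
\]

each costing \( O(P) \), so doubling any one of \( E \), \( N \), or \( T \) doubles the total training cost.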
Overall, PPO has lower computational overhead than more heavily constrained methods such as TRPO, but as an on-policy method it generally needs more environment samples than off-policy algorithms like DDPG.

---

## Applications

PPO is widely used in applications that benefit from robust policy optimization, including:

1. **Robotics**: Control tasks for robotic arms and autonomous agents.
2. **Gaming**: Game AI that must learn complex behaviors in environments such as chess, Go, and various video games.
3. **Autonomous Vehicles**: Path planning and decision-making systems in self-driving cars.

---

## Conclusion

Proximal Policy Optimization (PPO) is a powerful reinforcement learning algorithm designed to stabilize the policy-gradient update process. Its clipped objective function keeps each policy update within a limited range, which improves stability, data efficiency, and convergence speed across a wide range of reinforcement learning applications.