From 12055312421390f0c0e4e118d146d94a0b59f2ea Mon Sep 17 00:00:00 2001
From: Mohammed Abdalqader <35113255+Mohammedabdalqader@users.noreply.github.com>
Date: Thu, 28 May 2020 15:40:04 +0200
Subject: [PATCH] Update Report.md

---
 .../Report.md | 95 +++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/Project-3_Collaboration_and_Competition/Report.md b/Project-3_Collaboration_and_Competition/Report.md
index e69de29..19f1208 100644
--- a/Project-3_Collaboration_and_Competition/Report.md
+++ b/Project-3_Collaboration_and_Competition/Report.md
@@ -0,0 +1,95 @@
[//]: # (Image References)

[actor-critic]: ../Continuous-Control/images/actor-critic.png "AC"
[maddpg]: Collaboration_and_Competition/images/maddpg.png "MADDPG"


# Multi-Agent Collaboration and Competition

In this report I explain this project in detail, looking at the following aspects:
- **Actor-Critic**
- **MADDPG (Multi-Agent Deep Deterministic Policy Gradient) algorithm**
- **Model architectures**
- **Hyperparameters**
- **Results**
- **Future Work**


### Actor-Critic

Actor-critic algorithms are the basis of almost every modern RL method, such as PPO, A3C and many more. To understand these newer techniques, you first need a good understanding of what actor-critic is and how it works.

Let us first distinguish between value-based and policy-based methods:

Value-based methods such as Q-Learning and its extensions try to find or approximate the optimal value function, a mapping from a state-action pair to a value, while policy-based methods such as Policy Gradients and REINFORCE try to find the optimal policy directly, without going through a Q-value.

Each family has its advantages. For example, policy-based methods are better suited to continuous action spaces and stochastic environments and can converge faster, while value-based methods tend to be more sample-efficient and stable.

Actor-critic methods aim to combine the strengths of the value-based and policy-based approaches while mitigating their respective weaknesses.

The basic idea is to split the model into two parts: one that chooses an action given a state, and another that estimates the Q-value of that action.

![ac][actor-critic]


Actor: decides which action to take.

Critic: tells the actor how good its action was and how it should adjust.

### Distributed distributional deep deterministic policy gradients (D4PG) algorithm

The core idea of this algorithm is to replace the single Q-value produced by the critic with N_ATOMS values, corresponding to the probabilities of values from a pre-defined range. The Bellman equation is replaced with the distributional Bellman operator, which transforms this distributional representation in an analogous way. In this project, the critics use this distributional representation with 51 atoms over the range [Vmin, Vmax] given in the hyperparameters below.

### Model architectures

A minimal sketch of both networks follows the layer lists below.

**Actor architecture**

Both actor networks (local and target) consist of 3 fully-connected layers (2 hidden layers, 1 output layer); each hidden layer is followed by a ReLU activation function and a batch-normalization layer.

The numbers of neurons in the fully-connected layers are as follows:

- fc1, number of neurons: 400
- fc2, number of neurons: 300
- fc3, number of neurons: 4 (number of actions)

**Critic architecture**

Both critic networks (local and target) consist of 3 fully-connected layers (2 hidden layers, 1 output layer); each hidden layer is followed by a ReLU activation function.

The numbers of neurons in the fully-connected layers are as follows:

- fc1, number of neurons: 400
- fc2, number of neurons: 300
- fc3, number of neurons: 51 (number of atoms)
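To make the description above concrete, here is a minimal PyTorch sketch of the two networks with the layer sizes listed above. It is only an illustration under stated assumptions, not the project's actual code: the class names (`Actor`, `Critic`), the constructor arguments, the placement of batch normalization relative to ReLU, the point where the action is concatenated into the critic, and the softmax over atoms are all assumptions.

```python
# Illustrative sketch only -- layer sizes follow the report, everything else is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a state to a deterministic action with components in [-1, 1]."""
    def __init__(self, state_size, action_size, fc1_units=400, fc2_units=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)          # batch norm after the first hidden layer
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.bn2 = nn.BatchNorm1d(fc2_units)          # ... and after the second
        self.fc3 = nn.Linear(fc2_units, action_size)  # 400 -> 300 -> action_size

    def forward(self, state):
        x = self.bn1(F.relu(self.fc1(state)))
        x = self.bn2(F.relu(self.fc2(x)))
        return torch.tanh(self.fc3(x))                # bounded actions

class Critic(nn.Module):
    """Maps a (state, action) pair to a probability distribution over n_atoms value atoms."""
    def __init__(self, state_size, action_size, n_atoms=51, fc1_units=400, fc2_units=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        # Concatenating the action at the second layer is a common choice, not a requirement.
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, n_atoms)      # 51 atoms instead of a single Q-value

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return F.softmax(self.fc3(x), dim=1)          # probabilities over the atoms
```

In a multi-agent (MADDPG-style) setup, the critic's `state_size` and `action_size` may in fact be the concatenated observations and actions of both agents; the sketch leaves that choice open.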
### Hyperparameters

Many hyperparameters are involved in the experiment. The value of each of them is given below:

| Hyperparameter | Value |
| ----------------------------------- | ----- |
| Replay buffer size | 1e5 |
| Batch size | 256 |
| Discount factor (gamma) | 0.99 |
| TAU (soft target update) | 1e-3 |
| Actor learning rate | 1e-3 |
| Critic learning rate | 1e-3 |
| Update interval | 1 |
| Update times per interval | 1 |
| Number of episodes | 2000 (max) |
| Max number of timesteps per episode | 1000 |
| Number of atoms | 51 |
| Vmin | -10 |
| Vmax | +10 |


### Results

| MADDPG (Multi-Agent Deep Deterministic Policy Gradient) |
| ---------- |
| ![MADDPG][maddpg] |

### Future Work

After 2 months with the excellent knowledge that this course has given us, I can say that I have taken a big step towards mastering this area. I am able to implement different algorithms and to select a suitable one for each problem.
In this project I achieved a very good result: the target average reward (> 0.50) was reached in fewer than 200 episodes, and by episode 250 the average reward was 1.338 :muscle:. But I wonder whether the performance would be even better with prioritized experience replay? I will work on that, and if it gives a better result, I will share the results with you :grinning:
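As a rough starting point for the prioritized experience replay idea mentioned above, the sketch below shows one common "proportional" variant. The class name, the `alpha`/`beta` exponents, and the list-based storage are illustrative assumptions, not code from this repository (a production version would typically use a sum-tree for efficient sampling).

```python
# Minimal "proportional" prioritized replay sketch -- an assumption, not this repository's code.
import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, buffer_size=int(1e5), alpha=0.6):
        self.alpha = alpha                      # 0 = uniform sampling, 1 = fully prioritized
        self.buffer_size = buffer_size
        self.memory = []
        self.priorities = []

    def add(self, transition):
        # New transitions get the current max priority so they are sampled at least once.
        self.memory.append(transition)
        self.priorities.append(max(self.priorities, default=1.0))
        if len(self.memory) > self.buffer_size:
            self.memory.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size, beta=0.4):
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        idxs = np.random.choice(len(self.memory), batch_size, p=probs)
        # Importance-sampling weights correct the bias from non-uniform sampling.
        weights = (len(self.memory) * probs[idxs]) ** (-beta)
        weights /= weights.max()
        return [self.memory[i] for i in idxs], idxs, weights

    def update_priorities(self, idxs, td_errors, eps=1e-5):
        # Priority is proportional to the magnitude of the TD error.
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(float(err)) + eps
```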