
Commit 2d9679e

Merge commit 2d9679e (2 parents: 90835ae + fc87bfe)

File tree: 2 files changed, +100 −13 lines


Project-2_Continuous-Control/README.md

Lines changed: 16 additions & 7 deletions
@@ -109,12 +109,21 @@ To start training your own agent, all you have to do is to follow the instructio

# Future Work

- While working on this project I dealt with many algorithms that can be used to solve this problem. Some of these algorithms I have already implemented, such as DDPG (not provided here) & D4PG, and there are other algorithms like:
+ While working on this project, I had to invest a great deal of time researching the right algorithm for this kind of problem. There were many options available, which was a challenge in itself, and that is where my journey began.

- - Proximal Policy Optimization (PPO)
- - Asynchronous Advantage Actor-Critic (A3C)
- - Trust Region Policy Optimization (TRPO)
-
- After implementing these algorithms I will update this repo and share the results with you :smiley:
-
+ There is a very useful [repo](https://github.com/ShangtongZhang/DeepRL) that describes and implements different algorithms that work well for problems with a continuous action space. Thanks to this repo and other sources, I was able to properly understand several algorithms, including DDPG, D4PG, PPO, A2C, and A3C, and to implement some of them to solve this problem.
+
+ Here are some ideas for improvement:
+
+ * Implementing the TRPO, PPO, A3C, and A2C algorithms:
+
+   It is worthwhile to implement all of these algorithms, so I will work on them in the coming days and see which converges fastest.
+
+ * Adjusting the hyperparameters:
+
+   The most important step I can take to improve the results, and perhaps solve the environment in 100 episodes or fewer, is to adjust the hyperparameters.
+
+ * Using prioritized experience replay and N-step returns:
+
+   As mentioned in this paper (https://openreview.net/forum?id=SyZipzbCb), using these techniques with D4PG could potentially lead to better results.

Lines changed: 84 additions & 6 deletions
@@ -1,25 +1,23 @@
[//]: # (Image References)

[actor-critic]: Continuous-Control/images/actor-critic.png "ac"
[d4pg]: Continuous-Control/images/d4pg.png "d4pg"


# Continuous Control

In this report I will explain everything about this project in detail. We will look at the following aspects:

- **Actor-Critic**
- **Distributed Distributional Deep Deterministic Policy Gradients (D4PG) algorithm**
- **Model architectures**
- **Hyperparameters**
- **Result**
- **Future Work**


### Actor-Critic

Actor-critic algorithms are the basis of almost every modern RL method, such as PPO, A3C, and many more. So to understand all of these newer techniques, you first need a good understanding of what actor-critic is and how it works.

Let us first distinguish between value-based and policy-based methods:

@@ -29,7 +27,87 @@ Each method has its advantages. For example, policy-based methods are better sui

Actor-critic methods aim to combine the strengths of both value-based and policy-based methods while eliminating their disadvantages.

The basic idea is to split the model into two parts: one that computes an action based on a state, and another that estimates the Q-value of that action.

![ac][actor-critic]

**Actor**: decides which action to take.

**Critic**: tells the actor how good its action was and how it should adjust.
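
To make this division of labour concrete, here is a minimal, generic sketch of the two update steps in PyTorch (the function and variable names are illustrative and not taken from this repository; a DDPG-style deterministic actor is assumed):

```python
import torch.nn.functional as F

def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        states, actions, targets):
    """One generic actor-critic update from a batch of experience."""
    # Critic: learn to evaluate -- regress Q(s, a) towards the bootstrapped target.
    critic_loss = F.mse_loss(critic(states, actions), targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: learn to act -- propose actions the critic rates highly,
    # i.e. maximise Q(s, actor(s)) by minimising its negative.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

In D4PG the critic loss becomes a cross-entropy between value distributions rather than a mean-squared error, as described in the next section.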

### Distributed Distributional Deep Deterministic Policy Gradients (D4PG) algorithm

The core idea of this algorithm is to replace the critic's single Q-value with N_ATOMS values, corresponding to the probabilities of returns in a pre-defined value range. The Bellman update is replaced by a distributional Bellman operator, which transforms this distributional representation in a similar way.
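
As a rough illustration of that idea (a sketch that assumes a C51-style categorical critic, not the exact code of this repository), the target distribution can be projected back onto the fixed support like this:

```python
import torch

N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
support = torch.linspace(V_MIN, V_MAX, N_ATOMS)   # fixed set of possible return values
delta_z = (V_MAX - V_MIN) / (N_ATOMS - 1)

def project_distribution(next_probs, rewards, dones, gamma=0.99):
    """Distributional Bellman update: project r + gamma * z onto the fixed support.

    next_probs: (batch, N_ATOMS) probabilities from the target critic
    rewards, dones: (batch, 1) float tensors (dones are 0.0 / 1.0)
    Returns the projected probabilities, used as the training target
    for the local critic's predicted distribution.
    """
    tz = (rewards + gamma * (1.0 - dones) * support.unsqueeze(0)).clamp(V_MIN, V_MAX)
    b = (tz - V_MIN) / delta_z                     # fractional atom index of each shifted value
    lower, upper = b.floor().long(), b.ceil().long()
    # If b lands exactly on an atom, nudge the indices so the mass is not lost.
    lower[(upper > 0) & (lower == upper)] -= 1
    upper[(lower < N_ATOMS - 1) & (lower == upper)] += 1
    projected = torch.zeros_like(next_probs)
    projected.scatter_add_(1, lower, next_probs * (upper.float() - b))
    projected.scatter_add_(1, upper, next_probs * (b - lower.float()))
    return projected
```

The critic loss is then a cross-entropy of the form `-(projected * log_probs).sum(dim=1).mean()` between this projected target and the local critic's predicted log-probabilities.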

### Model architectures

**Actor Architecture**

Both actor networks (local and target) consist of 3 fully connected layers (2 hidden layers, 1 output layer), with each hidden layer followed by a ReLU activation function and a batch normalization layer.

The numbers of neurons in the fully connected layers are as follows:

- fc1: 400 neurons
- fc2: 300 neurons
- fc3: 4 neurons (the number of actions)

**Critic Architecture**

Both critic networks (local and target) consist of 3 fully connected layers (2 hidden layers, 1 output layer), with each hidden layer followed by a ReLU activation function.

The numbers of neurons in the fully connected layers are as follows (a sketch of both networks in code follows this list):

- fc1: 400 neurons
- fc2: 300 neurons
- fc3: 51 neurons (the number of atoms)
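
Here is what these two networks could look like in PyTorch, as a sketch only: the state size (33 here), the placement of batch normalization, the point at which the critic consumes the action, and the tanh on the actor's output are assumptions on my part and may differ from the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """state -> 400 -> 300 -> actions, ReLU + BatchNorm after each hidden layer."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1, self.bn1 = nn.Linear(state_size, 400), nn.BatchNorm1d(400)
        self.fc2, self.bn2 = nn.Linear(400, 300), nn.BatchNorm1d(300)
        self.fc3 = nn.Linear(300, action_size)

    def forward(self, state):
        x = self.bn1(F.relu(self.fc1(state)))
        x = self.bn2(F.relu(self.fc2(x)))
        return torch.tanh(self.fc3(x))             # actions bounded in [-1, 1]

class Critic(nn.Module):
    """(state, action) -> 400 -> 300 -> 51 atoms (a categorical value distribution)."""
    def __init__(self, state_size=33, action_size=4, n_atoms=51):
        super().__init__()
        self.fc1 = nn.Linear(state_size + action_size, 400)   # action concatenated at the input (assumption)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, n_atoms)

    def forward(self, state, action, log=False):
        x = F.relu(self.fc1(torch.cat([state, action], dim=-1)))
        x = F.relu(self.fc2(x))
        logits = self.fc3(x)
        return F.log_softmax(logits, dim=-1) if log else F.softmax(logits, dim=-1)
```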

### Hyperparameters

There were many hyperparameters involved in the experiment. The value of each of them is given in the table below, followed by a sketch of how they might appear in code:

| Hyperparameter                      | Value      |
| ----------------------------------- | ---------- |
| Replay buffer size                  | 1e5        |
| Batch size                          | 256        |
| Discount factor (gamma)             | 0.99       |
| TAU (soft-update factor)            | 1e-3       |
| Actor learning rate                 | 1e-3       |
| Critic learning rate                | 1e-3       |
| Update interval                     | 1          |
| Updates per interval                | 1          |
| Number of episodes                  | 2000 (max) |
| Max number of timesteps per episode | 1000       |
| Number of atoms                     | 51         |
| Vmin                                | -10        |
| Vmax                                | +10        |
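
Collected in code, these settings might look roughly like the following (the constant names are illustrative and not necessarily those used in this repository); the small helper shows what TAU is used for:

```python
BUFFER_SIZE = int(1e5)      # replay buffer size
BATCH_SIZE = 256            # minibatch size
GAMMA = 0.99                # discount factor
TAU = 1e-3                  # soft-update factor for the target networks
LR_ACTOR = 1e-3             # actor learning rate
LR_CRITIC = 1e-3            # critic learning rate
UPDATE_EVERY = 1            # steps between learning updates
UPDATES_PER_INTERVAL = 1    # learning passes per update
N_EPISODES = 2000           # maximum number of episodes
MAX_T = 1000                # maximum timesteps per episode
N_ATOMS = 51                # atoms in the value distribution
V_MIN, V_MAX = -10.0, 10.0  # support range of the value distribution

def soft_update(local_net, target_net, tau=TAU):
    """Blend a small fraction of the local weights into the target network."""
    for t_param, l_param in zip(target_net.parameters(), local_net.parameters()):
        t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)
```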

### Result

The required average reward is reached after 294 episodes.

![d4pg][d4pg]

### Future Work

While working on this project, I had to invest a great deal of time researching the right algorithm for this kind of problem. There were many options available, which was a challenge in itself, and that is where my journey began.

There is a very useful [repo](https://github.com/ShangtongZhang/DeepRL) that describes and implements different algorithms that work well for problems with a continuous action space. Thanks to this repo and other sources, I was able to properly understand several algorithms, including DDPG, D4PG, PPO, A2C, and A3C, and to implement some of them to solve this problem.

Here are some ideas for improvement:

* Implementing the TRPO, PPO, A3C, and A2C algorithms:

  It is worthwhile to implement all of these algorithms, so I will work on them in the coming days and see which converges fastest.

* Adjusting the hyperparameters:

  The most important step I can take to improve the results, and perhaps solve the environment in 100 episodes or fewer, is to adjust the hyperparameters.

* Using prioritized experience replay and N-step returns:

  As mentioned in this paper (https://openreview.net/forum?id=SyZipzbCb), using these techniques with D4PG could potentially lead to better results. A brief sketch of the N-step idea follows.
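
For the N-step part, here is a minimal sketch of how a single N-step transition could be built from the first n steps of a trajectory before it is stored in the replay buffer (names and structure are illustrative; prioritized replay would additionally sample such transitions in proportion to their TD error):

```python
def make_n_step_transition(trajectory, gamma=0.99, n=5):
    """Collapse the first n steps of a trajectory into one stored transition.

    trajectory: list of (state, action, reward, next_state, done) tuples.
    Returns (state, action, n_step_return, bootstrap_state, done); the stored
    transition is later bootstrapped with the accumulated discount
    (gamma ** n for a full window).
    """
    state, action = trajectory[0][0], trajectory[0][1]
    n_step_return, discount = 0.0, 1.0
    for _, _, reward, next_state, done in trajectory[:n]:
        n_step_return += discount * reward
        discount *= gamma
        bootstrap_state, terminal = next_state, done
        if done:                 # episode ended before n steps were collected
            break
    return state, action, n_step_return, bootstrap_state, terminal
```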
