In this report I will explain everything about this project in detail. We will look at the following aspects:

- **Actor-Critic**
- **Distributed distributional deep deterministic policy gradients (D4PG) algorithm**
- **Model architectures**
- **Hyperparameters**
- **Result**
- **Future Work**

### Actor-Critic

Actor-critic algorithms are the basis of almost every modern RL method, such as PPO, A3C, and many more. So to understand these newer techniques, you definitely need a good understanding of what actor-critic is and how it works.

Let us first distinguish between value-based and policy-based methods:

In value-based methods we learn a value function that estimates how good each state (or state-action pair) is and derive the policy from it, whereas in policy-based methods we learn the policy directly.

Each method has its advantages. For example, policy-based methods are better suited to problems with continuous action spaces, such as this one.

Actor-critic methods aim to combine the strengths of both the value-based and policy-based approaches while eliminating their disadvantages.

The basic idea is to split the model into two parts: one that chooses an action based on a state (the actor) and another that estimates the Q-value of that action (the critic).

![ac][actor-critic]

Actor: decides which action to take.

Critic: tells the actor how good its action was and how it should adjust.
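
To make the split concrete, here is a minimal PyTorch sketch of the two parts, kept deliberately small. The layer sizes, class names, and the tanh output are illustrative assumptions, not the networks used in this project (those are described under Model architectures below).

```python
import torch
import torch.nn as nn

class SimpleActor(nn.Module):
    """The 'actor' part: maps a state to an action."""
    def __init__(self, state_size, action_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 128), nn.ReLU(),
            nn.Linear(128, action_size), nn.Tanh(),   # keep actions in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class SimpleCritic(nn.Module):
    """The 'critic' part: scores a (state, action) pair with a Q-value."""
    def __init__(self, state_size, action_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size + action_size, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```
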
### Distributed distributional deep deterministic policy gradients (D4PG) algorithm

The core idea of this algorithm is to replace the critic's single Q-value with N_ATOMS values, corresponding to the probabilities of values from a pre-defined range. The Bellman equation is replaced by the distributional Bellman operator, which transforms this distributional representation in a similar way.
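
As a rough illustration of how that Bellman operator can be applied, the sketch below shows the categorical projection step used in C51/D4PG-style distributional critics. The constants mirror the hyperparameters reported later in this document; the function name, tensor shapes, and edge-case handling are assumptions rather than this repository's exact code.

```python
import torch

# Values mirroring the hyperparameters reported below; names are illustrative.
N_ATOMS, V_MIN, V_MAX, GAMMA = 51, -10.0, 10.0, 0.99
DELTA_Z = (V_MAX - V_MIN) / (N_ATOMS - 1)
SUPPORT = torch.linspace(V_MIN, V_MAX, N_ATOMS)   # the pre-defined value range

def project_distribution(next_probs, rewards, dones):
    """Distributional Bellman backup: shift the target critic's atom
    probabilities through r + gamma * z and project them back onto SUPPORT.

    next_probs: (batch, N_ATOMS) probabilities from the target critic
    rewards, dones: (batch, 1) tensors, with dones containing 0/1 entries
    """
    # Apply the Bellman operator to every atom of the support.
    tz = (rewards + GAMMA * (1.0 - dones) * SUPPORT.unsqueeze(0)).clamp(V_MIN, V_MAX)
    b = ((tz - V_MIN) / DELTA_Z).clamp(0, N_ATOMS - 1)   # fractional atom index
    lower, upper = b.floor().long(), b.ceil().long()

    # Split each atom's probability mass between the two neighbouring bins;
    # if an atom lands exactly on a bin, give that bin the full mass.
    projected = torch.zeros_like(next_probs)
    exact = (upper == lower).float()
    projected.scatter_add_(1, lower, next_probs * (upper.float() - b + exact))
    projected.scatter_add_(1, upper, next_probs * (b - lower.float()))
    return projected
```
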
### Model architectures

**Actor Architecture**

Both actor networks (local and target) consist of 3 fully-connected layers (2 hidden layers and 1 output layer), with each hidden layer followed by a ReLU activation and a batch-normalization layer.

The number of neurons in each fully-connected layer is as follows (a code sketch follows the list):

- fc1: 400 neurons
- fc2: 300 neurons
- fc3: 4 neurons (the number of actions)
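
A minimal PyTorch sketch of an actor matching this description. The state size is left as a parameter, and the tanh output layer (to keep actions in [-1, 1]) is an assumption, since only the layer sizes and hidden-layer activations are specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class D4PGActor(nn.Module):
    """Sketch of the actor described above: 400 -> 300 -> action_size,
    each hidden layer followed by ReLU and batch normalization."""

    def __init__(self, state_size, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 400)
        self.bn1 = nn.BatchNorm1d(400)
        self.fc2 = nn.Linear(400, 300)
        self.bn2 = nn.BatchNorm1d(300)
        self.fc3 = nn.Linear(300, action_size)

    def forward(self, state):
        x = self.bn1(F.relu(self.fc1(state)))
        x = self.bn2(F.relu(self.fc2(x)))
        return torch.tanh(self.fc3(x))   # assumed action squashing
```
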

**Critic Architecture**

Both critic networks (local and target) consist of 3 fully-connected layers (2 hidden layers and 1 output layer), with each hidden layer followed by a ReLU activation.

The number of neurons in each fully-connected layer is as follows (a code sketch follows the list):

- fc1: 400 neurons
- fc2: 300 neurons
- fc3: 51 neurons (the number of atoms)
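
A matching sketch of the distributional critic. Concatenating the state and action at the input and converting the 51 outputs into probabilities with a softmax are assumptions; only the layer sizes and ReLU activations are specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class D4PGCritic(nn.Module):
    """Sketch of the critic described above: 400 -> 300 -> n_atoms.
    It outputs a probability distribution over the value support
    rather than a single Q-value."""

    def __init__(self, state_size, action_size=4, n_atoms=51):
        super().__init__()
        self.fc1 = nn.Linear(state_size + action_size, 400)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, n_atoms)

    def forward(self, state, action):
        x = F.relu(self.fc1(torch.cat([state, action], dim=-1)))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=-1)   # probabilities over the atoms
```
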
### Hyperparameters
There were many hyperparameters involved in the experiment. The value of each of them is given below:

| Hyperparameter                      | Value      |
| ----------------------------------- | ---------- |
| Replay buffer size                  | 1e5        |
| Batch size                          | 256        |
| Discount factor (gamma)             | 0.99       |
| Soft-update factor (tau)            | 1e-3       |
| Actor learning rate                 | 1e-3       |
| Critic learning rate                | 1e-3       |
| Update interval                     | 1          |
| Updates per interval                | 1          |
| Number of episodes                  | 2000 (max) |
| Max number of timesteps per episode | 1000       |
| Number of atoms                     | 51         |
| Vmin                                | -10        |
| Vmax                                | +10        |
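
For reference, here is how these values might look as constants in a training script; the constant names are illustrative and not necessarily the ones used in this repository.

```python
# Hyperparameters from the table above; constant names are illustrative.
BUFFER_SIZE = int(1e5)       # replay buffer size
BATCH_SIZE = 256             # minibatch size
GAMMA = 0.99                 # discount factor
TAU = 1e-3                   # soft-update factor for the target networks
LR_ACTOR = 1e-3              # actor learning rate
LR_CRITIC = 1e-3             # critic learning rate
UPDATE_EVERY = 1             # environment steps between learning updates
UPDATES_PER_INTERVAL = 1     # learning passes per update interval
N_EPISODES = 2000            # maximum number of episodes
MAX_T = 1000                 # maximum timesteps per episode
N_ATOMS = 51                 # atoms of the value distribution
V_MIN, V_MAX = -10.0, 10.0   # support range of the value distribution
```
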
### Result

The agent reaches the required average reward after 294 episodes, as shown in the training plot below.

![d4pg][d4pg]

### Future Work

While working on this project, I had to invest a lot of time in research to find the right algorithm for this kind of problem. There were many options available, which was a challenge in itself, and that is where my journey began.

There is a very useful [repo](https://github.com/ShangtongZhang/DeepRL) that describes and implements different algorithms that work well for problems with a continuous action space. Thanks to this repo and other sources, I was able to properly understand several algorithms, including DDPG, D4PG, PPO, A2C, and A3C, and to implement some of them to solve this problem.

Here are some ideas for improvement:

* Implementing the TRPO, PPO, A3C, and A2C algorithms:

  It is worthwhile to implement all of these algorithms, so I will work on this in the coming days and see which of them converges fastest.

* Adjusting the hyperparameters:

  An even more important step I can take to improve the results, and to solve the problem in 100 episodes or fewer, is to tune the hyperparameters.

* Using prioritized experience replay and N-step techniques:

  As mentioned in [this paper](https://openreview.net/forum?id=SyZipzbCb), using these techniques together with D4PG could potentially lead to better results (see the N-step sketch below).
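
As a rough illustration of the N-step idea (not code from this repository), the last N transitions can be collapsed into one transition whose reward is the discounted N-step return and whose bootstrap discount becomes gamma^N:

```python
from collections import deque

GAMMA = 0.99      # discount factor, as in the table above
N_STEPS = 5       # illustrative choice of N

def make_n_step_transition(window):
    """Collapse a window of the last N_STEPS (state, action, reward,
    next_state, done) tuples into one N-step transition."""
    state, action = window[0][0], window[0][1]
    n_step_reward, discount = 0.0, 1.0
    for _, _, reward, next_state, done in window:
        n_step_reward += discount * reward
        discount *= GAMMA
        if done:                      # stop accumulating at episode end
            break
    # 'discount' (= GAMMA**k) replaces GAMMA when bootstrapping from next_state.
    return state, action, n_step_reward, next_state, done, discount

# Usage sketch: keep a rolling window of recent transitions while interacting,
# and store the collapsed N-step transition in the replay buffer once it is full.
window = deque(maxlen=N_STEPS)
```
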