- Suppose that our environment has a 4-dimensional continuous state space and two discrete actions.
- So the actor network takes a 4D state vector and returns the logits over the two actions for that state.
- In other words, the actor network returns the distribution of actions given a state, which is exactly the role of a policy function.
- And the critic network returns the value of a state.
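- As a concrete sketch (the layer sizes and activations below are illustrative choices, not necessarily those of the example code), the two networks could be built in Keras like this:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Actor: maps a 4-D state vector to logits over the 2 discrete actions,
# i.e. it parameterizes the policy pi_theta(a|s).
actor = keras.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(64, activation="tanh"),
    layers.Dense(64, activation="tanh"),
    layers.Dense(2),  # one logit per action
])

# Critic: maps the same 4-D state vector to a single scalar estimate V_phi(s).
critic = keras.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(64, activation="tanh"),
    layers.Dense(64, activation="tanh"),
    layers.Dense(1),  # state value
])
```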
- To train our actor network, we first need to define its loss function.
- It is defined as follows:

$$L(s, a, \theta_k, \theta) = \min\!\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}\, A^{\pi_{\theta_k}}(s,a),\;\; \text{clip}\!\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\, 1-\epsilon,\, 1+\epsilon \right) A^{\pi_{\theta_k}}(s,a) \right)$$

- where $\pi_\theta$ is our new policy and $\pi_{\theta_k}$ is the old policy.
- So we can decompose our loss function as follows:

$$L(s, a, \theta_k, \theta) = \begin{cases} \min\!\left( \dfrac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\; 1+\epsilon \right) A^{\pi_{\theta_k}}(s,a) & \text{if } A^{\pi_{\theta_k}}(s,a) \ge 0 \\[2ex] \max\!\left( \dfrac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\; 1-\epsilon \right) A^{\pi_{\theta_k}}(s,a) & \text{if } A^{\pi_{\theta_k}}(s,a) < 0 \end{cases}$$
- In this loss function, there is a kind of regularization on the new policy ($\pi_\theta$).
- When the advantage is positive, the objective will increase if the action becomes more likely, that is, if $\pi_{\theta}(a|s)$ increases.
- But the min in this term puts a limit on how much the objective can increase.
- Once $\pi_{\theta}(a|s) > (1+\epsilon)\, \pi_{\theta_k}(a|s)$, the min kicks in and this term hits a ceiling of $(1+\epsilon) A^{\pi_{\theta_k}}(s,a)$.
- Thus, the new policy does not benefit by going far away from the old policy.
- Likewise, when the advantage is negative, the objective will increase if the action becomes less likely, that is, if $\pi_{\theta}(a|s)$ decreases.
- But the max in this term puts a limit on how much the objective can increase.
- Once $\pi_{\theta}(a|s) < (1-\epsilon)\, \pi_{\theta_k}(a|s)$, the max kicks in and this term hits a ceiling of $(1-\epsilon) A^{\pi_{\theta_k}}(s,a)$.
- Thus, again, the new policy does not benefit by going far away from the old policy.
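- As a minimal sketch of how this clipped objective can be implemented (the function name and the default clip ratio of 0.2 are my own choices, not taken from the example code):

```python
import tensorflow as tf

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_ratio=0.2):
    """Negative PPO-Clip objective, averaged over a batch (minimizing this
    performs gradient ascent on the objective above)."""
    # ratio = pi_theta(a|s) / pi_theta_k(a|s), computed from log-probabilities
    ratio = tf.exp(new_log_probs - old_log_probs)
    # clip the ratio to [1 - epsilon, 1 + epsilon]
    clipped_ratio = tf.clip_by_value(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    # elementwise min of the unclipped and clipped surrogate terms
    surrogate = tf.minimum(ratio * advantages, clipped_ratio * advantages)
    return -tf.reduce_mean(surrogate)
```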
- In our example code, our advantage estimate is computed as follows:

$$\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

- where $r_t + \gamma V_\phi(s_{t+1})$ is the estimated value of state $s_t$, based on the Bellman equation.
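- A minimal sketch of this one-step advantage estimate (the array layout, with one extra bootstrap value $V_\phi(s_T)$ at the end, is an assumption of this sketch, not necessarily how the example code stores its data):

```python
import numpy as np

def one_step_advantages(rewards, values, gamma=0.99):
    """A_hat_t = r_t + gamma * V(s_{t+1}) - V(s_t) for one trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_{T-1}), V(s_T)]  (one extra bootstrap value)
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.asarray(values, dtype=np.float32)
    return rewards + gamma * values[1:] - values[:-1]
```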
- Our reward-to-go estimate is computed as follows:

$$\hat{R}_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r_i$$

- where the rewards $r_i$ are collected along our agent's trajectory.
- Basically, our critic network, or value function, aims to predict a state's value.
- And a state value is an expectation of the return, which is a discounted sum of future rewards.
- So it makes sense to define the critic network's loss as $\text{MSE}(V_\phi(s_t),~ \hat R_t)$.
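- A minimal sketch of the reward-to-go targets and the corresponding critic loss (the function names are illustrative; the example code may organize this differently):

```python
import numpy as np
import tensorflow as tf

def rewards_to_go(rewards, gamma=0.99):
    """R_hat_t = sum_{i >= t} gamma^(i - t) * r_i, computed backwards over one episode."""
    out = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def critic_loss(values, returns):
    """MSE between the critic's predictions V_phi(s_t) and the targets R_hat_t."""
    return tf.reduce_mean(tf.square(values - returns))
```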
- OpenAI Spinning Up, "Proximal Policy Optimization".
- Ilias Chrysovergis (2021), "Proximal Policy Optimization" (Keras tutorial).
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov (OpenAI) (2017), "Proximal Policy Optimization Algorithms".