Overview: Policy-Based Methods

In policy-based methods, we directly learn to approximate the optimal policy $\pi ^ *$ without having to learn a value function. We parameterize the policy by using an neural network which will output a probability distribution over actions (Stochastic Policy). Policy-gradient methods is a subclass of policy-based methods in which we search directly for the optimal policy. We optimize the parameter directly by performing the gradient ascent on the performance of the objective function.

Reinforce Algorithm

The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that uses an estimated return from an entire episode to update the policy parameter $\theta$. The following is an implementation of the Reinforce algorithm in the CartPole-v1 environment.

Results

Mean_Reward: 500 +/- 0.00

replay.mp4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Overview: Policy-Based Methods

Reinforce Algorithm

Results

Files

README.md

Latest commit

History

README.md

File metadata and controls

Overview: Policy-Based Methods

Reinforce Algorithm

Results