In policy-based methods, we learn to approximate the optimal policy directly, without first learning a value function.
The Reinforce algorithm, also called Monte-Carlo policy gradient, is a policy-gradient algorithm that uses the estimated return from an entire episode to update the policy parameters.
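As a rough illustration of the update described above, here is a minimal sketch in plain Python. It assumes a toy, state-independent softmax policy whose parameters are one logit per action. The names `discounted_returns`, `softmax`, and `reinforce_update`, as well as the `gamma` and `lr` defaults, are hypothetical and chosen only for this example. A practical agent would normally use a neural network and an autodiff library instead.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for each step of one finished episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, episode, gamma=0.99, lr=0.1):
    """Apply one Monte-Carlo policy-gradient step.

    theta   : list of logits, one per action (assumed toy policy, no state input)
    episode : list of (action, reward) pairs from a single complete episode
    """
    rewards = [r for _, r in episode]
    returns = discounted_returns(rewards, gamma)
    probs = softmax(theta)  # policy that generated the episode
    new_theta = list(theta)
    for (action, _), g in zip(episode, returns):
        for a in range(len(theta)):
            # d/d theta_a of log pi(action) for a softmax policy
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            # Gradient ascent: scale the log-probability gradient by the return
            new_theta[a] += lr * g * grad_log_prob
    return new_theta
```

The update only runs after the episode has ended, because every return `G_t` depends on all rewards that follow step `t`. Actions followed by a high return have their logit pushed up, while the others are pushed down.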
Mean_Reward: 500 +/- 0.00