Agent who chooses between
The first version of the K-Armed Bandit agent is the Random Agent, which chooses arms at random. So the
Next option is
The Optimistic Agent starts with a large value estimate for each arm. At each iteration it chooses the arm with the highest estimate, so the estimates become smaller and smaller as real rewards come in, and along the way the agent explores many arms.
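The optimistic strategy can be sketched as follows. This is only an illustration under stated assumptions, not the original implementation: the function name `run_optimistic_agent`, the Bernoulli (0 or 1) rewards, the initial value of 5.0, and the constant step size 0.1 are all chosen here for the example.

```python
import random

def run_optimistic_agent(win_probs, steps=1000, initial_value=5.0,
                         step_size=0.1, seed=0):
    """Greedy agent with optimistic initial estimates (illustrative sketch).

    win_probs: assumed Bernoulli success probability of each arm.
    """
    rng = random.Random(seed)
    k = len(win_probs)
    # Optimistic start: every estimate is far above any achievable reward (max 1.0).
    estimates = [initial_value] * k
    counts = [0] * k
    for _ in range(steps):
        # Purely greedy choice: the arm with the highest current estimate.
        arm = max(range(k), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        # Move the estimate a fraction of the way toward the observed reward,
        # so an over-optimistic estimate shrinks each time the arm is pulled.
        estimates[arm] += step_size * (reward - estimates[arm])
    return estimates, counts

estimates, counts = run_optimistic_agent([0.2, 0.5, 0.8])
print(estimates, counts)
```

Because every reward is below the initial estimate, pulling an arm always lowers its estimate at first, which pushes the greedy choice on to the other arms. That is how a greedy agent ends up exploring.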
Upper Confidence Bound Agent chooses the best
The agent is in a labyrinth. Every step it takes earns a negative reward. In addition, each place gives a reward associated with that state. The crux of the problem is to find the best way to the exit: the agent chooses the policy that incurs the least total punishment.
The agent remembers all states and actions. It gets a reward when it wins and a punishment when it loses. At each step it uses a fixed policy to decide whether to hit or draw, and it learns a new policy from the statistics of wins and losses.
The agent has to get from the start to the exit of the grid, but there is a cliff: when the agent steps onto the cliff it falls, receives a large punishment, and has to start walking again from the start. At each step it gets
The goal is to find the linear function that best fits the data. We define a cost function and aim to minimize it. The best model is found using gradient descent, and the algorithm should continue until the cost stops decreasing.
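A minimal sketch of this procedure is shown below. It assumes a single input feature, a mean squared error cost, and a hypothetical helper `fit_line`. The learning rate and stopping tolerance are illustrative values, not ones taken from the original work.

```python
def fit_line(xs, ys, learning_rate=0.05, tolerance=1e-10, max_iters=100_000):
    """Fit y ~ w * x + b by batch gradient descent (illustrative sketch)."""
    n = len(xs)
    w, b = 0.0, 0.0
    prev_cost = float("inf")
    cost = prev_cost
    for _ in range(max_iters):
        # Prediction error for every sample under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Cost: half the mean squared error.
        cost = sum(e * e for e in errors) / (2 * n)
        # Stop once the cost no longer decreases by more than the tolerance.
        if prev_cost - cost < tolerance:
            break
        prev_cost = cost
        # Gradients of the cost with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, cost

# Points generated from y = 2x + 1, so w should approach 2 and b should approach 1.
w, b, cost = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(w, b, cost)
```

The stopping test compares the cost between consecutive iterations, which is one simple way to express "until the cost stops decreasing". A learning rate that is too large makes the cost grow, and the same test then ends the loop early.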