Reinforcement Learning
- prediction vs. control
- MDPs vs. Bandits (one state, non-sequential)
- Tabular vs. Function Approximation
- On-Policy vs. Off-Policy
- Bootstrapping vs. Monte Carlo (unified by eligibility traces)
- Model-based vs. Model-free
- Value-based vs. Policy-based
And yet there is still a unified convergence: all of the vs.'s above can be turned into and's
Prediction: evaluate a future. How much reward will the agent get by following a given policy?
- Given a policy
Control: optimise a future. What is the optimal policy for achieving the goal?
- Find the best policy
Often, we need to solve the prediction problem to solve the control problem
Whereas the reward signal (R, a distribution over all states S) indicates what is good in an immediate sense, a value function (V, a distribution over all states S) specifies what is good in the long run.
Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states.
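In symbols (a standard definition, included here for reference; γ is the discount factor): the value of state s under policy π is the expected discounted return when starting in s and following π thereafter.

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^{k}\, R_{t+k+1} \;\middle|\; S_t = s \right]
```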
"The most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values. The central role of value estimation is arguably the most important thing that has been learned about reinforcement learning over the last six decades." — Richard Sutton
- Initialize a state-action table (array) 𝑄(𝑠,𝑎) arbitrarily (zeros, guesses)
- Choose actions in any way, perhaps based on 𝑄, such that all actions are taken in all states (infinitely often in the limit)
- On each time step, change one element of the array:
Δ𝑄( St, At ) = α( Rt+1 + γ maxa 𝑄( St+1, a ) − 𝑄( St, At ) )
- If desired, reduce the step-size parameter α over time
Theorem: For an appropriate choice of reduction in α, 𝑄 converges to q*, and its greedy policy to an optimal policy π*
Learning optimal behavior without any model of the environment, for arbitrary MDPs!
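A minimal sketch of the tabular procedure above in Python, assuming a toy environment with integer states and actions and a hypothetical reset()/step() interface (n_states, n_actions, and the interface itself are placeholders, not part of these notes):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning. env is assumed to expose reset() -> s and
    step(a) -> (s_next, reward, done) with integer states and actions."""
    Q = np.zeros((n_states, n_actions))        # arbitrary initialization (zeros)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior so every action keeps being tried
            a = (np.random.randint(n_actions) if np.random.rand() < epsilon
                 else int(np.argmax(Q[s])))
            s_next, r, done = env.step(a)
            # change one element of the table:
            # Q(s,a) += alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```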
Agent state is a feature vector
- xt = state_update(xt-1, At-1, Ot)
- Ot = Observation at time t
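One illustrative (assumed, not from these notes) choice of state_update: an exponential recency-weighted average of the latest observation and last action, kept as a fixed-length vector:

```python
import numpy as np

def state_update(x_prev, a_prev_onehot, obs, decay=0.9):
    """Blend the previous agent state with the new [observation, last action]
    features. x_prev must have the same length as the concatenated input."""
    u = np.concatenate([obs, a_prev_onehot])   # features for this time step
    return decay * x_prev + (1.0 - decay) * u  # recency-weighted agent state
```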
The Deep Q-Network variant (which is sometimes even referred to simply as Deep Q-Learning or just Q-Learning) replaces the state-action table with a neural network in order to cope with large-scale tasks, where the number of possible state-action pairs can be enormous.
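A rough sketch of that idea with PyTorch (layer sizes, the 4-dimensional state, and the 2 actions are arbitrary CartPole-like placeholders; the separate target network is the usual DQN stabilization trick, not something described in these notes):

```python
import torch
import torch.nn as nn

# Q-network: maps a state vector to one Q-value per action
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())   # periodically re-synced copy

def dqn_loss(batch, gamma=0.99):
    # batch: float tensors s (B,4), r (B,), s_next (B,4), done (B,); long tensor a (B,)
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a)
    with torch.no_grad():
        # greedy bootstrap from the target network, zeroed at terminal states
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```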
Update policy: how your agent learns the optimal policy.
Behavior policy: how your agent behaves.
Or, how I learned to stop worrying and escaped the explore/exploit dilemma
Off-policy means learning about the value of a policy (the update policy) other than the policy being used to generate the trajectory (the behavior policy). Off-policy learning can find a better policy than on-policy learning, but with worse online performance.
The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state 𝑠′ and the greedy action 𝑎′. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.
On-policy means learning about the value of a policy (the update policy) that is the same policy being used to generate the trajectory (the behavior policy). On-policy methods estimate the value of a policy while using it for control.
On-policy methods perform better (more reward per episode) but find worse policies than Off-policy methods. (see cliff-walking example, Sutton, where Sarsa gets more reward per episode but Q-learning is better at finding optimal path)
The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state 𝑠′ and the action 𝑎′ the current policy actually takes there. It estimates the return for state-action pairs assuming the current policy continues to be followed.
The distinction disappears if the current policy is a greedy policy. However, such an agent would not be good since it never explores.
Why the name SARSA? It is the only learning update that uses exactly these five things from the trajectory: St, At, Rt+1, St+1, At+1
Sarsa is equivalent to the TD( 0 ) algorithm (Sutton 1988) when applied to state-action pairs rather than to states
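The two update rules side by side, as a sketch (Q is a NumPy state-action table; variable names are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # off-policy: bootstrap from the greedy action in s', regardless of what is done next
    return Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # on-policy: bootstrap from the action a' the behavior policy actually took in s'
    return Q[s, a] + alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```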
Guaranteed under prediction and control when function approximation is linear:
- guaranteed convergent when we want to know the value function of the target policy ( Qπ ) (Sutton 1988, Dayan 1992, Tsitsiklis & Van Roy 1997)
- guaranteed non-divergent when we want to find the best policy ( Q* ), with bounded error (though it may "chatter", Gordon 1995)
For general non-linear function approximation, there is one known counterexample, but it is very artificial and contrived
- an ε-greedy policy is a stochastic policy that is usually greedy, but with small probability ε it instead selects an action at random.
- a hacky way to get exploration: when acting at random, we will explore
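As a sketch (names are illustrative), an ε-greedy choice over a vector of action values:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Greedy action, except with probability epsilon pick uniformly at random."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore
    return int(np.argmax(q_values))               # exploit
```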
Represent an action-value function by a parameterized function approximator with parameters θ
q(s, a, θ) ≈ q*(s, a) or qπ(s, a)
The approximator could be a deep neural network, with the weights being the parameters
- or simply a linear weighting of features
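A sketch of the linear case, where x(s, a) is an assumed feature vector for a state-action pair and θ the weight vector:

```python
import numpy as np

def q_hat(theta, features):
    """Linear action-value estimate: q(s, a, theta) = theta . x(s, a)."""
    return float(np.dot(theta, features))

def semi_gradient_step(theta, features, td_error, alpha=0.01):
    # for a linear approximator the gradient of q w.r.t. theta is just x(s, a)
    return theta + alpha * td_error * features
```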
- Function approximation abstracts over state, but we also need to abstract over time
- There are several approaches, including the framework of options: macro-actions of extended and variable duration that can nevertheless interoperate with primitive actions in DP planning methods and TD learning methods (see the sketch after this list).
- The problems of temporal abstraction can be divided into three classes, increasing in difficulty:
- representing temporal abstractions (e.g. by options)
- learning temporal abstractions (e.g. by off-policy methods)
- discovering temporal abstractions and selecting among them, including lookaheads AND histories
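A minimal sketch of how an option can be represented, following the usual (initiation set, policy, termination condition) triple of Sutton, Precup & Singh (1999); integer states are an illustrative assumption:

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """A macro-action with its own internal policy and termination condition."""
    initiation_set: Set[int]             # states where the option may be started
    policy: Callable[[int], int]         # maps state -> primitive action
    termination: Callable[[int], float]  # beta(s): probability of terminating in s
```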