- Q-learning is a model-free reinforcement learning technique.
- Reinforcement learning is a class of machine learning algorithms in which the program learns the ideal behaviour within a specific context, based on reward feedback for transitioning from a given state to another.
In this tutorial we learn how to find an optimal path from one node of a graph to another using Q-learning. But before we get into that, here is a little motivation and theoretical introduction:
Q-learning was used by Google's DeepMind to develop the so-called "deep Q-networks". Using only pixel data and the game score as inputs, this algorithm learned 49 Atari 2600 games and proved to be as successful as expert humans. A paper about this work was published in Nature in February 2015 (linked here: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html - look at the videos linked on the website!)
More specifically, Q-learning is used to find an optimal action-selection policy for a given finite Markov Decision Process (MDP). We shall now see what a Markov Decision Process is.
A Markov Decision Process model contains:
- A set of world states S
- A set of possible actions A
- A real-valued reward function R(s, a), where $s \in S$ and $a \in A$
- A description T of each action's effects in each state
This decision process satisfies the Markov property, which means that the effects of an action taken in a state depend only on that state, and not on any prior state or action in the process. An MDP is a "discrete time stochastic control process", which means that we specify a probability distribution over all the states reachable from a given state by performing a specific action. More precisely, $T: S \times A \rightarrow \mathrm{Prob}(S)$, where $T(s, a)$ represents the distribution $P(s' \mid s, a)$.
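As a small illustration (a hypothetical two-state MDP, not the room example used later), such a transition function could be stored in Python as a dictionary mapping a (state, action) pair to a distribution over next states:

# Hypothetical 2-state MDP: T[(state, action)] gives P(next_state | state, action).
T = {
    ('s0', 'stay'): {'s0': 1.0},
    ('s0', 'move'): {'s0': 0.1, 's1': 0.9},  # the move fails 10% of the time
    ('s1', 'stay'): {'s1': 1.0},
    ('s1', 'move'): {'s0': 0.9, 's1': 0.1},
}
# Each inner dictionary sums to 1, i.e. it is a probability distribution over S.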
A policy $\pi$ is a mapping from states to actions, $\pi: S \rightarrow A$.
Following a policy is what you think it means: given a state $s$, select and execute the action $\pi(s)$.
Which policy is the best policy?
We can evaluate a policy by finding the expected value of the total reward it earns. To compare different policies whose raw reward sums are infinite, we use an objective function (a reward function, which is essentially a negative loss function) that maps an infinite sequence of rewards to a single real value. We can create this function by discounting, to favour "earlier" rewards. More formally, for a given discount rate $\gamma \in [0, 1)$, we map the reward sequence $r_0, r_1, r_2, \ldots$ to the single value $\sum_{t=0}^{\infty} \gamma^t r_t$.
Notice that an infinite sum of bounded rewards, discounted this way, converges to a finite value.
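For a concrete (made-up) example of this discounting, take $\gamma = 0.8$ and the reward sequence $0, 0, 100, 0, 0, \ldots$:

$$\sum_{t=0}^{\infty} \gamma^t r_t = 0 + 0.8 \cdot 0 + 0.8^2 \cdot 100 = 64,$$

so a reward received two steps in the future is worth only 64 now, which is exactly the "favour earlier rewards" behaviour we wanted.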
Before we proceed with our analysis, we define a value function (objective function) $V^{\pi}(s)$: the expected discounted reward obtained by starting in state $s$ and following policy $\pi$,
$$V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right].$$
Thus this value function maps every policy to a real "utility" value, with which we can order the policies, at least partially. The ordering is only partial because different policies can result in the same real value. This model ensures that AT LEAST one optimal policy exists: the one with the maximum expected reward. Also notice that every optimal policy has the same value function, say V*.
We can mathematically define this optimal value function as:
$$V^*(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \right]$$
Naturally, there is only one value $V^*(s)$ for every $s \in S$. These equations are known as the Bellman equations.
The Q-learning algorithm is just an iterative value-update form of this value function, written in terms of state-action pairs: $Q(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q(s', a')$, where $s'$ is the state reached by taking action $a$ in state $s$.
The algorithm will be simplified and further explained in the example prose below.
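As a single worked update step (numbers chosen only for illustration): if $R(s, a) = 0$, $\gamma = 0.8$, and the best Q-value currently stored at the next state $s'$ is $100$, then the update sets

$$Q(s, a) \leftarrow 0 + 0.8 \cdot 100 = 80,$$

so reward gradually "flows backwards" from rewarding transitions to the states that lead to them.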
In this tutorial I will illustrate how this algorithm, i.e., finding an optimal action-selection policy for a finite MDP, works to find the best path from one vertex in a graph to another. The problem formulation and images below are taken from here. To solidify the problem, consider a building with 5 rooms, with some doors that connect the rooms, while some of them open up to the outside of the building. Our goal in this problem is to find a way out of the building, given that we are placed in some room. Specifically, consider this set up:
Clearly, this can be represented as a graph with each room as a vertex and each door as a bi-directional edge. And so our problem is to find a graph path to the goal state, which is represented by vertex 5 in the following graph representation of the building:
Now, notice that the set of world states, S, is the set of vertices, which represent the rooms (plus the outside, state 5). The set of possible actions, A, is represented by the outgoing edges from a vertex, which correspond to travelling from one room to another through the door connecting the two. And so, next, we must assign a reward function R(s, a). We set R(s, a) = 100 for all those actions which take s to the goal state 5, and 0 otherwise. These rewards become the edge weights, as represented by the following graph:
Notice that a graph of this sort can be represented by a 6x6 reward matrix in which the row index, i, represents the current state, the column index, j, the state reached by going through the door connecting room i to room j, and the value at (i, j) the reward. Since we know all the states a priori, we can initialize this 6x6 matrix up front; but even if we didn't know all the states, we could program the code to add another column whenever it discovers a new state. Code of that sort can handle a large number of unknown states and is arguably more data-driven. We represent the impossible actions (non-existent edges), like going directly from room 2 to room 4, by 'None' values in the matrix.
import random
# This is the reward/environment matrix, representing the graph
# of states to states transitions, with the associated
# reward for the given transition (action). A row represents
# the current state, and the columns in that row represent the
# possible states you can reach and their associated reward.
r = [[None, None, None, None,    0, None],
     [None, None, None,    0, None,  100],
     [None, None, None,    0, None, None],
     [None,    0,    0, None,    0, None],
     [   0, None, None,    0, None,  100],
     [None,    0, None, None,    0,  100]]
#We can also initialize the goal state here:
goal_state = 5
Next, we define the Q-matrix, which will hold the reward value at position (i, j) for the corresponding action, as calculated by our algorithm. Every iteration of the Q-learning algorithm is referred to as an 'episode', which begins with selecting a random state and ends the moment our graph traversal reaches the goal state. In simpler terms, our algorithm is as follows:
- Set the gamma parameter, and environment rewards in matrix R.
- Initialize matrix Q to zero.
- For each episode:
  - Select a random initial state.
  - Do while the goal state hasn't been reached:
    - A. Select one among all possible actions for the current state.
    - B. Using this possible action, consider going to the next state.
    - C. Get the maximum Q value for this next state based on all possible actions.
    - D. Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
    - E. Set the next state as the current state.
  - End Do
- End For
And thus this Q-matrix will represent the learned policy: from any state, act by choosing the action with the highest Q-value.
# INITIALIZE THE Q-MATRIX
q = None

def init_q():
    # Assumes r is a square matrix
    l = len(r)
    # Matrix to be initialized
    global q
    q = []
    # Create a matrix the same size as r,
    # initialized with all 0's
    for row in xrange(l):
        q_inner = []
        for col in xrange(l):
            append_value = 0.0
            q_inner.append(append_value)
        q.append(q_inner)
    result = q
    return result
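As an aside, a more compact way to build the same zero matrix is a list comprehension; this one-liner is equivalent to init_q above and is shown only for comparison (it is not used by the rest of the code):

# Equivalent one-liner: a len(r) x len(r) matrix of 0.0's.
q = [[0.0 for col in xrange(len(r))] for row in xrange(len(r))]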
We design our algorithm with a top-down approach, but write the code bottom-up.
Here we define a function possible_actions that takes in the current state and returns a list of (reachable state, reward for taking that path) pairs.
def possible_actions(current_state):
    # Grab the possible actions for the current state
    row = r[current_state]
    # Includes the possible actions, with their rewards.
    current_actions = []
    next_state = 0
    # Remove the transitions that are not possible,
    # and add in the (next_state, reward) pair.
    for reward in row:
        condition = not(reward == None)
        if (condition):
            append_value = (next_state, reward)
            current_actions.append(append_value)
        next_state += 1
    result = current_actions
    return result
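For example, with the reward matrix r defined above, room 2 only opens into room 3, while room 3 connects to rooms 1, 2 and 4, so we expect:

print(possible_actions(2))   # [(3, 0)]
print(possible_actions(3))   # [(1, 0), (2, 0), (4, 0)]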
Next, we define choose_action, which selects the next action from all possible actions for a given state. The selection is either optimal (if e = 0) or possibly suboptimal (if e > 0), because e controls how often we explore. The function takes in current_state, current_actions (returned from possible_actions), the exploring flag, and the parameters e and g.
- e, or epsilon, is used in the action-selection policy in such a way that it represents the percentage of times we wish to choose an action randomly, i.e., we want the algorithm to 'explore' more states and not follow a deterministic 'optimal' path. In other words, e represents the probability of selecting a random action at a given state, versus selecting the most optimal action. Hence if e is closer to 1, a more random action-selection policy is used, and the agent is more of an 'explorer' than an 'exploiter'. If closer to 0, it is the opposite.
- g, as discussed in the theory, is the gamma (discount) constant. If g is closer to 1, the algorithm is more likely to optimize for future rewards over immediate rewards. If closer to 0, it weights immediate rewards more heavily (looking only at the current reward when g = 0).
The function returns a (next_state, reward) tuple.
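As a side note, the "random with probability e, greedy otherwise" choice could also be made directly with random.random(); the hypothetical helper below is essentially equivalent to the explicit 100-element boolean list built inside choose_action, and is shown only for comparison (it relies on the random module imported earlier):

# With probability e take a random exploratory action, otherwise take the greedy one.
def epsilon_choice(optimal, current_actions, e):
    if random.random() < e:
        return random.choice(current_actions)
    return optimal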
def choose_action(current_state, current_actions, exploring, e, g):
    # Find the best action to take from the list.
    best_next_state = None
    best_reward = None
    best_update = None
    # Loop through to find the best action for the given state.
    for (next_state, reward) in current_actions:
        # Calculate the Q-value for this state and action pair.
        q_update = r[current_state][next_state] + g * max(q[next_state])
        # See if it is the most optimal so far.
        isNone = (best_update == None)
        betterUpdate = (q_update > best_update)
        condition = isNone or betterUpdate
        if (condition):
            best_update = q_update
            best_reward = reward
            best_next_state = next_state
    # Now we will choose either the optimal action, or
    # a random action, depending on the probability e.
    # e -> 1 means high probability of exploring and vice versa for e -> 0.
    optimal = (best_next_state, best_reward)
    # To do that, create a list of booleans that
    # determines whether or not to take the optimal,
    # or a random action. If a True is chosen,
    # take the optimal, else take a random action.
    number_random = int(e * 100)
    choice_list = []
    for i in xrange(number_random):
        append_value = False
        choice_list.append(append_value)
    for i in xrange(100 - number_random):
        append_value = True
        choice_list.append(append_value)
    # Select a choice randomly, with probability e that a random
    # selection will be made.
    index = random.randint(0, 99)
    choice = choice_list[index]
    if (choice or not exploring):
        result = optimal
        return result
    else:
        random_choice = random.randint(0, len(current_actions) - 1)
        result = current_actions[random_choice]
        return result
run_episodes is the top-level function of the algorithm; it builds the final Q-matrix. It takes in the parameters episodes, e, and g.
Recall that episodes is the number of runs of our algorithm used to train the Q-matrix from reward feedback. An episode begins at a random state and ends when current_state = goal_state.
The function returns an optimized Q-matrix, which represents an action policy.
def run_episodes(episodes, e, g):
    # Initialize q to all 0's
    init_q()
    assert(q != None)
    # Useful variables
    l = len(r)
    # Run through the episodes
    for episode in xrange(episodes):
        # Select a random initial state
        current_state = random.randint(0, l - 1)
        # Flag: set to False once the goal state has been reached
        exploring = True
        while (exploring):
            # Check if this is the last state
            if (current_state == goal_state):
                exploring = False
            # Grab the possible transition pairs that
            # contain (state, reward).
            current_actions = possible_actions(current_state)
            # This function selects an action from the current_actions
            # list and takes it based upon the selection policy. If exploring
            # is false, then automatically take the best action because we are
            # in the goal_state.
            next_state, reward = choose_action(current_state, current_actions, exploring, e, g)
            # Modify the q matrix based on the selected action.
            q_update = reward + g * max(q[next_state])
            q[current_state][next_state] = q_update
            # Take the action, and set the current_state to the new state.
            current_state = next_state
    return q
def run_session():
    g = 0.8
    e = 0.9
    print "g = ", g
    print "e = ", e
    print "goal_state = ", goal_state
    episodes_list = [1, 2, 5, 100, 1000, 10000]
    # Run the training.
    for episodes in episodes_list:
        q = run_episodes(episodes, e, g)
        print "Q-matrix after ", episodes, " episodes"
        print str(q)

run_session()
g = 0.8
e = 0.9
goal_state = 5
Q-matrix after 1 episodes
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0]]
Q-matrix after 2 episodes
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 180.0], [0.0, 0.0, 0.0, 64.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 80.0, 0.0], [0.0, 0.0, 0.0, 64.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 180.0]]
Q-matrix after 5 episodes
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 64.0, 0.0, 336.1600000000001], [0.0, 0.0, 0.0, 64.0, 0.0, 0.0], [0.0, 51.2, 0.0, 0.0, 144.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 244.0], [0.0, 0.0, 0.0, 0.0, 0.0, 336.1600000000001]]
Q-matrix after 100 episodes
[[0.0, 0.0, 0.0, 0.0, 399.99999987268535, 0.0], [0.0, 0.0, 0.0, 319.9999998010708, 0.0, 499.99999987268535], [0.0, 0.0, 0.0, 319.99999968917314, 0.0, 0.0], [0.0, 399.9999998408567, 255.99999939291612, 0.0, 399.99999905143136, 0.0], [319.9999998408567, 0.0, 0.0, 319.9999998010708, 0.0, 499.9999998981483], [0.0, 0.0, 0.0, 0.0, 0.0, 499.9999998981483]]
Q-matrix after 1000 episodes
[[0.0, 0.0, 0.0, 0.0, 400.0, 0.0], [0.0, 0.0, 0.0, 320.0, 0.0, 500.0], [0.0, 0.0, 0.0, 320.0, 0.0, 0.0], [0.0, 400.0, 256.0, 0.0, 400.0, 0.0], [320.0, 0.0, 0.0, 320.0, 0.0, 500.0], [0.0, 0.0, 0.0, 0.0, 0.0, 500.0]]
Q-matrix after 10000 episodes
[[0.0, 0.0, 0.0, 0.0, 400.0, 0.0], [0.0, 0.0, 0.0, 320.0, 0.0, 500.0], [0.0, 0.0, 0.0, 320.0, 0.0, 0.0], [0.0, 400.0, 256.0, 0.0, 400.0, 0.0], [320.0, 0.0, 0.0, 320.0, 0.0, 500.0], [0.0, 0.0, 0.0, 0.0, 0.0, 500.0]]
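It is worth checking by hand where these converged numbers come from, using the update rule $Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')$ with $\gamma = 0.8$:

$$Q(5,5) = 100 + 0.8 \, Q(5,5) \;\Rightarrow\; Q(5,5) = 500$$
$$Q(1,5) = 100 + 0.8 \cdot 500 = 500, \qquad Q(0,4) = 0 + 0.8 \cdot 500 = 400$$
$$Q(1,3) = 0 + 0.8 \cdot 400 = 320, \qquad Q(3,2) = 0 + 0.8 \cdot 320 = 256$$

These agree with the matrix above.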
As you can see, the values of the Q-matrix converge as the number of episodes increases: the matrices after 1000 and 10000 episodes are identical, and that is the Q-matrix optimized for g = 0.8. We can normalize its values by the highest reward (500) so that they fall between 0 and 100. Finally, to illustrate the shortest path from room 2 to 5, we repeatedly follow the largest Q-value from the current state:
This represents the path (2,3,1,5) or (2,3,4,5), which are equivalent and correct (look at the pictures of the room at the top to verify).
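To make that path extraction concrete, here is a small helper that was not part of the original code; it assumes the Q-matrix from the g = 0.8 session is still held in the global q, and simply follows the largest Q-value from each state until the goal is reached:

def greedy_path(start_state):
    # Follow the greedy policy encoded in the Q-matrix.
    path = [start_state]
    current_state = start_state
    while current_state != goal_state:
        row = q[current_state]
        current_state = max(xrange(len(row)), key=lambda j: row[j])
        path.append(current_state)
    return path

print(greedy_path(2))   # e.g. [2, 3, 1, 5]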
# Another run with g = 0, to show that the algorithm then only considers the
# immediate reward, so no values propagate back from the goal and no real
# Q-learning happens.
def run_session2():
    g = 0.0
    e = 0.9
    print "g = ", g
    print "e = ", e
    print "goal_state = ", goal_state
    episodes_list = [1, 2, 5, 100, 1000, 10000]
    # Run the training.
    for episodes in episodes_list:
        q = run_episodes(episodes, e, g)
        print "Q-matrix after ", episodes, " episodes"
        print str(q)
run_session2()
g = 0.0
e = 0.9
goal_state = 5
Q-matrix after 1 episodes
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0]]
Q-matrix after 2 episodes
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0]]
Q-matrix after 5 episodes
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0]]
Q-matrix after 100 episodes
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0]]
Q-matrix after 1000 episodes
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0]]
Q-matrix after 10000 episodes
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0], [0.0, 0.0, 0.0, 0.0, 0.0, 100.0]]
The matrix representation of a graph is very memory-inefficient, because we store entries (None's in r, 0's in q) for all the impossible actions. We could instead map only the existent actions to their Q-values using a dictionary.
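A rough sketch of that dictionary-based representation, written out by hand for the same building (equivalent to the matrix r above):

# Map each state to a dict of {reachable_state: reward}; impossible actions
# simply do not appear.
r_dict = {
    0: {4: 0},
    1: {3: 0, 5: 100},
    2: {3: 0},
    3: {1: 0, 2: 0, 4: 0},
    4: {0: 0, 3: 0, 5: 100},
    5: {1: 0, 4: 0, 5: 100},
}
# The Q-values can be stored the same way, and new states can be added
# lazily as they are discovered.
q_dict = {s: dict.fromkeys(actions, 0.0) for s, actions in r_dict.items()}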
Detailed paper about general Q-learning techniques: http://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf
Paper about choosing the correct 'e' parameter using Bayesian Methods in order to balance 'exploration' vs. 'exploitation': http://www.cs.huji.ac.il/~nir/Papers/DFR1.pdf
https://en.wikipedia.org/wiki/Q-learning
http://reinforcementlearning.ai-depot.com/