Reinforcement Learning (RL)

(under construction)

  1. Introduction: An introduction to Reinforcement Learning and its essential elements (see Figure 1).

Figure 1. Reinforcement Learning: Essential elements.
  2. Markov Decision Process: A Markov decision process (MDP) is a mathematical framework for modeling decision-making. An agent (e.g., a robot) interacts with an environment by taking actions, transitioning between states, and receiving rewards. Here, we review the concepts and formulae of MDPs and recall their Markov property. Finally, we give a Python example for a simple grid world (a toy sketch follows Figure 2).

Figure 2. The agent at state $s_t$ takes action $a_t$; the environment then transitions to state $s_{t+1}$ and gives reward $r_{t+1}$.
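
As a companion to the MDP item above, here is a minimal, hypothetical sketch of how a tiny deterministic grid world could be encoded in Python. The grid size, rewards, and function names below are illustrative choices for this README, not the repository's own code.

```python
# A minimal, hypothetical 2x2 deterministic grid world encoded as an MDP.
# States are (row, col) tuples; the episode ends at the goal state (1, 1).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
N = 2                      # grid is N x N
GOAL = (1, 1)              # terminal state

def step(state, action):
    """Return (next_state, reward) for a deterministic transition."""
    if state == GOAL:                       # terminal: no further transitions
        return state, 0.0
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < N and 0 <= c < N):     # bumping into a wall keeps the agent in place
        r, c = state
    next_state = (r, c)
    reward = 1.0 if next_state == GOAL else -0.1   # small step cost, +1 at the goal
    return next_state, reward

# One interaction: s_t, a_t -> s_{t+1}, r_{t+1}  (cf. Figure 2)
print(step((0, 0), "right"))   # ((0, 1), -0.1)
```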
  3. Returns, policy, and value functions: The return is the discounted sum of rewards that an agent receives over time. The policy is the strategy the agent uses to select its action at each state. The value functions are expected returns. We review these technical terms and give an example in Python (a small return-computation sketch follows this list).
  4. Bellman equations, Bellman optimality equations, and optimal policy: From the Bellman equations, we derive the Bellman optimality equations. With the Bellman optimality equations, we can estimate the optimal value functions, and the optimal policy can then be obtained from the optimal value functions. This is one way to solve an RL problem. A Python code is available here that uses the Bellman optimality equation of the state-value function (the equation is restated after this list); note that the code is a bit advanced at this stage.
  5. Value Iteration: Value iteration is a model-based algorithm that iteratively improves the value function until it converges to the optimal value function, from which the optimal policy can be derived. It is based on the Bellman optimality equation. Here, we implement value iteration in Python for a 3-by-3 grid world (a toy sketch on a smaller MDP follows this list).
  6. Policy Iteration: Policy iteration is another model-based algorithm for solving a Reinforcement Learning problem. It alternates between two steps: policy evaluation and policy improvement. Here, we apply policy iteration to the same grid world used for value iteration; all code is available in a notebook file (a toy sketch follows this list).
  7. Multi-Armed Bandit (MAB) and ε-Greedy: In the multi-armed bandit (MAB) problem, each machine provides a random reward drawn from a probability distribution specific to that machine and unknown to the agent. The objective of the agent (decision-maker) is to maximize the sum of rewards earned over a sequence of arm pulls. The agent has to try the arms to acquire knowledge (exploration) while also optimizing its decisions based on the knowledge gained so far (exploitation); balancing this trade-off is the core difficulty. The ε-greedy action selection is a mechanism for balancing exploration and exploitation. Here, we simulate the MAB and apply ε-greedy to it (a sketch follows this list).
  8. Temporal Difference (TD) Learning and SARSA: Temporal difference (TD) learning combines ideas from Monte Carlo methods and dynamic programming. TD-based methods are model-free bootstrapping methods: they learn from raw experience with the environment without knowing its dynamics. Here, we introduce TD(0) and SARSA, two TD-based methods. TD(0) is used for policy evaluation, whereas SARSA learns the optimal policy. In the notebook file provided here, we simulate the GridWorld environment and learn the optimal policy with SARSA (a sketch follows this list).
  9. Q-learning: Q-learning is another TD-based method in reinforcement learning, so it is also a model-free bootstrapping method. However, unlike SARSA, which is an on-policy method, Q-learning is an off-policy method. We provide the Python code for Q-learning and apply it to the same environment introduced for SARSA (the change in the update target is sketched after this list).
  10. SARSA with RBF networks: So far, we have used tables to represent the value functions $v(s)$ or $q(s,a)$. Now we use an RBF network to represent $q(s,a)$. Tabular methods are only suitable for small environments, whereas RBF networks can handle small to medium-sized environments. Here, we implement SARSA with an RBF network whose weights are updated by SGD (stochastic gradient descent), and again test SARSA+RBFN on the GridWorld environment (a sketch of the RBF features and update follows this list).
  11. Deep Q-learning: Deep Q-learning is Q-learning that uses an MLP to approximate the Q-values. In practice, we use two MLPs: one called the policy network and the other called the target network. Every few episodes, we copy the parameters of the policy network to the target network. Here, we provide a notebook file that implements deep Q-learning in Python with the help of PyTorch; the environment is the same GridWorld employed in the previous post (a sketch of the two-network update follows this list).
  12. Monte Carlo (MC) methods: Monte Carlo methods are model-free methods in Reinforcement Learning that do not use bootstrapping. Here, we implement on-policy MC control. On-policy means that the behavior policy and the target policy are the same; control means that both policy evaluation and policy improvement are performed. The MC control is tested on the same GridWorld environment. Note that two variants are available: first-visit and every-visit (a first-visit sketch follows this list).
  13. REINFORCE: The REINFORCE method is a policy-based Reinforcement Learning (RL) algorithm in which we directly optimize the policy by gradient ascent on a performance measure. We use the Monte Carlo approach to compute the actual returns for each episode; these returns scale the performance measure for each state and action. To model the policy, we may use a multi-layer perceptron (MLP) with a softmax at its last layer, called the policy network. After training the policy network, we can use greedy action selection over it to find the best action for each state. As before, we implement the algorithm for the GridWorld (a sketch follows this list).
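
The sketches below are small, self-contained illustrations of the items above; all function names, environments, and parameter values in them are hypothetical choices made for this README, not the repository's notebooks. First, for item 3, a minimal way to compute the discounted return $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$ from a list of rewards:

```python
# Discounted return computed backwards: G_t = r_{t+1} + gamma * G_{t+1}
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([-0.1, -0.1, 1.0]))   # -0.1 + 0.9*(-0.1) + 0.81*1.0 = 0.62
```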
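
For item 4, the Bellman optimality equation of the state-value function, written in its standard textbook form (with transition probabilities $p(s' \mid s, a)$ and discount factor $\gamma$):

$$v_*(s) = \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, v_*(s') \,\bigr]$$

The optimal policy then selects, in each state, an action that attains this maximum.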
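
For item 5, a toy value-iteration sketch on a hypothetical 3-state chain MDP (not the repository's 3-by-3 grid world), where `P[s][a]` lists `(probability, next_state, reward)` triples:

```python
# Value iteration on a hypothetical 3-state chain MDP (states 0, 1, 2; state 2 terminal).
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},   # action 0: stay, action 1: move right
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, 10.0)]},
    2: {0: [(1.0, 2, 0.0)],  1: [(1.0, 2, 0.0)]},    # terminal self-loop
}
gamma, theta = 0.9, 1e-6
V = {s: 0.0 for s in P}

while True:                                          # Bellman optimality backups until convergence
    delta = 0.0
    for s in P:
        v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

# Greedy policy extracted from the (near-)optimal values
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
          for s in P}
print(V, policy)
```

Each sweep applies the Bellman optimality backup to every state until the largest change falls below `theta`.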
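
For item 6, policy iteration on the same hypothetical chain MDP, alternating policy evaluation with greedy policy improvement until the policy stops changing:

```python
# Policy iteration on the same hypothetical 3-state chain MDP used in the
# value-iteration sketch above (P[s][a] = list of (prob, next_state, reward)).
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, 10.0)]},
    2: {0: [(1.0, 2, 0.0)],  1: [(1.0, 2, 0.0)]},
}
gamma, theta = 0.9, 1e-6
policy = {s: 0 for s in P}            # start from an arbitrary policy
V = {s: 0.0 for s in P}

stable = False
while not stable:
    # 1) Policy evaluation: iterate the Bellman expectation equation for the current policy
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # 2) Policy improvement: act greedily with respect to the evaluated values
    stable = True
    for s in P:
        best_a = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        if best_a != policy[s]:
            policy[s], stable = best_a, False

print(policy, V)
```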
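
For item 7, a minimal ε-greedy simulation of a hypothetical 4-armed bandit with Gaussian rewards; the arm means below are made up:

```python
import random

# Hypothetical 4-armed bandit: each arm pays a Gaussian reward with an unknown mean.
true_means = [0.2, 0.5, 0.8, 0.4]
epsilon, n_steps = 0.1, 5000

Q = [0.0] * len(true_means)     # estimated value of each arm (sample averages)
N = [0] * len(true_means)       # how many times each arm has been pulled

total = 0.0
for _ in range(n_steps):
    if random.random() < epsilon:                 # explore: random arm
        a = random.randrange(len(Q))
    else:                                         # exploit: greedy arm
        a = max(range(len(Q)), key=lambda i: Q[i])
    reward = random.gauss(true_means[a], 1.0)     # pull the arm
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]                # incremental sample-average update
    total += reward

print("estimated values:", [round(q, 2) for q in Q])
print("average reward:", total / n_steps)
```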
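
For item 8, a tabular SARSA sketch on a hypothetical 3-by-3 deterministic grid world (a stand-in for the repository's GridWorld, whose exact layout and rewards may differ):

```python
import random

# Tabular SARSA on a hypothetical 3x3 deterministic grid world with the goal at (2, 2).
N, GOAL = 3, (2, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]              # up, down, left, right

def step(s, a):
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    s2 = (r, c) if 0 <= r < N and 0 <= c < N else s       # walls keep the agent in place
    return s2, (1.0 if s2 == GOAL else -0.1), s2 == GOAL  # (next state, reward, done)

def eps_greedy(Q, s, eps=0.1):
    if random.random() < eps:
        return random.randrange(len(ACTIONS))             # explore
    return max(range(len(ACTIONS)), key=lambda a: Q[(s, a)])   # exploit

alpha, gamma = 0.5, 0.9
Q = {((i, j), a): 0.0 for i in range(N) for j in range(N) for a in range(len(ACTIONS))}

for episode in range(500):
    s = (0, 0)
    a = eps_greedy(Q, s)
    done = False
    while not done:
        s2, reward, done = step(s, a)
        a2 = eps_greedy(Q, s2)
        # On-policy update: bootstrap from the action a2 the agent will actually take next
        target = reward + (0.0 if done else gamma * Q[(s2, a2)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2

states = [(i, j) for i in range(N) for j in range(N)]
print({s: max(range(len(ACTIONS)), key=lambda a: Q[(s, a)]) for s in states})
```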
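
For item 9, the only change relative to the SARSA sketch is the update target, which bootstraps from the greedy next action rather than the action actually taken; a minimal illustration:

```python
# Q-learning update: the target uses max_b Q(s', b) (off-policy), not the next action
# chosen by the behavior policy as in SARSA.
def q_learning_update(Q, s, a, reward, s2, done, n_actions, alpha=0.5, gamma=0.9):
    best_next = max(Q[(s2, b)] for b in range(n_actions))   # greedy bootstrap
    target = reward + (0.0 if done else gamma * best_next)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Toy two-state, two-action table, just to show the update in isolation.
Q = {(s, a): 0.0 for s in (0, 1) for a in range(2)}
q_learning_update(Q, s=0, a=1, reward=1.0, s2=1, done=False, n_actions=2)
print(Q[(0, 1)])   # 0.5 * (1.0 + 0.9 * 0 - 0.0) = 0.5
```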
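
For item 10, a sketch of Gaussian RBF features and the semi-gradient SARSA update for a linear-in-features $q(s,a)$; the centers, width, and step sizes here are arbitrary illustrative values:

```python
import numpy as np

# Hypothetical RBF approximation of q(s, a) for a 2-D state space: one weight
# vector per action, Gaussian bumps placed on a fixed grid of centers.
centers = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)  # 9 RBF centers
sigma, n_actions, alpha, gamma = 1.0, 4, 0.1, 0.9
W = np.zeros((n_actions, len(centers)))          # one linear weight vector per action

def features(state):
    d2 = np.sum((centers - np.asarray(state, dtype=float)) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))        # Gaussian RBF activations

def q_value(state, action):
    return W[action] @ features(state)

def sarsa_sgd_update(s, a, reward, s2, a2, done):
    # Semi-gradient SARSA: the gradient of q(s,a) = w_a . phi(s) w.r.t. w_a is phi(s)
    target = reward + (0.0 if done else gamma * q_value(s2, a2))
    W[a] += alpha * (target - q_value(s, a)) * features(s)

sarsa_sgd_update((0, 0), 3, -0.1, (0, 1), 3, False)
print(round(q_value((0, 0), 3), 4))
```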
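
For item 11, a PyTorch sketch of the two-network setup: the policy network is trained on TD targets computed with a frozen target network, which is periodically synchronized. The network sizes and the dummy batch are hypothetical:

```python
import torch
import torch.nn as nn

# A minimal sketch of deep Q-learning's two networks, assuming a state encoded
# as a 2-D float vector and 4 discrete actions (hypothetical sizes).
n_state, n_actions, gamma = 2, 4, 0.9

def make_mlp():
    return nn.Sequential(nn.Linear(n_state, 32), nn.ReLU(), nn.Linear(32, n_actions))

policy_net = make_mlp()                      # trained every step
target_net = make_mlp()                      # frozen copy, used to compute TD targets
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on a batch of transitions (tensors)."""
    q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():                                               # targets use the target network
        q_next = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * q_next * (1 - dones)
    loss = nn.functional.mse_loss(q, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every few episodes, sync the target network with the policy network:
# target_net.load_state_dict(policy_net.state_dict())

# Dummy batch of 5 transitions, just to show the expected tensor shapes.
loss = dqn_update(torch.randn(5, n_state), torch.randint(0, n_actions, (5,)),
                  torch.randn(5), torch.randn(5, n_state), torch.zeros(5))
print(loss)
```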
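
For item 12, a first-visit on-policy Monte Carlo control sketch on the same hypothetical grid world used in the SARSA sketch:

```python
import random
from collections import defaultdict

# First-visit on-policy Monte Carlo control on a hypothetical 3x3 grid world.
N, GOAL, gamma, epsilon = 3, (2, 2), 0.9, 0.1
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    s2 = (r, c) if 0 <= r < N and 0 <= c < N else s
    return s2, (1.0 if s2 == GOAL else -0.1), s2 == GOAL

Q, counts = defaultdict(float), defaultdict(int)

def policy(s):                                   # behavior = target policy (epsilon-greedy): on-policy
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[(s, a)])

for episode in range(1000):
    s, done, trajectory = (0, 0), False, []
    while not done:                              # generate one full episode with the current policy
        a = policy(s)
        s2, reward, done = step(s, a)
        trajectory.append((s, a, reward))
        s = s2
    first_visit = {}
    for t, (s, a, _) in enumerate(trajectory):
        first_visit.setdefault((s, a), t)        # time of the first visit of each (s, a)
    g = 0.0
    for t in reversed(range(len(trajectory))):   # backward pass: accumulate returns
        s, a, reward = trajectory[t]
        g = reward + gamma * g
        if first_visit[(s, a)] == t:             # first-visit variant: update only once per episode
            counts[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]

print(max(range(len(ACTIONS)), key=lambda a: Q[((0, 0), a)]))   # greedy action at the start state
```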
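
For item 13, a REINFORCE sketch: a softmax policy network trained by gradient ascent on $\sum_t G_t \log \pi(a_t \mid s_t)$, shown here on a dummy three-step episode with hypothetical sizes:

```python
import torch
import torch.nn as nn

# Softmax policy network over 4 actions for a 2-D state vector (hypothetical sizes).
n_state, n_actions, gamma = 2, 4, 0.9

policy_net = nn.Sequential(nn.Linear(n_state, 32), nn.ReLU(),
                           nn.Linear(32, n_actions), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards):
    """One policy-gradient step from a single episode (plain Python lists)."""
    # Monte Carlo returns G_t computed backwards through the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions)
    probs = policy_net(states)                                  # pi(a | s) for all actions
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    loss = -(returns * log_probs).sum()                         # minimizing this = gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy 3-step episode, just to show the expected inputs.
print(reinforce_update(states=[[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                       actions=[3, 1, 3], rewards=[-0.1, -0.1, 1.0]))
```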