# Table Based Q Learning In Under 1KB
Q-learning is an algorithm in which an agent interacts with its environment and collects rewards for taking desirable actions.
The simplest implementation of Q-learning is referred to as tabular or table-based Q-learning. There are tons of articles, tutorials, etc. already available on the web which describe Q-learning, so I won't go into excruciating detail here. Instead, I want to show how efficiently table-based Q-learning can be done using tinymind. In this article, I will describe how tinymind implements Q-learning using C++ templates and fixed-point (Q-format) numbers, as well as go through the example in the repo.
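For reference, the update at the heart of table-based Q-learning is the standard rule, where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the reward received, and $s'$ is the state reached by taking action $a$ in state $s$; tinymind evaluates this with Q-format fixed-point arithmetic rather than floating point:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$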
A common table-based Q-learning problem is to train a virtual mouse to find its way out of a maze to get the cheese (reward). Tinymind contains an example program which demonstrates how the Q-learning template library works.
In the example program, we define the maze:
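(The actual definition lives in the example's source and isn't reproduced here. As an illustrative stand-in, the classic six-room maze can be encoded as a reward table like the sketch below; the room connectivity matches the moves seen in the example's output, but the array name and encoding are mine, not the repo's.)

```cpp
#include <cstdint>

// Illustrative only: a reward table for the six-room maze.
// Rooms 0..4 are inside the maze; state 5 is outside, where the cheese is.
// -1 => no door between the rooms, 0 => legal move, 100 => move reaches the goal.
static const int8_t mazeRewards[6][6] = {
    //  to:  0    1    2    3    4    5
    /*0*/ { -1,  -1,  -1,  -1,   0,  -1 },
    /*1*/ { -1,  -1,  -1,   0,  -1, 100 },
    /*2*/ { -1,  -1,  -1,   0,  -1,  -1 },
    /*3*/ { -1,   0,   0,  -1,   0,  -1 },
    /*4*/ {  0,  -1,  -1,   0,  -1, 100 },
    /*5*/ { -1,   0,  -1,  -1,   0, 100 },
};
```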
We define all of our types in a common header so that we can separate the maze learner code from the training and file management code. I have done this so that we can measure the amount of code and data required for the Q-learner alone. The common header defines the maze as well as the type required to hold states and actions:
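The repo's header isn't reproduced here, but a minimal sketch of the idea, with hypothetical names, looks like this:

```cpp
// Illustrative sketch of a "maze types" header; the names are hypothetical,
// not the ones used in the tinymind example.
#include <cstdint>

typedef uint8_t state_t;   // room number 0..5, where 5 is outside the maze
typedef uint8_t action_t;  // an action is the room we move to next

static const state_t  NUMBER_OF_STATES  = 6;
static const action_t NUMBER_OF_ACTIONS = 6;
static const state_t  GOAL_STATE        = 5; // the cheese is here
```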
We train the mouse by dropping it into a randomly selected room (or outside the maze, where the cheese is). The mouse starts off by taking a random action from the list of available actions at each step. The mouse receives a reward only when he finds the cheese (i.e. makes it to position 5, outside the maze). If the mouse is dropped into position 5, he has to learn to stay there and not wander back into the maze.
# Building The Example

Starting from cppnnml/examples/maze, I will create a directory to hold the executable file and build the example.
mkdir -p ~/maze
g++ -O3 -o ~/maze/maze maze.cpp mazelearner.cpp -I../../cpp
This builds the maze learner example program and places the executable file at ~/maze. We can now cd into the directory where the executable file was generated and run the example program.
cd ~/maze
./maze
When the program finishes running, you'll see the last of the output messages, something like this:
take action 5
*** starting in state 3 ***
take action 4
take action 5
*** starting in state 2 ***
take action 3
take action 2
take action 3
take action 4
take action 5
*** starting in state 3 ***
take action 4
take action 5
*** starting in state 5 ***
take action 5
Your messages may be slightly different since we're starting our mouse in a random room on every iteration. During example program execution, we save all mouse activity to files (maze_training.txt and maze_test.txt). Within the training file, you can see that the mouse takes purely random actions for the first 400 episodes; over the next 100 episodes the randomness is decreased from 100% random to 0% random. To see the first few training iterations you can do this:
head maze_training.txt
You should see something like this:
1,3,4,0,4,5,
4,5,
2,3,1,3,4,3,1,5,
5,5,
4,5,
1,5,
3,2,3,4,3,4,5,
0,4,0,4,0,4,0,4,5,
1,3,1,5,
5,4,0,4,3,1,3,1,5,
Again, your messages will look slightly different. The first number is the start state and every comma-separated value after that is the random movement of the mouse from room to room. Example: in the first line above we started in room 1, then moved to 3, then 4, then 0, then back to 4, then to 5. Since 5 is our goal state, we stopped. The reason this looks so erratic is that for the first 400 episodes of training we choose randomly from the possible actions at each step (a sketch of the exploration schedule follows below). Once we get to state 5, we get our reward and the episode ends.
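The exact decay used by the example's training code isn't shown here; a minimal sketch, assuming a simple linear schedule over episodes 400 to 499 (the function name and shape of the decay are my assumptions), could look like this:

```cpp
#include <cstdlib>

// Illustrative exploration schedule (the example's actual decay may differ).
// Episodes   0..399: always act randomly.
// Episodes 400..499: the chance of acting randomly falls linearly from 100% toward 0%.
// Episodes    500+ : always act greedily.
static bool shouldActRandomly(unsigned episode)
{
    unsigned randomPercent;

    if (episode < 400) {
        randomPercent = 100;
    } else if (episode < 500) {
        randomPercent = 500 - episode; // 100 at episode 400, down to 1 at episode 499
    } else {
        randomPercent = 0;
    }

    return static_cast<unsigned>(rand() % 100) < randomPercent;
}
```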
The QLearner class is defined in qlearn.hpp, which contains all of the components (e.g. environment, Q-table, reward policy) necessary for the Q-learning process. The following diagram shows how the defined classes relate to form a Q-learner.
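I won't reproduce qlearn.hpp here. Purely as an illustration of the same structure, and not tinymind's actual interface, a minimal table-based Q-learner that ties a Q-table to the update rule shown earlier might look like this, using floating-point values instead of tinymind's Q-format type:

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative table-based Q-learner (NOT tinymind's qlearn.hpp interface).
// It bundles the conceptual pieces the article mentions: a Q-table and the
// update policy; the environment supplies the rewards passed to update().
template<typename ValueType, std::size_t NumStates, std::size_t NumActions>
class SimpleQLearner
{
public:
    SimpleQLearner(ValueType learningRate, ValueType discountFactor)
        : alpha(learningRate), gamma(discountFactor)
    {
        for (std::size_t s = 0; s < NumStates; ++s)
            for (std::size_t a = 0; a < NumActions; ++a)
                qTable[s][a] = ValueType(0);
    }

    // Apply the update rule: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    void update(std::size_t state, std::size_t action, ValueType reward, std::size_t nextState)
    {
        ValueType bestNext = qTable[nextState][0];
        for (std::size_t a = 1; a < NumActions; ++a)
            bestNext = std::max(bestNext, qTable[nextState][a]);

        qTable[state][action] += alpha * (reward + gamma * bestNext - qTable[state][action]);
    }

    // Greedy action for a state, used once training has converged.
    std::size_t bestAction(std::size_t state) const
    {
        std::size_t best = 0;
        for (std::size_t a = 1; a < NumActions; ++a)
            if (qTable[state][a] > qTable[state][best])
                best = a;
        return best;
    }

private:
    ValueType qTable[NumStates][NumActions];
    ValueType alpha;
    ValueType gamma;
};
```

For the maze, this sketch would be instantiated with 6 states and 6 actions, e.g. `SimpleQLearner<float, 6, 6> learner(0.2f, 0.8f);` (the learning rate and discount factor here are arbitrary), whereas tinymind composes the environment, Q-table, and reward policy together as template parameters.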