
Commit aa4bd37

added first version of example blackjack, added online gridworld
1 parent fff6964 commit aa4bd37

6 files changed: +402 -36 lines changed


CMakeLists.txt

Lines changed: 3 additions & 2 deletions
@@ -27,8 +27,9 @@ include_directories(${SRC})
 
 # build examples
 set(EXAMPLES ${EXAMPLES} examples)
-add_executable(gridworld ${EXAMPLES}/gridworld.cpp)
-add_executable(blackjack ${EXAMPLES}/blackjack.cpp)
+add_executable(ex_gridworld_offline ${EXAMPLES}/gridworld_offline.cpp)
+add_executable(ex_gridworld_online ${EXAMPLES}/gridworld_online.cpp)
+add_executable(ex_blackjack ${EXAMPLES}/blackjack.cpp)
 
 # set output
 set(CMAKE_COLOR_MAKEFILE on)

README.md

Lines changed: 21 additions & 4 deletions
@@ -130,6 +130,9 @@ make
 
 There is a folder `examples` which I'm populating with examples, starting from your typical *gridworld* problem,
 and then moving on to a *blackjack* program.
+Currently there is a classical "Gridworld" example, with two versions:
+- an offline on-policy algorithm: `examples/gridworld_offline.cpp` built as `ex_gridworld_offline`
+- an online on-policy algorithm: `examples/gridworld_online.cpp` built as `ex_gridworld_online`
 
 ## Gridworld
 
@@ -140,19 +143,27 @@ which is surrounded by blocks into which he can't move (black colour).
 The agent starts at blue (x:1,y:8) and the target is the green (x:1,y:1).
 The red blocks are fire/danger/a negative reward, and there is a rudimentary maze.
 
-This example uses a staged (stochastic - offline) approach:
+There are two versions of the Gridworld. The offline approach:
 
 - first the agent randomly explores until it can find the positive reward (+1.0) grid block
 - then it updates its policies
 - finally it follows the best policy learnt
 
+And the online approach:
+
+- the agent randomly explores one episode
+- then it updates its policies
+- then it tries again, this time going after known policies
+- it only falls back to random when there does not exist a *positive* best action
+- the entire process is repeated until the goal is discovered.
+
 The actual gridworld is saved in a textfile `gridworld.txt` (feel free to change it).
 The example `src/gridworld.cpp` provides the minimal code to demonstrate this staged approach.
 
 Once we have loaded the world (using function `populate`) we set the start at x:1, y:8 and then
 begin the exploration.
 
-The exploration runs in an inifinite loop in `main` until one criterion is satisfied: the grid block with a **positive** reward is found.
+The exploration runs in an infinite loop until the grid block with a **positive** reward is found.
 Until that happens, the agent takes a *stochastic* (e.g., random) approach and searches the gridworld.
 The function:
 
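A minimal, library-agnostic sketch of the online loop listed in the hunk above, assuming a plain `std::map` Q-table and a 1-D corridor in place of relearn's own types and the 2-D grid: in every episode the agent follows the best known action only when its value is positive, falls back to a random move otherwise, and updates the table from the finished episode. The offline approach differs only in when the update happens: it keeps exploring at random until the goal is first found and only then learns from the collected episodes.

```cpp
// Illustrative sketch only: a hand-rolled Q-table, not the relearn API.
#include <algorithm>
#include <iostream>
#include <map>
#include <random>
#include <utility>
#include <vector>

int main()
{
    const int goal = 9;                      // rightmost cell carries the +1 reward
    const double gamma = 0.9;                // discount factor
    std::mt19937 prng{std::random_device{}()};
    std::uniform_int_distribution<int> coin(0, 1);
    std::map<std::pair<int,int>, double> q;  // (cell, action) -> value; 0 = left, 1 = right

    for (int episode = 0; episode < 100; episode++) {
        std::vector<std::pair<int,int>> visited;
        int s = 0;
        while (s != goal && visited.size() < 100) {
            // follow the best known action only if its value is positive,
            // otherwise fall back to a random move
            int a = (q[{s, 0}] > 0 || q[{s, 1}] > 0)
                    ? (q[{s, 1}] >= q[{s, 0}] ? 1 : 0)
                    : coin(prng);
            visited.push_back({s, a});
            s = std::clamp(s + (a == 1 ? 1 : -1), 0, goal);
        }
        // update after the episode: propagate the terminal reward
        // backwards through the visited (state, action) pairs
        double r = (s == goal) ? 1.0 : 0.0;
        for (auto it = visited.rbegin(); it != visited.rend(); ++it) {
            double & value = q[*it];
            value = std::max(value, r);
            r *= gamma;
        }
        if (s == goal) {
            std::cout << "episode " << episode << ": reached the goal in "
                      << visited.size() << " steps\n";
        }
    }
    return 0;
}
```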

@@ -205,11 +216,17 @@ and to __which__ state that action will lead to.
 
 A simplified attempt, where one player uses classic probabilities, the dealer (house) simply draws until 17,
 and the adaptive agent uses non-deterministic Q-learning in order to play as best as possible.
-This is **WORK IN PROGRESS**
+
+The `state` is very simple: a `hand` which is described by the value (min value and max value, depending on the cards held).
+The agent ignores the dealer's hand since that would increase the state space,
+as well as the label or symbol of the cards held (feel free to change this, simply adapt the "hash" function of `hand`).
+
+This example takes a lot of time to run, as the agent maps the transitional probabilities,
+using the observations from playing multiple games.
 
 ## TODO
 
-1. complete the blackjack example
+1. implement the `boost_serialization` with internal header
 2. do the R-Learning continuous algorithm
 
 [1]: Sutton, R.S. and Barto, A.G., 1998. Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT press
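As the hunk above describes, the adaptive player's state is nothing more than its own hand, reduced to a pair of values and hashed so it can serve as a state key. Below is a minimal, illustrative sketch of that idea; it is not the code from `examples/blackjack.cpp`, whose own `hand::hash()` appears in the diff that follows.

```cpp
// Illustrative sketch only: key a blackjack state by (min value, max value).
#include <cstddef>
#include <functional>
#include <iostream>

struct hand_state
{
    unsigned int min_value;  // counting every Ace as 1
    unsigned int max_value;  // counting an Ace as 11

    std::size_t hash() const
    {
        // boost-style hash combine of the two values; folding in more
        // information (e.g., the dealer's visible card) would enlarge
        // the state space the agent has to learn
        std::size_t seed = std::hash<unsigned int>{}(min_value);
        seed ^= std::hash<unsigned int>{}(max_value)
                + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

int main()
{
    hand_state s{16, 16};  // e.g., a ten and a six, no Ace held
    std::cout << "state hash: " << s.hash() << "\n";
    return 0;
}
```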

examples/blackjack.cpp

Lines changed: 57 additions & 23 deletions
@@ -37,7 +37,6 @@ struct card
     }
 };
 
-//
 // a 52 playing card constant vector with unicode symbols :-D
 const std::deque<card> cards {
     {"Ace", "", {1, 11}}, {"Ace", "", {1, 11}}, {"Ace", "", {1, 11}}, {"Ace", "", {1, 11}},
@@ -78,6 +77,7 @@ struct hand
         return result;
     }
 
+    // calculate value of hand - use min value (e.g., when hold an Ace)
     unsigned int min_value() const
     {
         unsigned int result = 0;
@@ -117,6 +117,7 @@ struct hand
                   cards.begin(), card_compare);
     }
 
+    // hash this hand for relearn
     std::size_t hash() const
     {
         std::size_t seed = 0;
@@ -215,20 +216,36 @@ struct house : public player
 struct client : public player
 {
     // decide on drawing or staying
-    bool draw()
+    bool draw(std::mt19937 & prng,
+              relearn::state<hand> s_t,
+              relearn::policy<relearn::state<hand>,
+                              relearn::action<bool>> & map)
     {
-        // `hand` is publicly inherited
-        // so we can use it to create a new state
-        // and then randomly decide an action (draw/stay)
-        // until we have a best action for a given state
-        return false;
+        auto a_t = map.best_action(s_t);
+        auto q_v = map.best_value(s_t);
+        std::uniform_real_distribution<float> dist(0, 1);
+        // there exists a "best action" and it is positive
+        if (a_t && q_v > 0) {
+            sum_q += q_v;
+            policy_actions++;
+            return a_t->trait();
+        }
+        // there does not exist a "best action"
+        else {
+            random_actions++;
+            return (dist(prng) > 0.5 ? true : false);
+        }
     }
 
     // return a state by casting self to base class
     relearn::state<hand> state() const
     {
         return relearn::state<hand>(*this);
     }
+
+    float random_actions = 0;
+    float policy_actions = 0;
+    float sum_q = 0;
 };
 
 //
@@ -248,61 +265,75 @@ int main(void)
     using state = relearn::state<hand>;
     using action = relearn::action<bool>;
     using link = relearn::link<state,action>;
+
     // policy memory
     relearn::policy<state,action> policies;
     std::deque<std::deque<link>> experience;
 
+    float sum = 0;
+    float wins = 0;
+    std::cout << "starting! Press CTRL-C to stop at any time!"
+              << std::endl;
 start:
-    // play 10 rounds
+    // play 10 rounds - then stop
    for (int i = 0; i < 10; i++) {
+        sum++;
         std::deque<link> episode;
-
         // one card to dealer/house
         dealer->reset_deck();
         dealer->insert(dealer->deal());
-
         // two cards to player
         agent->insert(dealer->deal());
         agent->insert(dealer->deal());
-
         // root state is starting hand
         auto s_t = agent->state();
 
 play:
-        // agent decides to draw
-        if (agent->draw()) {
+        // if agent's hand is burnt skip all else
+        if (agent->min_value() && agent->max_value() > 21) {
+            goto cmp;
+        }
+        // agent decides to draw
+        if (agent->draw(gen, s_t, policies)) {
             episode.push_back(link{s_t, action(true)});
             agent->insert(dealer->deal());
             s_t = agent->state();
             goto play;
         }
+        // agent decides to stay
         else {
             episode.push_back(link{s_t, action(false)});
         }
-
         // dealer's turn
         while (dealer->draw()) {
             dealer->insert(dealer->deal());
         }
 
-        std::cout << "\t\033[1;34m player's hand: ";
-        agent->print();
-        std::cout << "\033[0m";
-        std::cout << "\t\033[1;35m dealer's hand: ";
-        dealer->print();
-        std::cout << "\033[0m\n";
-
+cmp:
+        // compare hands, assign rewards!
         if (hand_compare(*agent, *dealer)) {
-            std::cout << "\033[1;32m player wins (•̀ᴗ•́)\033\[0m\r\n";
+            if (!episode.empty()) {
+                episode.back().state.set_reward(1);
+            }
+            wins++;
         }
         else {
-            std::cout << "\033[1;31m dealer wins (◕︵◕)\033\[0m\r\n";
+            if (!episode.empty()) {
+                episode.back().state.set_reward(-1);
+            }
         }
 
         // clear current hand for both players
         agent->clear();
         dealer->clear();
         experience.push_back(episode);
+        std::cout << "\twin ratio: " << wins / sum << std::endl;
+        std::cout << "\ton-policy ratio: "
+                  << agent->policy_actions / (agent->policy_actions + agent->random_actions)
+                  << std::endl;
+        std::cout << "\tavg Q-value: "
+                  << (agent->sum_q / agent->policy_actions)
+                  << std::endl;
     }
 
     // at this point, we have some playing experience, which we're going to use
@@ -313,6 +344,9 @@ int main(void)
             learner(episode, policies);
         }
     }
+    // clear experience - we'll add new ones!
+    experience.clear();
+    goto start;
 
     return 0;
 }

examples/gridworld.cpp renamed to examples/gridworld_offline.cpp

Lines changed: 9 additions & 7 deletions
@@ -22,6 +22,15 @@
  * This is a deterministic, finite Markov Decision Process (MDP)
  * and the goal is to find an agent policy that maximizes
  * the future discounted reward.
+ *
+ * This version of the Gridworld example uses off-line on-policy decision-making.
+ * What that means is that as the agent moves, it only explores the environment.
+ * It doesn't learn anything until it has finished exploring the environment
+ * and has discovered the "goal" state.
+ *
+ * Due to the nature of PRNG (pseudo-random number generator) this
+ * version can get stuck into repeating the same actions over and over again,
+ * therefore if it is running for longer than a minute, feel free to CTRL-C it.
  */
 #include <iostream>
 #include <sstream>
@@ -31,7 +40,6 @@
 #include <chrono>
 #include <fstream>
 #include <string>
-
 #include "../src/relearn.hpp"
 
 /**
@@ -110,11 +118,9 @@ struct world
 using state = relearn::state<grid>;
 using action = relearn::action<direction>;
 
-///
 /// load the gridworld from the text file
 /// boundaries are `occupied` e.g., can't move into them
 /// fire/danger blocks are marked with a reward -1
-///
 world populate()
 {
     std::ifstream infile("../examples/gridworld.txt");
@@ -135,9 +141,7 @@ world populate()
     return environment;
 }
 
-///
 /// Decide on a stochastic (random) direction and return the next grid block
-///
 struct rand_direction
 {
     std::pair<direction,grid> operator()(std::mt19937 & prng,
@@ -226,9 +230,7 @@ std::deque<relearn::link<S,A>> explore(const world & w,
     return episode;
 }
 
-///
 /// Stay On-Policy and execute the action dictated
-///
 template <typename S, typename A>
 void on_policy(const world & w,
                relearn::policy<S,A> & policy_map,
