
Commit aa4bd37

added first version of example blackjack, added online gridworld
1 parent fff6964 commit aa4bd37

6 files changed: +402 -36 lines changed


CMakeLists.txt

Lines changed: 3 additions & 2 deletions
@@ -27,8 +27,9 @@ include_directories(${SRC})
 
 # build examples
 set(EXAMPLES ${EXAMPLES} examples)
-add_executable(gridworld ${EXAMPLES}/gridworld.cpp)
-add_executable(blackjack ${EXAMPLES}/blackjack.cpp)
+add_executable(ex_gridworld_offline ${EXAMPLES}/gridworld_offline.cpp)
+add_executable(ex_gridworld_online ${EXAMPLES}/gridworld_online.cpp)
+add_executable(ex_blackjack ${EXAMPLES}/blackjack.cpp)
 
 # set output
 set(CMAKE_COLOR_MAKEFILE on)

README.md

Lines changed: 21 additions & 4 deletions
@@ -130,6 +130,9 @@ make
 
 There is a folder `examples` which I'm populating with examples, starting from your typical *gridworld* problem,
 and then moving on to a *blackjack* program.
+Currently there is a classical "Gridworld" example, with two versions:
+- an offline on-policy algorithm: `examples/gridworld_offline.cpp` built as `ex_gridworld_offline`
+- an online on-policy algorithm: `examples/gridworld_online.cpp` built as `ex_gridworld_online`
 
 ## Gridworld
 
@@ -140,19 +143,27 @@ which is surrounded by blocks into which he can't move (black colour).
 The agent starts at blue (x:1,y:8) and the target is the green (x:1,y:1).
 The red blocks are fire/danger/a negative reward, and there is a rudimentary maze.
 
-This example uses a staged (stochastic - offline) approach:
+There are two versions of the Gridworld. The offline approach:
 
 - first the agent randomly explores until it can find the positive reward (+1.0) grid block
 - then it updates its policies
 - finally it follows the best policy learnt
 
+And the online approach:
+
+- the agent randomly explores one episode
+- then it updates its policies
+- then it tries again, this time going after known policies
+- it only falls back to random when there does not exist a *positive* best action
+- the entire process is repeated until the goal is discovered.
+
 The actual gridworld is saved in a textfile `gridworld.txt` (feel free to change it).
 The example `src/gridworld.cpp` provides the minimal code to demonstrate this staged approach.
 
 Once we have loaded the world (using function `populate`) we set the start at x:1, y:8 and then
 begin the exploration.
 
-The exploration runs in an inifinite loop in `main` until one criterion is satisfied: the grid block with a **positive** reward is found.
+The exploration runs in an infinite loop until the grid block with a **positive** reward is found.
 Until that happens, the agent takes a *stochastic* (e.g., random) approach and searches the gridworld.
 The function:
 
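A minimal, library-agnostic sketch of the online loop listed in the hunk above, assuming a plain `std::map` Q-table and a 1-D corridor in place of relearn's own types and the 2-D grid: in every episode the agent follows the best known action only when its value is positive, falls back to a random move otherwise, and updates the table from the finished episode. The offline approach differs only in when the update happens: it keeps exploring at random until the goal is first found and only then learns from the collected episodes.

```cpp
// Illustrative sketch only: a hand-rolled Q-table, not the relearn API.
#include <algorithm>
#include <iostream>
#include <map>
#include <random>
#include <utility>
#include <vector>

int main()
{
    const int goal = 9;                      // rightmost cell carries the +1 reward
    const double gamma = 0.9;                // discount factor
    std::mt19937 prng{std::random_device{}()};
    std::uniform_int_distribution<int> coin(0, 1);
    std::map<std::pair<int,int>, double> q;  // (cell, action) -> value; 0 = left, 1 = right

    for (int episode = 0; episode < 100; episode++) {
        std::vector<std::pair<int,int>> visited;
        int s = 0;
        while (s != goal && visited.size() < 100) {
            // follow the best known action only if its value is positive,
            // otherwise fall back to a random move
            int a = (q[{s, 0}] > 0 || q[{s, 1}] > 0)
                    ? (q[{s, 1}] >= q[{s, 0}] ? 1 : 0)
                    : coin(prng);
            visited.push_back({s, a});
            s = std::clamp(s + (a == 1 ? 1 : -1), 0, goal);
        }
        // update after the episode: propagate the terminal reward
        // backwards through the visited (state, action) pairs
        double r = (s == goal) ? 1.0 : 0.0;
        for (auto it = visited.rbegin(); it != visited.rend(); ++it) {
            double & value = q[*it];
            value = std::max(value, r);
            r *= gamma;
        }
        if (s == goal) {
            std::cout << "episode " << episode << ": reached the goal in "
                      << visited.size() << " steps\n";
        }
    }
    return 0;
}
```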

@@ -205,11 +216,17 @@ and to __which__ state that action will lead to.
 
 A simplified attempt, where one player uses classic probabilities, the dealer (house) simply draws until 17,
 and the adaptive agent uses non-deterministic Q-learning in order to play as best as possible.
-This is **WORK IN PROGRESS**
+
+The `state` is very simple: a `hand` which is described by the value (min value and max value, depending on the cards held).
+The agent ignores the dealer's hand since that would increase the state space,
+as well as the label or symbol of the cards held (feel free to change this, simply adapt the "hash" function of `hand`).
+
+This example takes a lot of time to run, as the agent maps the transitional probabilities,
+using the observations from playing multiple games.
 
 ## TODO
 
-1. complete the blackjack example
+1. implement the `boost_serialization` with internal header
 2. do the R-Learning continuous algorithm
 
 [1]: Sutton, R.S. and Barto, A.G., 1998. Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT press
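As the hunk above describes, the adaptive player's state is nothing more than its own hand, reduced to a pair of values and hashed so it can serve as a state key. Below is a minimal, illustrative sketch of that idea; it is not the code from `examples/blackjack.cpp`, whose own `hand::hash()` appears in the diff that follows.

```cpp
// Illustrative sketch only: key a blackjack state by (min value, max value).
#include <cstddef>
#include <functional>
#include <iostream>

struct hand_state
{
    unsigned int min_value;  // counting every Ace as 1
    unsigned int max_value;  // counting an Ace as 11

    std::size_t hash() const
    {
        // boost-style hash combine of the two values; folding in more
        // information (e.g., the dealer's visible card) would enlarge
        // the state space the agent has to learn
        std::size_t seed = std::hash<unsigned int>{}(min_value);
        seed ^= std::hash<unsigned int>{}(max_value)
                + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

int main()
{
    hand_state s{16, 16};  // e.g., a ten and a six, no Ace held
    std::cout << "state hash: " << s.hash() << "\n";
    return 0;
}
```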

examples/blackjack.cpp

Lines changed: 57 additions & 23 deletions
@@ -37,7 +37,6 @@ struct card
     }
 };
 
-//
 // a 52 playing card constant vector with unicode symbols :-D
 const std::deque<card> cards {
     {"Ace", "", {1, 11}}, {"Ace", "", {1, 11}}, {"Ace", "", {1, 11}}, {"Ace", "", {1, 11}},
@@ -78,6 +77,7 @@ struct hand
         return result;
     }
 
+    // calculate value of hand - use min value (e.g., when hold an Ace)
     unsigned int min_value() const
     {
         unsigned int result = 0;
@@ -117,6 +117,7 @@ struct hand
                   cards.begin(), card_compare);
     }
 
+    // hash this hand for relearn
     std::size_t hash() const
     {
         std::size_t seed = 0;
@@ -215,20 +216,36 @@ struct house : public player
 struct client : public player
 {
     // decide on drawing or staying
-    bool draw()
+    bool draw(std::mt19937 & prng,
+              relearn::state<hand> s_t,
+              relearn::policy<relearn::state<hand>,
+                              relearn::action<bool>> & map)
     {
-        // `hand` is publicly inherited
-        // so we can use it to create a new state
-        // and then randomly decide an action (draw/stay)
-        // until we have a best action for a given state
-        return false;
+        auto a_t = map.best_action(s_t);
+        auto q_v = map.best_value(s_t);
+        std::uniform_real_distribution<float> dist(0, 1);
+        // there exists a "best action" and it is positive
+        if (a_t && q_v > 0) {
+            sum_q += q_v;
+            policy_actions++;
+            return a_t->trait();
+        }
+        // there does not exist a "best action"
+        else {
+            random_actions++;
+            return (dist(prng) > 0.5 ? true : false);
+        }
     }
 
     // return a state by casting self to base class
     relearn::state<hand> state() const
     {
         return relearn::state<hand>(*this);
     }
+
+    float random_actions = 0;
+    float policy_actions = 0;
+    float sum_q = 0;
 };
 
 //
@@ -248,61 +265,75 @@ int main(void)
     using state = relearn::state<hand>;
     using action = relearn::action<bool>;
     using link = relearn::link<state,action>;
+
     // policy memory
     relearn::policy<state,action> policies;
     std::deque<std::deque<link>> experience;
 
+    float sum = 0;
+    float wins = 0;
+    std::cout << "starting! Press CTRL-C to stop at any time!"
+              << std::endl;
 start:
-    // play 10 rounds
+    // play 10 rounds - then stop
    for (int i = 0; i < 10; i++) {
+        sum++;
         std::deque<link> episode;
-
         // one card to dealer/house
         dealer->reset_deck();
         dealer->insert(dealer->deal());
-
         // two cards to player
         agent->insert(dealer->deal());
         agent->insert(dealer->deal());
-
         // root state is starting hand
         auto s_t = agent->state();
 
 play:
-        // agent decides to draw
-        if (agent->draw()) {
+        // if agent's hand is burnt skip all else
+        if (agent->min_value() && agent->max_value() > 21) {
+            goto cmp;
+        }
+        // agent decides to draw
+        if (agent->draw(gen, s_t, policies)) {
             episode.push_back(link{s_t, action(true)});
             agent->insert(dealer->deal());
             s_t = agent->state();
             goto play;
         }
+        // agent decides to stay
         else {
             episode.push_back(link{s_t, action(false)});
         }
-
         // dealer's turn
         while (dealer->draw()) {
             dealer->insert(dealer->deal());
         }
 
-        std::cout << "\t\033[1;34m player's hand: ";
-        agent->print();
-        std::cout << "\033[0m";
-        std::cout << "\t\033[1;35m dealer's hand: ";
-        dealer->print();
-        std::cout << "\033[0m\n";
-
+cmp:
+        // compare hands, assign rewards!
         if (hand_compare(*agent, *dealer)) {
-            std::cout << "\033[1;32m player wins (•̀ᴗ•́)\033\[0m\r\n";
+            if (!episode.empty()) {
+                episode.back().state.set_reward(1);
+            }
+            wins++;
         }
         else {
-            std::cout << "\033[1;31m dealer wins (◕︵◕)\033\[0m\r\n";
+            if (!episode.empty()) {
+                episode.back().state.set_reward(-1);
+            }
         }
 
         // clear current hand for both players
         agent->clear();
         dealer->clear();
         experience.push_back(episode);
+        std::cout << "\twin ratio: " << wins / sum << std::endl;
+        std::cout << "\ton-policy ratio: "
+                  << agent->policy_actions / (agent->policy_actions + agent->random_actions)
+                  << std::endl;
+        std::cout << "\tavg Q-value: "
+                  << (agent->sum_q / agent->policy_actions)
+                  << std::endl;
     }
 
     // at this point, we have some playing experience, which we're going to use
@@ -313,6 +344,9 @@ int main(void)
             learner(episode, policies);
         }
     }
+    // clear experience - we'll add new ones!
+    experience.clear();
+    goto start;
 
     return 0;
 }

examples/gridworld.cpp renamed to examples/gridworld_offline.cpp

Lines changed: 9 additions & 7 deletions
@@ -22,6 +22,15 @@
  * This is a deterministic, finite Markov Decision Process (MDP)
  * and the goal is to find an agent policy that maximizes
  * the future discounted reward.
+ *
+ * This version of the Gridworld example uses off-line on-policy decision-making.
+ * What that means is that as the agent moves, it only explores the environment.
+ * It doesn't learn anything until it has finished exploring the environment
+ * and has discovered the "goal" state.
+ *
+ * Due to the nature of PRNG (pseudo-random number generator) this
+ * version can get stuck into repeating the same actions over and over again,
+ * therefore if it is running for longer than a minute, feel free to CTRL-C it.
  */
 #include <iostream>
 #include <sstream>
@@ -31,7 +40,6 @@
 #include <chrono>
 #include <fstream>
 #include <string>
-
 #include "../src/relearn.hpp"
 
 /**
@@ -110,11 +118,9 @@ struct world
 using state = relearn::state<grid>;
 using action = relearn::action<direction>;
 
-///
 /// load the gridworld from the text file
 /// boundaries are `occupied` e.g., can't move into them
 /// fire/danger blocks are marked with a reward -1
-///
 world populate()
 {
     std::ifstream infile("../examples/gridworld.txt");
@@ -135,9 +141,7 @@ world populate()
     return environment;
 }
 
-///
 /// Decide on a stochastic (random) direction and return the next grid block
-///
 struct rand_direction
 {
     std::pair<direction,grid> operator()(std::mt19937 & prng,
@@ -226,9 +230,7 @@ std::deque<relearn::link<S,A>> explore(const world & w,
     return episode;
 }
 
-///
 /// Stay On-Policy and execute the action dictated
-///
 template <typename S, typename A>
 void on_policy(const world & w,
                relearn::policy<S,A> & policy_map,
