Briscola is a classic Italian card game. It is played with 40 cards, each worth a certain number of points. In each round you and your opponent play one card each; whoever wins the round collects the points of both cards played. The goal is to have more points than your opponent by the end of the game. In this repository I train two neural networks to play Briscola.
The (ideal) goal is to have an AI that masters Briscola. The actual goal is to have an AI that beats two hard-coded players (the deterministic player and the random player), as well as some previous versions of itself, as often as possible.
I trained a few neural networks with Q-learning; the two most interesting are described below.
Navigate to the folder Briscola gui and run Briscola_app.py (e.g. `python Briscola_app.py`). You need Flask 2.0.1, tensorflow >= 2.4.0, numpy, pandas, and scikit-learn. If you use Windows, you might need to change the localhost address from 0.0.0.0 to 127.0.0.1 in Briscola.html.
In Briscola gui there is a very basic Flask application plus an HTML page to play in your browser against the deterministic player, the random player, or the best MLP model I trained. The server side is in Briscola_app.py; the client side is the HTML page Briscola.html in the folder "templates". A sample of the HTML page is shown below (the UI can be improved :))
(Demo video: briscola_gif.mp4)
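For reference, here is a minimal sketch of how this kind of server/client split looks in Flask; the route names, the request payload, and the `choose_card` helper are hypothetical placeholders, not necessarily what Briscola_app.py actually does.

```python
# Minimal Flask sketch (illustrative; not the actual Briscola_app.py).
from flask import Flask, render_template, request, jsonify

app = Flask(__name__)

def choose_card(state):
    # Hypothetical placeholder: the real app would query one of the trained models here.
    return state.get("hand", [None])[0]

@app.route("/")
def index():
    # Briscola.html lives in the "templates" folder, where Flask looks for templates.
    return render_template("Briscola.html")

@app.route("/play", methods=["POST"])
def play():
    # The client sends the current game state; the server replies with the AI's card.
    state = request.get_json()
    return jsonify({"card": choose_card(state)})

if __name__ == "__main__":
    # On Windows, 127.0.0.1 may be needed instead of 0.0.0.0 (see the note above).
    app.run(host="0.0.0.0", port=5000)
```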
The Q-function the neural network approximates is the function that sends a state-action pair (s, a) to the probability of winning if, at state s, we perform action a (i.e. if we play card a). Below is the math that happens under the hood.
TL;DR: this is Q-learning with gamma = 1 and a reward that is 0 unless I win the game (in which case it is 1), or it is the last hand and the game is a draw (in which case it is 1/2).
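In formulas, under these choices the Q-learning regression target reduces to (a sketch in my own notation, not taken from the code):

```math
y_t =
\begin{cases}
1 & \text{if the game ends at hand } t \text{ and I win,}\\
\tfrac{1}{2} & \text{if the game ends at hand } t \text{ in a draw,}\\
0 & \text{if the game ends at hand } t \text{ and I lose,}\\
\max_{a'} Q(s_{t+1}, a') & \text{otherwise (since } \gamma = 1 \text{ and intermediate rewards are } 0\text{),}
\end{cases}
```

and the network is trained to minimize $(Q(s_t, a_t) - y_t)^2$, so $Q(s, a)$ approximates the probability of winning when playing card $a$ in state $s$.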
I trained both a GRU model and an MLP model; the deepest GRU model (4 GRU layers + a dense layer) outperforms the others.
The function the neural network approximates sends a state-action pair (s, a) to the expectation of the discounted sum of the number of points I make. The math under the hood is very similar to the one above; I report the salient steps below.
TL;DR: this is Q-learning with gamma = .8 and .9 (I tried both), with the reward being the number of points I win or lose at each hand.
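In this case the target takes the usual discounted form (again a sketch in my own notation):

```math
y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'), \qquad \gamma \in \{0.8,\ 0.9\},
```

where $r_t$ is the number of points I win (positive) or lose (negative) at hand $t$, and on the last hand the target is simply $y_t = r_t$.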
I trained both a GRU model and an MLP model; here the dense model (3 dense layers with tanh activation) outperforms the others.
All the models I trained have an average winning rate between 80% and 90% vs the random player, and between 70% and 80% vs the deterministic "greedy" player.
The first model I trained was the GRU model, with the neural network approximating the probability of winning. Afterwards I trained a simpler MLP model, with the neural network approximating the discounted sum of the expected number of points, and this model beat the previously trained GRU model around 60% of the time. I then further fine-tuned the simpler MLP model (gamma = .9 seems the best choice) to get the MLP model in the folder MLP_best_model.
Below is the window = 10 rolling average of the fraction of games lost during training for the best GRU player. It was playing vs another previously trained model (green), vs the deterministic "greedy" player (orange), and vs the random player (blue). The results below are after 58K steps of gradient descent.

Below is the window = 10 rolling average of the fraction of games lost during training for the best MLP player. It was playing vs another previously trained model (green; in this case, the first MLP that beat the GRU player), vs the deterministic "greedy" player (orange), and vs the random player (blue). The results below are after 75K steps of gradient descent.
In the folder metrics you can find similar graphs from when the first MLP player beat the GRU player.
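The curves above are plain rolling means of per-game outcomes; here is a minimal sketch of how such a curve can be computed, assuming one 0/1 "game lost" indicator per game (the data below is dummy, purely for illustration):

```python
# Sketch: window = 10 rolling average of the fraction of games lost.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # only needed for plotting, not listed in the requirements above

rng = np.random.default_rng(0)
lost = pd.Series(rng.integers(0, 2, size=200))   # dummy 0/1 "game lost" indicators
rolling_loss = lost.rolling(window=10).mean()    # fraction of games lost, window = 10

plt.plot(rolling_loss)
plt.xlabel("game")
plt.ylabel("fraction of games lost (window = 10)")
plt.show()
```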
Longer training would probably lead to better results (in the graph above, it looks like the MLP model is still slowly improving). I suspect that fine-tuning the GRU model would outperform the MLP model (however, training, and consequently hyperparameter tuning, takes much longer for the GRU model).
One could try different reinforcement learning algorithms (actor-critic?).
The training directory contains the code used to train the neural networks.
There are four player classes: random player, deterministic player, human player, and deep player.
Random player: plays randomly.
Deterministic player: a greedy player. If it plays first, it plays the card worth the fewest points. If it plays second and it can win the hand, it plays the lowest-point card among those that win the hand; if it cannot win, it plays the card worth the fewest points (see the sketch after this list).
Human player: lets you play from the command line.
Deep player: a class that plays using a neural network.
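For concreteness, here is a rough sketch of the greedy rule described above. It is not the repo's implementation: the (rank, suit) card encoding and the helper functions are illustrative, based on the standard Briscola point values and card ordering.

```python
# Sketch of the greedy rule (illustrative, not the actual deterministic player).
# Cards are (rank, suit) tuples from the Italian 40-card deck,
# with 8 = jack, 9 = knight, 10 = king.

POINTS = {1: 11, 3: 10, 10: 4, 9: 3, 8: 2}                          # ace, three, king, knight, jack
STRENGTH = {1: 9, 3: 8, 10: 7, 9: 6, 8: 5, 7: 4, 6: 3, 5: 2, 4: 1, 2: 0}

def points(card):
    rank, _ = card
    return POINTS.get(rank, 0)

def beats(card, first_card, briscola_suit):
    """True if `card`, played second, wins the hand against `first_card`."""
    if card[1] == first_card[1]:
        return STRENGTH[card[0]] > STRENGTH[first_card[0]]
    return card[1] == briscola_suit   # off-suit wins only if it is a briscola

def greedy_choice(hand, first_card=None, briscola_suit=None):
    if first_card is None:
        return min(hand, key=points)              # playing first: dump the cheapest card
    winning = [c for c in hand if beats(c, first_card, briscola_suit)]
    if winning:
        return min(winning, key=points)           # cheapest card that still wins the hand
    return min(hand, key=points)                  # cannot win: dump the cheapest card
```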
There are two neural networks, MyModel and MyModel_dense. The first one (with compute_prob_winning = True, simplified = False) is the architecture of the best model I trained to estimate the probability of winning. The second one (with compute_prob_winning = False) is the architecture of the best model I trained to estimate the number of points.
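As a rough sketch, based only on the descriptions above (4 GRU layers + a dense layer for the probability-of-winning model, 3 tanh dense layers for the points model), the two kinds of architecture could look like the following; layer sizes and the output activations are placeholders, and the real MyModel / MyModel_dense may differ in the details.

```python
# Illustrative Keras sketches of the two kinds of architecture described above.
import tensorflow as tf
from tensorflow.keras import layers, models

def gru_q_model(timesteps, n_features):
    """GRU stack + dense head, output in [0, 1] (probability of winning)."""
    return models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.GRU(64, return_sequences=True),
        layers.GRU(64, return_sequences=True),
        layers.GRU(64, return_sequences=True),
        layers.GRU(64),
        layers.Dense(1, activation="sigmoid"),
    ])

def dense_q_model(n_features):
    """3 tanh dense layers, linear output (expected discounted points)."""
    return models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(128, activation="tanh"),
        layers.Dense(128, activation="tanh"),
        layers.Dense(128, activation="tanh"),
        layers.Dense(1),
    ])
```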
Contains the environment (i.e. the Briscola game).
Gets next states from a batch of games, and encodes a game for the neural network. For the encoding, each card (including the case where I have no card) is encoded as a one-hot vector.
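A minimal sketch of that kind of encoding, assuming cards are indexed 0–39 with one extra index reserved for "no card" (the exact layout used in the repo may differ):

```python
import numpy as np

N_CARDS = 40
NO_CARD = N_CARDS  # extra index meaning "no card in this slot"

def one_hot_card(card_index):
    """Encode a card index (or NO_CARD) as a length-41 one-hot vector."""
    v = np.zeros(N_CARDS + 1, dtype=np.float32)
    v[card_index] = 1.0
    return v

# Example: the 13th card of the deck, and an empty slot.
print(one_hot_card(12).argmax())       # 12
print(one_hot_card(NO_CARD).argmax())  # 40
```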
Contains simulate_games_and_record_data, which simulates a number of games and records the data in a pandas DataFrame. It also contains simulate_games, which simulates the games without returning a DataFrame and instead returns the ratio player_2_wins / number_of_simulations.
The best model's weights.
The notebooks with a sample training loop.
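For orientation, here is a very condensed sketch of what a Q-learning training loop of this kind typically looks like; the helper names (`simulate_batch`, `encode`, the `"legal_actions"` field) are hypothetical placeholders and not the ones used in the notebooks.

```python
# Condensed Q-learning loop sketch (hypothetical helpers, not the notebooks' code).
import numpy as np
import tensorflow as tf

def training_loop(model, simulate_batch, encode, n_iterations, gamma=0.9):
    """simulate_batch() -> list of transitions (state, action, reward, next_state, done);
    encode(state, action) -> feature vector for the network."""
    optimizer = tf.keras.optimizers.Adam(1e-3)
    for it in range(n_iterations):
        transitions = simulate_batch()
        x, y = [], []
        for state, action, reward, next_state, done in transitions:
            if done:
                target = reward
            else:
                # Bootstrap with the best Q-value over the legal next actions.
                next_qs = [model(encode(next_state, a)[None])[0, 0]
                           for a in next_state["legal_actions"]]
                target = reward + gamma * max(next_qs)
            x.append(encode(state, action))
            y.append(target)
        x, y = np.stack(x), np.array(y, dtype=np.float32)
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean((model(x)[:, 0] - y) ** 2)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
```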
Some pictures.