Skip to content

Reinforcement Learning agent trained to play variations of Wuziqi based on implementation of AlphaZero and AlphaGo Zero.

Notifications You must be signed in to change notification settings

ruiawang/Reinforcement-Learning-Wuziqi

Repository files navigation

Reinforcement-Learning-Wuziqi

Reinforcement Learning agent implementation of AlphaZero and AlphaGo Zero to play variations of Wuziqi (Gomoku).

Based on the papers:

Introduction:

Wuziqi (五子棋) is a board game (typically) played on a $15$ by $15$ board where the aim is to connect 5 of your stones in a row. Its other names are Gomoku/Gobang in other languages.

This game, like tic-tac-toe, is an example of an m,n,k-game. Tic-tac-toe for example, is 3,3,3, while traditional Wuziqi is 15,15,5.

I used to play this game and slight variations with my grandfather as a kid. One such variation was to connect 4 in a row to win rather than 5, that we called Siziqi (四子棋).

In the interest of computation and speed, I have condensed the board size to $6$ by $6$ for Siziqi, and $7$ by $7$ for Wuziqi for our models to train on.

Reinforcement Learning Agent and Training

As outlined in the papers, Alpha(Go) Zero uses reinforcement learning, based on Monte Carlo Tree Search combined with a policy-value neural network to guide decisions and moves. While the rules of Go, Chess, and Wuziqi are all different, the algorithm is generalizable to manny different games.

However, since Wuziqi has much simpler rules and conditions, I simplified the neural network architecture that was outlined in the AlphaGo Zero paper but still maintained the overall structure.

The neural network learns on previous self-play data, of which is augmented through rotations and mirrored flips of the board state, which is using the fact that Wuziqi, like Go, is invariant under rotation and flips, a technique also used in the training of AlphaGo Zero to play Go.

As mentioned, the RL agent employs Monte Carlo Tree Search with a policy-value neural network. It plays against itself by playing against a basic instance of Monte Carlo Tree Search, with no neural network, with a larger number of playouts (1000, 2000, 3000, etc.) compared to the RL agent's number of playouts (400).

The agent is trained in a total of $750$ self-play games.

Results:

Siziqi model managed to beat basic MCTS 10-0:

  • with 1000 playouts by 550 plays.
  • with 2000 playouts by 650 plays.
  • with 3000 playouts by 750 plays.

Wuziqi model managed to beat basic MCTS:

  • with 1000 playouts with 8-2 record by 400 plays.

Files:

  • Siziqi models: Folder that contains the models trained to play Siziqi (6,6,4-game).
  • Wuziqi models: Folder that contains the models trained to play Wuziqi (7,7,5-game).
  • fig: Plots of Wuziqi model's loss and entropy during training.
  • MonteCarloTreeSearch.py: Implementation of the Monte Carlo Tree Search agent/player applied to the games used in the Alpha(Go) Zero papers
  • MonteCarloTreeSearchBasic.py: A basic implementation of Monte Carlo Tree Search to serve as the opponent to the RL agent.
  • game.py: Implementation of the rules and board for general m,n,k-games.
  • playtest.py: Playtest the trained models at the game.
  • policy_value_network.py: Implementation of the neural network used by the RL agent.
  • train.py: Implementation of the training pipeline for the agent.

About

Reinforcement Learning agent trained to play variations of Wuziqi based on implementation of AlphaZero and AlphaGo Zero.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages