
Recommending Chess Openings based on Play Style

Project still in progress

Authors: Adi Srikanth and Will Gorick

Time: Winter 2023

Tags: Chess, Recommender Systems

Overview

General Goal

The general goal of this project is to, for any active lichess user:

  1. Understand their style of play
  2. Approximate their style of play to that of other lichess users
  3. Use play style similarity information to power a recommender system that recommends what opening a user should play based on:
    • What players with similar playing styles play
    • What players with similar playing styles play AND have success with
  4. Use play style similarity information to assign every user a "Celeb GM/IM" they are most similar to.

Tasks

Current Status

We have a working version 1 recommender system powered by around 200K games from lichess. (Note: we pulled almost 70M games, but only a very small fraction contains engine evaluation data, which we use in our recommendations.) Our recommender system has a solid base of features and can reliably recommend openings to a user with game data.

Version 1 Checklist

  • Develop overall goal/plan
  • Formalize data pipeline
  • Data storage
  • Feature Engineering
  • Similarity Scoring
  • Recommender System
  • Celeb GM/IM
  • Front End/Serving

Setup

Environment

You can install the required packages for this project using the command pip install -r requirements.txt

Note: you must use a Python version >= 3.8 in order to meet the PyTorch dependency.

Quickstart

In order to run this without having to run the entire data pipeline, go to the Google Drive and download the files full_feature_df.csv and full_label_df.csv. With these two files, you will be able to run the recommender system.

See the repository structure below for details on where to store the files.

Repository Structure

[ROOT]
│
└───data 
│   |   full_feature_df.csv (must download - available in google drive)
│   |   full_label_df.csv (must download - available in google drive)
│   |   lichess_2019_06.pgn (*)
│   |   lichess_2015_06.pgn (*)
│   |   lichess_2013_06.pgn (*)
│   |   processed_2019_06.json (**)
│   |   processed_2019_06_df.csv (**)
│   |
│   |   * - must download yourself, but only necessary to run full data pipeline
│   |   ** - generated during full data pipeline process
│
└───src
│   |   process_data.py
│   |   format_data.py
│   |   chess_utils.py
│
└───notebooks
│   |   scratch.ipynb
│
└───recommender
│   |   recommender_model.py
│   |   run_recommender.py
│
└───.gitignore
│
└───config.json
│
└───requirements.txt
│
└───README.md

Data Pipeline

Diagram

Describes the process of generating data for the recommender system. It begins with the raw data (from the Lichess database) at the top and ends with the feature dataframe (x data) and the label dataframe (y data) at the bottom.

Entity Descriptions:

  • Orange Bordered Entities: processes, specifically python scripts that take in a data file and output a different data file
  • Purple Thin-bordered Entities: intermediate data files generated by a process and fed into the next process
  • Purple Thick-bordered Entities: permanent files, either the raw data from the Lichess database or the final data files that can be fed directly to the recommender system with no further modifications

Click Image to Enlarge

Step 1: Ingesting Raw Data

We source our data from the lichess open database. Specifically, we download our data in the .pgn.zst format and extract it as one .pgn file.

Next, we run our data through the process_data.py script. This handles the following:

  • Parses the single file from the lichess database and extracts individual games
  • Extracts the user, user elo, opening information, and PGN data for each game
  • Stores each game in a dictionary

In order to run this file, you must first update the config.json file to make sure the data stored under live_run points to your saved pgn file. Specifically, the dataset_file key should point to your pgn filename. The processed_file key can point to any filename, but should generally follow naming conventions.

Additionally, the dataset key points to the name of the dataset, which is also the name of a nested data category in the config file. There, fill in the number of games (which you can find on the lichess database) and the number of lines in your file, which you can compute by running wc -l <filename> in your terminal.
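For illustration, a hypothetical config.json could look roughly like the following. The key names for the game and line counts, and the placeholder values, are assumptions; check the config.json in the repository for the real schema.

{
    "live_run": {
        "dataset": "lichess_2019_06",
        "dataset_file": "data/lichess_2019_06.pgn",
        "processed_file": "data/processed_2019_06.json"
    },
    "lichess_2019_06": {
        "num_games": 1000000,
        "num_lines": 20000000
    }
}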

At this point, you can go ahead and run: python src/process_data.py

Step 2: Further Processing and Wrangling

We additionally include the script format_data.py, which "opens up" our data and stores it in the form of a pandas dataframe. This is useful for doing basic exploratory data analysis on our dataset. Additionally, we are able to easily confirm the number of unique players we have in our dataset.

To run this step, run python src/format_data.py. If you have strayed from the file naming conventions, you may have to adjust the script slightly.

Step 3: Postprocessing

The Jupyter notebook notebooks/postprocess_data.ipynb contains cells that finish up the data pipeline stage. These steps need to be ported over to Python, but this has been deprioritized for the time being.

Running the Initial Dataset section of the notebook will complete the data pipeline process for a fresh data load.

Running the Append Data section of the notebook will combine two independently loaded datasets into one dataset. This is helpful if you choose to load one month of data at a time. In this case, you can load Month A and save a first file, load Month B and save a second file, and then run the Append Data section to combine the two files into one cohesive output.

Utils

The file chess_utils.py contains various utility functions and is called on throughout the data pipeline process.

The file has methods that operate on individual pieces of data, as well as methods that apply those same transformations to an entire dataframe.
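As a rough illustration of that pattern, here is a hedged sketch; the function names and column names are hypothetical and do not reflect the actual API of chess_utils.py.

import pandas as pd

def count_checks(moves: str) -> int:
    """Count checking moves in a SAN move string (a '+' suffix marks a check)."""
    return sum(1 for move in moves.split() if move.endswith("+"))

def add_check_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the single-game transformation to every row of a games dataframe."""
    df = df.copy()
    df["num_checks"] = df["moves"].apply(count_checks)
    return df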

Recommender System

Overview

The recommender system used is fairly simple, and rests on two main components:

  1. Embeddings
  2. Feed-Forward Network

Embeddings

The embeddings are standard PyTorch embeddings. Specifically, they take a single user id and represent it as an n-dimensional vector. This is useful because the embeddings can be learned and fine-tuned for a single user (after multiple rounds of learning), and they can store information specific to that user.
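A minimal sketch of how such an embedding lookup works in PyTorch; the dimensions below are placeholders, not the values used in recommender_model.py.

import torch
import torch.nn as nn

num_users = 10_000     # placeholder: number of unique lichess users in the dataset
embedding_dim = 32     # placeholder: n, the size of the learned per-user vector

user_embedding = nn.Embedding(num_users, embedding_dim)

user_id = torch.tensor([42])           # a single user id
user_vector = user_embedding(user_id)  # shape: (1, embedding_dim)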

Feed Forward Network

The feed-forward network uses three linear layers and ReLU activation functions. The first linear layer takes the user attribute data as input. The next two layers expand dimensionality and ultimately produce an output vector of size p, where p is the number of unique openings we are considering. The goal is for the output vector to predict the opening evaluation a user would end up with if they played a certain opening.
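A sketch of what such a network could look like; the layer sizes here are assumptions, and recommender_model.py defines the actual architecture.

import torch.nn as nn

num_features = 60    # placeholder: size of the user attribute vector
num_openings = 100   # placeholder: p, the number of unique openings considered

feed_forward = nn.Sequential(
    nn.Linear(num_features, 128),   # first layer consumes the user attribute data
    nn.ReLU(),
    nn.Linear(128, 256),            # expand dimensionality
    nn.ReLU(),
    nn.Linear(256, num_openings),   # one predicted opening evaluation per opening
)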

Loss Function

The recommender system is trained using Mean Squared Error Loss as implemented by PyTorch's torch.nn.MSELoss(). We also note that no form of binary or classification loss can be used, because we are interested in predicting a continuous value for the opening evaluation.
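A minimal, self-contained sketch of this loss in use; the random tensors are placeholders standing in for the model's predicted and observed opening evaluations.

import torch
import torch.nn as nn

batch_size, num_openings = 64, 100   # illustrative shapes only

predicted_evals = torch.randn(batch_size, num_openings, requires_grad=True)
true_evals = torch.randn(batch_size, num_openings)

criterion = nn.MSELoss()
loss = criterion(predicted_evals, true_evals)  # mean squared error over every (user, opening) cell
loss.backward()                                # gradients flow back toward the model parameters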

Features

We use the following features in order to describe a user's playing style:

  • opening_eval: aggregate engine eval of the last 3 moves of the opening
  • opening_id_n: indicator variable for whether or not an opening of id n was played
  • move 5 <piece>: number of times each piece type has been moved in the first 5 moves
  • move 10 <piece>: number of times each piece type has been moved between moves 5-10
  • move 15 <piece>: number of times each piece type has been moved between moves 5-15
  • move final <piece>: number of times each piece type has been moved between moves 15-end of game
  • move N captures: number of captures for the four move categories noted above
  • move N checks: number of checks given for the four move categories noted above
  • move N pawn density: number of pawns in the central 16 squares for the four move categories noted above (see the sketch after this list)
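As an example of one of these features, the pawn density count could be computed with the python-chess library roughly as follows; treating the "central 16 squares" as c3 through f6 is an assumption for illustration, not the exact implementation in chess_utils.py.

import chess

# The 4x4 block of squares from c3 to f6 (files c-f, ranks 3-6).
CENTRAL_16 = [chess.square(f, r) for f in range(2, 6) for r in range(2, 6)]

def pawn_density(board: chess.Board) -> int:
    """Number of pawns of either color currently on the central 16 squares."""
    return sum(1 for sq in CENTRAL_16 if board.piece_type_at(sq) == chess.PAWN)

board = chess.Board()
board.push_san("e4")
board.push_san("d5")
print(pawn_density(board))  # 2: White's e4 pawn and Black's d5 pawn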

Files and Code

Architecture

The architecture of the model is created and stored in recommender_model.py. Here we define the following:

  • ChessGamesDataset(): Custom dataset
  • get_dataloader(): dataloader
  • MatrixFactorization(): Model
  • train_epochs(): Training Loop
  • test(): test/validation function
  • predict_user(): function to generate recommendations for a single user

Execution

The actual training of the model is conducted in run_recommender.py. We note here that by default, the GPU accelerator is defined to be mps, which supports Apple silicon MacBooks (M1). If you are using a machine with an NVIDIA GPU, such as a cloud cluster, this should be changed to cuda. Otherwise, the program will fall back to cpu.
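The device selection described above typically follows the standard PyTorch fallback chain, roughly as sketched here; see run_recommender.py for the actual logic.

import torch

if torch.backends.mps.is_available():   # Apple silicon (M1/M2) Macs
    device = torch.device("mps")
elif torch.cuda.is_available():          # NVIDIA GPUs / CUDA machines
    device = torch.device("cuda")
else:
    device = torch.device("cpu")         # CPU fallback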

You can run the recommender by running python recommender/run_recommender.py. The script takes various command line arguments. To clarify this, the argument parsing function is included below:

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", "-e", type=int, default=10)
    parser.add_argument("--batch_size", "-b", type=int, default=64)
    parser.add_argument("--alpha", "-a", type=float, default=0.001)
    parser.add_argument("--write", "-w", type=str, default='Y')
    parser.add_argument("--loss", "-l", type=str, default='MSE')

    return parser.parse_args()

Here is a sample run that utilizes all of the command line arguments:

python recommender/run_recommender.py --epochs 10 --batch_size 128 --alpha 0.01 --write Y --loss MSE

Training Time

Using a Python 3 Google Compute Engine backend (High-Ram and Premium GPU), we can complete 10 epochs of training in around 30 minutes.

Sample Prediction

Below is a sample prediction for the lichess user Igracsaha.

——————————————————————————————————————————————————
Sample Prediction
Recommendations for  Igracsaha
#  1  Opening Recommendation : Queen's Pawn Game
#  2  Opening Recommendation : Italian Game
#  3  Opening Recommendation : English Opening
——————————————————————————————————————————————————

The results here are interesting. A brief OpeningTree analysis of Igracsaha's games shows that they clearly prefer 1. e4, opening over 95% of their games with it. However, some d4-based openings (the #1 recommendation) score well for this user (specifically, 1. Nf3 scores better than 1. e4 for them).

Interestingly enough, the #2 recommendation is the Italian Game. This is in fact the most common opening played by Igracsaha and scores around 53% for this user.
