This is a Byte Pair Encoding (BPE) tokenizer for chess Portable Game Notation (PGN).
You can install it with your package manager of choice:

```sh
# with uv
uv add pgn-tokenizer

# or with pip
pip install pgn-tokenizer
```
It exposes a simple interface with `.encode()` and `.decode()` methods and a `.vocab_size` property, but you can also access the underlying `PreTrainedTokenizerFast` class from the `transformers` library via the `.tokenizer` property.
```python
from pgn_tokenizer import PGNTokenizer

# Initialize the tokenizer
tokenizer = PGNTokenizer()

# Tokenize a PGN string
tokens = tokenizer.encode("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")

# Decode the tokens back to a PGN string
decoded = tokenizer.decode(tokens)

# Get the vocabulary from the underlying tokenizer class
vocab = tokenizer.tokenizer.get_vocab()
```
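The `.vocab_size` property can be checked the same way; it should correspond to the 4096-entry target vocabulary described in the training note below, though it is worth verifying against the installed model.

```python
# Check the vocabulary size exposed by the tokenizer
# (expected to be 4096 per the training note below)
print(tokenizer.vocab_size)
```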
It uses the `tokenizers` library from Hugging Face for training the tokenizer, and the `transformers` library for initializing the tokenizer from the pretrained tokenizer model for faster tokenization.
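As a rough sketch of that second step (not the package's actual internals), a trained tokenizer file can be wrapped in `PreTrainedTokenizerFast`; the filename here is a hypothetical placeholder:

```python
from transformers import PreTrainedTokenizerFast

# Load a previously trained tokenizer file ("pgn-tokenizer.json" is a placeholder name)
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="pgn-tokenizer.json")

# The fast (Rust-backed) tokenizer handles encoding and decoding
ids = fast_tokenizer.encode("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")
print(fast_tokenizer.decode(ids))
```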
Note: This is part of a work-in-progress project to investigate how language models might understand chess without an engine or any chess-specific knowledge.
More traditional, language-focused BPE tokenizer implementations are not well suited for PGN strings because they are likely to break the actual moves apart. For example, `1.e4 Nf6` would likely be tokenized as `1`, `.`, `e`, `4`, `N`, `f`, `6` or `1`, `.e`, `4`, ` `, `N`, `f`, `6` depending on the tokenizer's vocabulary, but with this specialized PGN tokenizer it is tokenized as `1.`, `e4`, `Nf6`.
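To see the difference yourself, assuming `tiktoken` is installed, you can decode each `cl100k_base` token id individually and compare the pieces with the PGN tokenizer's output (the exact `cl100k_base` split depends on its vocabulary):

```python
import tiktoken

from pgn_tokenizer import PGNTokenizer

# General-purpose BPE vocabulary: moves get split into fragments
enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([i]) for i in enc.encode("1.e4 Nf6")])

# Specialized PGN vocabulary: move numbers and moves stay intact
pgn = PGNTokenizer()
print([pgn.decode([i]) for i in pgn.encode("1.e4 Nf6")])
```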
Here is a visualization of the vocabulary of this specialized PGN tokenizer compared to the BPE tokenizer vocabularies of `cl100k_base` (the vocabulary for the `gpt-3.5-turbo` and `gpt-4` models' tokenizer) and `o200k_base` (the vocabulary for the `gpt-4o` model's tokenizer):
Note: The tokenizer was trained on ~2.8 million chess games in PGN notation with a target vocabulary size of `4096`.
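For context, training a BPE tokenizer with a 4096-token target vocabulary using the `tokenizers` library looks roughly like the sketch below; the file path, pre-tokenization rule, and special tokens are assumptions and may differ from this project's actual training setup.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import BpeTrainer

# Hypothetical training configuration -- the project's real setup may differ
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Split only on whitespace so move-number/move pairs like "1.e4" stay in one chunk
tokenizer.pre_tokenizer = WhitespaceSplit()

trainer = BpeTrainer(vocab_size=4096, special_tokens=["[UNK]"])
tokenizer.train(files=["games.pgn"], trainer=trainer)  # "games.pgn" is a placeholder path

tokenizer.save("pgn-tokenizer.json")
```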
Note: These visualizations were generated with a function adapted from an educational Jupyter Notebook in the `tiktoken` repository.
- @karpathy for the "Let's build the GPT Tokenizer" tutorial
- Hugging Face for the `tokenizers` and `transformers` libraries
- Kaggle user MilesH14, whoever you are, for the now-missing dataset of 3.5 million chess games referenced in many places, including this research documentation