chore: add pre-commit hooks
ericrallen committed Jan 24, 2025
1 parent 94d2be1 commit acf3c2c
Showing 10 changed files with 352 additions and 6 deletions.
55 changes: 55 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,55 @@
default_install_hook_types: [pre-commit, commit-msg, pre-push]

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-merge-conflict
        name: No merge conflict markers
        stages: [pre-commit]
      - id: mixed-line-ending
        name: No mixed line endings (LF and CRLF)
        stages: [pre-commit]

  - repo: https://github.com/psf/black
    rev: 22.6.0
    hooks:
      - id: black
        stages: [pre-commit]
        types: [python]

  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.0.262
    hooks:
      - id: ruff
        args: [--fix]
        stages: [pre-commit]
        types: [python]

  - repo: local
    hooks:
      - id: mypy-all
        name: MyPy (all)
        pass_filenames: false
        files: pgn-tokenizer/
        entry: uv run mypy . --disable-error-code=import-untyped
        stages: [pre-push]
        language: system
      - id: pytest
        name: pytest
        stages: [pre-push]
        language: system
        entry: uv run pytest
        types: [python]
        pass_filenames: false
        verbose: true

  # conventional commits
  - repo: https://github.com/espressif/conventional-precommit-linter
    rev: v1.6.0
    hooks:
      - id: conventional-precommit-linter
        stages: [commit-msg]
        args:
          - --types=chore,feat,fix,ci,docs,refactor,revert,test
          - --allow-breaking
71 changes: 71 additions & 0 deletions README.md
@@ -2,3 +2,74 @@

This is a Byte Pair Encoding (BPE) tokenizer for chess Portable Game Notation (PGN).

It uses the [`tokenizers`](https://huggingface.co/docs/tokenizers/) library from Hugging Face for training the tokenizer and the [`transformers`](https://huggingface.co/docs/transformers/) library for initializing the tokenizer from the pretrained tokenizer model for faster tokenization.
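
In rough terms, training (with `tokenizers`) produces a serialized tokenizer model, and loading (with `transformers`) wraps that model in a fast tokenizer for encoding and decoding. The sketch below is a simplified illustration of that second step, not the package's exact code; the `tokenizer.json` file name is a placeholder (see `src/pgn_tokenizer/__init__.py` for the real loading logic).

```python
from transformers import PreTrainedTokenizerFast

# Load the pretrained tokenizer model produced by the training step
# ("tokenizer.json" is a placeholder path for this sketch)
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

print(fast_tokenizer.encode("1.e4 Nf6"))
```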

**Note**: This is part of a work-in-progress project to investigate how language models might understand chess without an engine or any chess-specific knowledge.

## Tokenizer Comparison

More traditional, language-focused BPE tokenizer implementations are not well suited to PGN strings because they are likely to break the individual moves apart.

For example, `1.e4 Nf6` would likely be tokenized as `1`, `.`, `e`, `4`, ` N`, `f`, `6` or `1`, `.e`, `4`, ` `, ` N`, `f`, `6` depending on the tokenizer's vocabulary, whereas the specialized PGN tokenizer tokenizes it as `1.`, `e4`, ` Nf6`.
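
A quick way to see the difference is to decode each token id back to its string piece, as in the rough sketch below (it assumes `tiktoken` and `pgn-tokenizer` are installed; the exact splits depend on each vocabulary):

```python
import tiktoken

from pgn_tokenizer import PGNTokenizer

pgn = "1.e4 Nf6 2.e5 Nd5 3.c4 Nb6"

# General-purpose BPE vocabulary (used by the gpt-3.5-turbo / gpt-4 tokenizers)
enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode_single_token_bytes(token) for token in enc.encode(pgn)])

# Chess-specific BPE vocabulary
tokenizer = PGNTokenizer()
print([tokenizer.tokenizer.decode([token_id]) for token_id in tokenizer.encode(pgn)])
```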

### Visualization

Here is a visualization of this specialized PGN tokenizer's vocabulary compared to the BPE vocabularies of `cl100k_base` (used by the `gpt-3.5-turbo` and `gpt-4` tokenizers) and `o200k_base` (used by the `gpt-4o` tokenizer):

#### PGN Tokenizer

![PGN Tokenizer Visualization](./docs/assets/pgn-tokenizer.png)

**Note**: The tokenizer was trained on ~2.8 million chess games in PGN notation with a target vocabulary size of `4096`.

#### GPT-3.5-turbo and GPT-4 Tokenizers

![GPT-4 Tokenizer Visualization](./docs/assets/gpt-4-tokenizer.png)

#### GPT-4o Tokenizer

![GPT-4o Tokenizer Visualization](./docs/assets/gpt-4o-tokenizer.png)

These were all generated with a function adapted from an [educational Jupyter Notebook in the `tiktoken` repository](https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186).
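
A hypothetical, minimal version of that kind of function, printing each token with a cycling ANSI background color so the merge boundaries are visible, might look like the sketch below (the helper name and color codes are illustrative, not the actual function used to generate the figures):

```python
from pgn_tokenizer import PGNTokenizer

# Illustrative 256-color background codes; any distinct colors will do
BACKGROUNDS = [167, 179, 185, 77, 80, 68, 134]


def show_tokens(text: str) -> None:
    tokenizer = PGNTokenizer()
    pieces = [tokenizer.tokenizer.decode([token_id]) for token_id in tokenizer.encode(text)]
    colored = [
        f"\x1b[48;5;{BACKGROUNDS[i % len(BACKGROUNDS)]}m{piece}\x1b[0m"
        for i, piece in enumerate(pieces)
    ]
    print("".join(colored))


show_tokens("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")
```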

## Installation

You can install it with your package manager of choice:

### uv

```bash
uv add pgn-tokenizer
```

### pip

```bash
pip install pgn-tokenizer
```

## Usage

It exposes a simple interface with `.encode()` and `.decode()` methods and a `.vocab_size` property, but you can also access the underlying `PreTrainedTokenizerFast` instance from the `transformers` library via the `.tokenizer` property.

```python
from pgn_tokenizer import PGNTokenizer

# Initialize the tokenizer
tokenizer = PGNTokenizer()

# Tokenize a PGN string
tokens = tokenizer.encode("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")

# Decode the tokens back to a PGN string
decoded = tokenizer.decode(tokens)

# get vocab from underlying tokenizer class
vocab = tokenizer.tokenizer.get_vocab()
```

## Acknowledgements

- [@karpathy](https://github.com/karpathy) for the [Let's build the GPT Tokenizer tutorial](https://youtu.be/zduSFxRajkE)
- [Hugging Face](https://huggingface.co/) for the [`tokenizers`](https://huggingface.co/docs/tokenizers/) and [`transformers`](https://huggingface.co/docs/transformers/) libraries.
- Kaggle user [MilesH14](https://www.kaggle.com/milesh14), whoever you are, for the now-missing dataset of 3.5 million chess games referenced in many places, including this [research documentation](https://chess-research-project.readthedocs.io/en/latest/)
41 changes: 41 additions & 0 deletions docs/CONTRIBUTING.md
@@ -0,0 +1,41 @@
# Contributor Guide

**Note**: More coming soon. This is a work in progress, and the underlying dataset was just deleted from Kaggle.

1. [Fork the `DVDAGames/pgn-tokenizer` repository](https://github.com/DVDAGames/pgn-tokenizer/fork)
2. Clone your fork:

```bash
git clone git@github.com:<your-username>/pgn-tokenizer.git
```

3. Install dependencies:

```bash
uv sync
```

4. Make your changes.
5. Test your changes:

```bash
uv run pytest
```

6. Commit your changes using [Conventional Commits](https://www.conventionalcommits.org/)
7. Push to your fork:

```bash
git push origin <your-branch>
```

8. Create a [Pull Request](https://github.com/DVDAGames/pgn-tokenizer/compare)
9. Wait for review, approval, and a virtual high five.

## Development Scripts

There are a few scripts in the `scripts` directory that can be useful for development:

- `clean-dataset.py`: Cleans the original weirdness out of the dataset's PGN notation
- `format-dataset-for-training.py`: Formats the cleaned dataset for tokenizer training by adding the special `[g_start]` and `[g_end]` tokens to the beginning and end of each game
- `train.py`: Trains the tokenizer on the formatted dataset and saves the model (a rough sketch of these last two steps follows below)
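
Under the hood, a minimal sketch of what those last two steps amount to with the Hugging Face `tokenizers` library might look like this (the sample games, output file name, and trainer options are illustrative only; the real scripts read the dataset from disk and may configure pre-tokenization and other options differently):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# format-dataset-for-training.py (roughly): wrap each cleaned game in the
# special game-boundary tokens
games = ["1.e4 Nf6 2.e5 Nd5 3.c4 Nb6", "1.d4 d5 2.c4 e6"]
formatted = [f"[g_start]{game}[g_end]" for game in games]

# train.py (roughly): train a BPE model with the target vocabulary size and
# save the resulting tokenizer model
tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(vocab_size=4096, special_tokens=["[g_start]", "[g_end]"])
tokenizer.train_from_iterator(formatted, trainer=trainer)
tokenizer.save("tokenizer.json")
```
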
Binary file added docs/assets/gpt-4-tokenizer.png
Binary file added docs/assets/gpt-4o-tokenizer.png
Binary file added docs/assets/pgn-tokenizer.png
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -25,8 +25,12 @@ build-backend = "hatchling.build"

[dependency-groups]
dev = [
"black>=24.10.0",
"datasets>=3.2.0",
"kagglehub>=0.3.6",
"mypy>=1.14.1",
"polars>=1.20.0",
"pre-commit>=4.1.0",
"ruff>=0.9.3",
"tokenizers>=0.21.0",
]
2 changes: 1 addition & 1 deletion scripts/train.py
@@ -17,7 +17,7 @@
)

TRAINING_DATA_PATH = f"./.data/datasets/{DATASET_NAME}/"
OUTPUT_PATH = "./src/pgn_tokenizer"
OUTPUT_PATH = "./src/pgn_tokenizer/config"

FULL_DATASET_PATH = f"{TRAINING_DATA_PATH}/full"
SAMPLE_DATASET_PATH = f"{TRAINING_DATA_PATH}/sample"
8 changes: 3 additions & 5 deletions src/pgn_tokenizer/__init__.py
@@ -1,15 +1,14 @@
import os
from pathlib import Path

# HACK: suppress the warning about pytorch, jax, et al. from transformers import logging
# because we are only importing a tokenizer and using the transformers library for the
# underlying PreTrainedFastTokenizer functionality
os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "1"

from pathlib import Path

from transformers import PreTrainedTokenizerFast
from transformers import PreTrainedTokenizerFast # noqa: E402

from pgn_tokenizer.constants import DATASET_NAME
from pgn_tokenizer.constants import DATASET_NAME # noqa: E402

base_path = Path(__file__).parent

@@ -23,7 +22,6 @@ def __init__(self):
        self.tokenizer = PreTrainedTokenizerFast(
            tokenizer_file=str(tokenizer_config_path)
        )
        self.vocab = self.tokenizer.get_vocab()
        self.vocab_size = self.tokenizer.vocab_size
        self.encode = self.tokenizer.encode
        self.decode = self.tokenizer.decode
