chore: add pre-commit hooks
ericrallen committed Jan 24, 2025
1 parent 94d2be1 commit acf3c2c
Showing 10 changed files with 352 additions and 6 deletions.
55 changes: 55 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,55 @@
default_install_hook_types: [pre-commit, commit-msg, pre-push]

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-merge-conflict
        name: No merge conflict markers
        stages: [pre-commit]
      - id: mixed-line-ending
        name: No mixed line endings (LF and CRLF)
        stages: [pre-commit]

  - repo: https://github.com/psf/black
    rev: 22.6.0
    hooks:
      - id: black
        stages: [pre-commit]
        types: [python]

  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.0.262
    hooks:
      - id: ruff
        args: [--fix]
        stages: [pre-commit]
        types: [python]

  - repo: local
    hooks:
      - id: mypy-all
        name: MyPy (all)
        pass_filenames: false
        files: pgn-tokenizer/
        entry: uv run mypy . --disable-error-code=import-untyped
        stages: [pre-push]
        language: system
      - id: pytest
        name: pytest
        stages: [pre-push]
        language: system
        entry: uv run pytest
        types: [python]
        pass_filenames: false
        verbose: true

  # conventional commits
  - repo: https://github.com/espressif/conventional-precommit-linter
    rev: v1.6.0
    hooks:
      - id: conventional-precommit-linter
        stages: [commit-msg]
        args:
          - --types=chore,feat,fix,ci,docs,refactor,revert,test
          - --allow-breaking
71 changes: 71 additions & 0 deletions README.md
@@ -2,3 +2,74 @@

This is a Byte Pair Encoding (BPE) tokenizer for chess Portable Game Notation (PGN).

It uses the [`tokenizers`](https://huggingface.co/docs/tokenizers/) library from Hugging Face for training the tokenizer and the [`transformers`](https://huggingface.co/docs/transformers/) library for initializing the tokenizer from the pretrained tokenizer model for faster tokenization.
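
In rough terms, training (with `tokenizers`) produces a serialized tokenizer model, and loading (with `transformers`) wraps that model in a fast tokenizer for encoding and decoding. The sketch below is a simplified illustration of that second step, not the package's exact code; the `tokenizer.json` file name is a placeholder (see `src/pgn_tokenizer/__init__.py` for the real loading logic).

```python
from transformers import PreTrainedTokenizerFast

# Load the pretrained tokenizer model produced by the training step
# ("tokenizer.json" is a placeholder path for this sketch)
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

print(fast_tokenizer.encode("1.e4 Nf6"))
```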

**Note**: This is part of a work-in-progress project to investigate how language models might understand chess without an engine or any chess-specific knowledge.

## Tokenizer Comparison

More traditional, language-focused BPE tokenizer implementations are not well suited to PGN strings because they are likely to break the individual moves apart.

For example, `1.e4 Nf6` would likely be tokenized as `1`, `.`, `e`, `4`, ` N`, `f`, `6` or `1`, `.e`, `4`, ` `, ` N`, `f`, `6` depending on the tokenizer's vocabulary, whereas the specialized PGN tokenizer tokenizes it as `1.`, `e4`, ` Nf6`.
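
A quick way to see the difference is to decode each token id back to its string piece, as in the rough sketch below (it assumes `tiktoken` and `pgn-tokenizer` are installed; the exact splits depend on each vocabulary):

```python
import tiktoken

from pgn_tokenizer import PGNTokenizer

pgn = "1.e4 Nf6 2.e5 Nd5 3.c4 Nb6"

# General-purpose BPE vocabulary (used by the gpt-3.5-turbo / gpt-4 tokenizers)
enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode_single_token_bytes(token) for token in enc.encode(pgn)])

# Chess-specific BPE vocabulary
tokenizer = PGNTokenizer()
print([tokenizer.tokenizer.decode([token_id]) for token_id in tokenizer.encode(pgn)])
```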

### Visualization

Here is a visualization of this specialized PGN tokenizer's vocabulary compared to the BPE vocabularies of `cl100k_base` (used by the `gpt-3.5-turbo` and `gpt-4` tokenizers) and `o200k_base` (used by the `gpt-4o` tokenizer):

#### PGN Tokenizer

![PGN Tokenizer Visualization](./docs/assets/pgn-tokenizer.png)

**Note**: The tokenizer was trained on ~2.8 million chess games in PGN notation with a target vocabulary size of `4096`.

#### GPT-3.5-turbo and GPT-4 Tokenizers

![GPT-4 Tokenizer Visualization](./docs/assets/gpt-4-tokenizer.png)

#### GPT-4o Tokenizer

![GPT-4o Tokenizer Visualization](./docs/assets/gpt-4o-tokenizer.png)

These were all generated with a function adapted from an [educational Jupyter Notebook in the `tiktoken` repository](https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186).
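
A hypothetical, minimal version of that kind of function, printing each token with a cycling ANSI background color so the merge boundaries are visible, might look like the sketch below (the helper name and color codes are illustrative, not the actual function used to generate the figures):

```python
from pgn_tokenizer import PGNTokenizer

# Illustrative 256-color background codes; any distinct colors will do
BACKGROUNDS = [167, 179, 185, 77, 80, 68, 134]


def show_tokens(text: str) -> None:
    tokenizer = PGNTokenizer()
    pieces = [tokenizer.tokenizer.decode([token_id]) for token_id in tokenizer.encode(text)]
    colored = [
        f"\x1b[48;5;{BACKGROUNDS[i % len(BACKGROUNDS)]}m{piece}\x1b[0m"
        for i, piece in enumerate(pieces)
    ]
    print("".join(colored))


show_tokens("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")
```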

## Installation

You can install it with your package manager of choice:

### uv

```bash
uv add pgn-tokenizer
```

### pip

```bash
pip install pgn-tokenizer
```

## Usage

It exposes a simple interface with `.encode()` and `.decode()` methods and a `.vocab_size` property, but you can also access the underlying `PreTrainedTokenizerFast` instance from the `transformers` library via the `.tokenizer` property.

```python
from pgn_tokenizer import PGNTokenizer

# Initialize the tokenizer
tokenizer = PGNTokenizer()

# Tokenize a PGN string
tokens = tokenizer.encode("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")

# Decode the tokens back to a PGN string
decoded = tokenizer.decode(tokens)

# get vocab from underlying tokenizer class
vocab = tokenizer.tokenizer.get_vocab()
```

## Acknowledgements

- [@karpathy](https://github.com/karpathy) for the [Let's build the GPT Tokenizer tutorial](https://youtu.be/zduSFxRajkE)
- [Hugging Face](https://huggingface.co/) for the [`tokenizers`](https://huggingface.co/docs/tokenizers/) and [`transformers`](https://huggingface.co/docs/transformers/) libraries.
- Kaggle user [MilesH14](https://www.kaggle.com/milesh14), whoever you are, for the now-missing dataset of 3.5 million chess games referenced in many places, including this [research documentation](https://chess-research-project.readthedocs.io/en/latest/)
41 changes: 41 additions & 0 deletions docs/CONTRIBUTING.md
@@ -0,0 +1,41 @@
# Contributor Guide

**Note**: More coming soon. This is a work in progress, and the underlying dataset was just deleted from Kaggle.

1. [Fork the `DVDAGames/pgn-tokenizer` repository](https://github.com/DVDAGames/pgn-tokenizer/fork)
2. Clone your fork:

```bash
git clone git@github.com:<your-username>/pgn-tokenizer.git
```

3. Install dependencies:

```bash
uv sync
```

4. Make your changes.
5. Test your changes:

```bash
uv run pytest
```

6. Commit your changes using [Conventional Commits](https://www.conventionalcommits.org/)
7. Push to your fork:

```bash
git push origin <your-branch>
```

8. Create a [Pull Request](https://github.com/DVDAGames/pgn-tokenizer/compare)
9. Wait for review, approval, and a virtual high five.

## Development Scripts

There are a few scripts in the `scripts` directory that can be useful for development:

- `clean-dataset.py`: Cleans the original weirdness out of the dataset's PGN notation
- `format-dataset-for-training.py`: Formats the cleaned dataset for tokenizer training by adding the special `[g_start]` and `[g_end]` tokens to the beginning and end of each game
- `train.py`: Trains the tokenizer on the formatted dataset and saves the model (a rough sketch of these last two steps follows below)
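
Under the hood, a minimal sketch of what those last two steps amount to with the Hugging Face `tokenizers` library might look like this (the sample games, output file name, and trainer options are illustrative only; the real scripts read the dataset from disk and may configure pre-tokenization and other options differently):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# format-dataset-for-training.py (roughly): wrap each cleaned game in the
# special game-boundary tokens
games = ["1.e4 Nf6 2.e5 Nd5 3.c4 Nb6", "1.d4 d5 2.c4 e6"]
formatted = [f"[g_start]{game}[g_end]" for game in games]

# train.py (roughly): train a BPE model with the target vocabulary size and
# save the resulting tokenizer model
tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(vocab_size=4096, special_tokens=["[g_start]", "[g_end]"])
tokenizer.train_from_iterator(formatted, trainer=trainer)
tokenizer.save("tokenizer.json")
```
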
Binary file added docs/assets/gpt-4-tokenizer.png
Binary file added docs/assets/gpt-4o-tokenizer.png
Binary file added docs/assets/pgn-tokenizer.png
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -25,8 +25,12 @@ build-backend = "hatchling.build"

[dependency-groups]
dev = [
"black>=24.10.0",
"datasets>=3.2.0",
"kagglehub>=0.3.6",
"mypy>=1.14.1",
"polars>=1.20.0",
"pre-commit>=4.1.0",
"ruff>=0.9.3",
"tokenizers>=0.21.0",
]
2 changes: 1 addition & 1 deletion scripts/train.py
@@ -17,7 +17,7 @@
)

TRAINING_DATA_PATH = f"./.data/datasets/{DATASET_NAME}/"
OUTPUT_PATH = "./src/pgn_tokenizer"
OUTPUT_PATH = "./src/pgn_tokenizer/config"

FULL_DATASET_PATH = f"{TRAINING_DATA_PATH}/full"
SAMPLE_DATASET_PATH = f"{TRAINING_DATA_PATH}/sample"
8 changes: 3 additions & 5 deletions src/pgn_tokenizer/__init__.py
@@ -1,15 +1,14 @@
import os
from pathlib import Path

# HACK: suppress the warning about pytorch, jax, et al. from transformers import logging
# because we are only importing a tokenizer and using the transformers library for the
# underlying PreTrainedFastTokenizer functionality
os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "1"

from pathlib import Path

from transformers import PreTrainedTokenizerFast
from transformers import PreTrainedTokenizerFast # noqa: E402

from pgn_tokenizer.constants import DATASET_NAME
from pgn_tokenizer.constants import DATASET_NAME # noqa: E402

base_path = Path(__file__).parent

@@ -23,7 +22,6 @@ def __init__(self):
        self.tokenizer = PreTrainedTokenizerFast(
            tokenizer_file=str(tokenizer_config_path)
        )
        self.vocab = self.tokenizer.get_vocab()
        self.vocab_size = self.tokenizer.vocab_size
        self.encode = self.tokenizer.encode
        self.decode = self.tokenizer.decode
