This repository contains the code for the first version of my own tokenizer, Tok, based on Byte Pair Encoding (BPE).
A tokenizer is a fundamental component in natural language processing that breaks down raw text into smaller units called tokens. These tokens can be words, subwords, or characters, and are converted into numerical representations that machine learning models can process. The quality of tokenization significantly impacts model performance, affecting everything from training efficiency to the model's ability to understand context and generate coherent text.
- Tok-1: currently the latest (and only) version of Tok, based on BPE with byte-level encoding and a custom `<|EOS|>` token.
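BPE builds its vocabulary by repeatedly merging the most frequent adjacent pair of tokens, and byte-level encoding means the base vocabulary is the 256 possible byte values, so any UTF-8 text can be tokenized without unknown tokens. A minimal sketch of one merge step in plain Python (illustrative only; the function names are hypothetical and this is not Tok-1's actual training code):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most common one."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Byte-level base vocabulary: the raw bytes of the text (ids 0-255).
ids = list("banana".encode("utf-8"))  # [98, 97, 110, 97, 110, 97]
pair = most_frequent_pair(ids)        # ("a", "n") as bytes: (97, 110)
ids = merge(ids, pair, 256)           # the first merged token gets id 256
```

Repeating this merge step until a target vocabulary size is reached is, in essence, how a BPE vocabulary like Tok-1's is learned.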
Using Tok in your project is simple: download the JSON file of Tok-1 directly from GitHub, or run the following in a terminal:

```shell
curl -O https://raw.githubusercontent.com/gianndev/Tok/master/tok-1/tok1.json
```
Then you can use it in your project with Hugging Face's `tokenizers` library by adding to your code:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("path/to/tok1.json")
```
This project is licensed under the terms of the MIT License. See the LICENSE file for details.