Tok

This repository contains the code for the first version of my own tokenizer, called Tok, based on Byte Pair Encoding (BPE).

What is a Tokenizer?

A tokenizer is a fundamental component in natural language processing that breaks down raw text into smaller units called tokens. These tokens can be words, subwords, or characters, and are converted into numerical representations that machine learning models can process. The quality of tokenization significantly impacts model performance, affecting everything from training efficiency to the model's ability to understand context and generate coherent text.
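As a rough illustration of the BPE idea behind Tok, the sketch below repeatedly merges the most frequent adjacent pair of symbols into a new token. The corpus string and the number of merge steps are made up for the example; Tok-1 itself is built with byte-level BPE via Hugging Face's tokenizers library, not this toy character-level loop.

from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent pairs and return the most common one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy corpus: start from characters and apply a few merge steps.
tokens = list("low lower lowest")
for _ in range(3):
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)
    print(pair, "->", tokens)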

Version History

  • Tok-1: Currently the latest (and only) version of Tok, based on BPE with byte-level encoding and a custom <|EOS|> token.

How to use Tok?

Using Tok in your project is simple: download the JSON file of Tok-1 directly from GitHub, or fetch it from the terminal with

curl -O https://raw.githubusercontent.com/gianndev/Tok/master/tok-1/tok1.json

and then you can load it in your project with Hugging Face's tokenizers library by adding the following to your code:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("path/to/tok1.json")
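
From there the tokenizer works like any other tokenizers object. As a quick check (the sample sentence is arbitrary, and the exact tokens and ids depend on Tok-1's vocabulary):

encoding = tokenizer.encode("Hello, world!")
print(encoding.tokens)                  # subword tokens produced by Tok-1
print(encoding.ids)                     # the corresponding integer ids
print(tokenizer.decode(encoding.ids))   # back to (roughly) the original text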

License

This project is licensed under the terms of the MIT License. See the LICENSE file for details.
