This repository contains the code for the first version of my own tokenizer, Tok, based on Byte Pair Encoding (BPE).
A tokenizer is a fundamental component in natural language processing that breaks down raw text into smaller units called tokens. These tokens can be words, subwords, or characters, and are converted into numerical representations that machine learning models can process. The quality of tokenization significantly impacts model performance, affecting everything from training efficiency to the model's ability to understand context and generate coherent text.
- Tok-1: currently the latest (and only) version of Tok, based on BPE with byte-level encoding and a custom `<|EOS|>` token.
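BPE builds its vocabulary by repeatedly merging the most frequent adjacent pair of tokens, and byte-level encoding means the base vocabulary is the 256 possible byte values, so any UTF-8 text can be tokenized without unknown tokens. A minimal sketch of one merge step in plain Python (illustrative only; the function names are hypothetical and this is not Tok-1's actual training code):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most common one."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Byte-level base vocabulary: the raw bytes of the text (ids 0-255).
ids = list("banana".encode("utf-8"))  # [98, 97, 110, 97, 110, 97]
pair = most_frequent_pair(ids)        # ("a", "n") as bytes: (97, 110)
ids = merge(ids, pair, 256)           # the first merged token gets id 256
```

Repeating this merge step until a target vocabulary size is reached is, in essence, how a BPE vocabulary like Tok-1's is learned.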
Using Tok in your project is simple: download the JSON file of Tok-1 directly from GitHub, or run the following in a terminal:

```shell
curl -O https://raw.githubusercontent.com/gianndev/Tok/master/tok-1/tok1.json
```
Then you can use it in your project with Hugging Face's `tokenizers` library by adding to your code:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("path/to/tok1.json")
```
This project is licensed under the terms of the MIT License. See the LICENSE file for details.