Nanoscale tokenizer in C++

Nanoscale tokenizer in C++. Currently, the RWKV world tokenizer is implemented.

Features

  • Easy to embed
  • Reads the vocabulary from JSON (via minijson)

Variants

  • Naive trie implementation: rwkv_world_tokenizer_trie.hh
  • Efficient version using hat-trie: rwkv_world_tokenizer_hat.hh
  • Efficient version using cedar: rwkv_world_tokenizer_cedar.hh

If you need to run the tokenizer without C++ exceptions (e.g. in WASM), the naive or cedar version is recommended.
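
To illustrate what the naive variant does conceptually, here is a minimal sketch of a byte-keyed trie with greedy longest-match encoding. It is not the code in rwkv_world_tokenizer_trie.hh: the names NaiveTrie, insert, longest_match and encode are hypothetical, and the sketch assumes the tokenizer picks the longest vocabulary entry at each position. Failure is reported through a bool return value rather than an exception, which is the property that matters for WASM-style builds.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical naive trie keyed on raw bytes (not the repository's class).
class NaiveTrie {
 public:
  // Register a token (raw bytes) under its vocabulary id.
  void insert(const std::string& token, int id) {
    assert(!token.empty());
    Node* node = &root_;
    for (unsigned char c : token) {
      std::unique_ptr<Node>& child = node->children[c];
      if (!child) child.reset(new Node());
      node = child.get();
    }
    node->id = id;
  }

  // Longest token that is a prefix of text[pos..].
  // Returns the matched length in bytes (0 if nothing matches) and
  // writes the id of that token to *id.
  std::size_t longest_match(const std::string& text, std::size_t pos,
                            int* id) const {
    const Node* node = &root_;
    std::size_t best_len = 0;
    for (std::size_t i = pos; i < text.size(); ++i) {
      auto it = node->children.find(static_cast<unsigned char>(text[i]));
      if (it == node->children.end()) break;
      node = it->second.get();
      if (node->id >= 0) {
        best_len = i - pos + 1;
        *id = node->id;
      }
    }
    return best_len;
  }

 private:
  struct Node {
    int id = -1;  // -1 means "no token ends at this node"
    std::map<unsigned char, std::unique_ptr<Node>> children;
  };
  Node root_;
};

// Greedy longest-match encoding. Returns false, without throwing,
// when some position matches no token in the trie.
bool encode(const NaiveTrie& trie, const std::string& text,
            std::vector<int>* out) {
  std::size_t pos = 0;
  while (pos < text.size()) {
    int id = -1;
    std::size_t len = trie.longest_match(text, pos, &id);
    if (len == 0) return false;
    out->push_back(id);
    pos += len;
  }
  return true;
}
```

A std::map per node keeps the sketch short but is memory-hungry, which is presumably why the hat-trie and cedar variants exist.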

Additional features over the original RWKV world tokenizer

  • UTF-8 byte fallback
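
As a rough illustration of what byte fallback can look like, the sketch below emits one id per byte of a UTF-8 character that the vocabulary cannot match. The function names and the byte_id_base parameter are hypothetical, and the mapping from a byte to an id (base plus byte value) is an assumption made for this example; the actual headers may map bytes differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence introduced by `lead`.
// Returns 1 for ASCII and for invalid lead bytes, so callers always advance.
inline std::size_t utf8_len(unsigned char lead) {
  if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;
}

// Hypothetical fallback: append one id per byte of the character at `pos`
// and return how many bytes were consumed. Requires pos < text.size().
// `byte_id_base` is an assumed offset at which byte ids start.
std::size_t byte_fallback(const std::string& text, std::size_t pos,
                          int byte_id_base, std::vector<int>* out) {
  std::size_t n = utf8_len(static_cast<unsigned char>(text[pos]));
  if (pos + n > text.size()) n = text.size() - pos;  // truncated sequence
  for (std::size_t i = 0; i < n; ++i) {
    out->push_back(byte_id_base + static_cast<unsigned char>(text[pos + i]));
  }
  return n;
}
```

In an encoder loop, a routine like this would be called when no vocabulary entry matches at the current position, so that encoding can continue instead of failing.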

TODO

  • Make the code free of C++ exceptions

Third-party libraries