
huangwei021230/tokenizer

GPT2Tokenizer

GPT2Tokenizer is a C++ library that implements the GPT-2 tokenizer. It splits text into subword tokens, a preprocessing step for natural language processing tasks such as language modeling and text generation.

Features

  • Tokenizes text into subwords using the GPT-2 tokenizer algorithm (byte-level byte pair encoding, BPE).
  • (ongoing) Support for tokenization options such as lowercasing, truncation, and padding.
  • (ongoing) Methods to convert tokens back to text.
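
To illustrate the core idea behind the first and last features, here is a minimal sketch of the BPE merge loop and its inverse. It is not this library's API: the names `MergeRanks`, `bpe_encode_word`, and `bpe_decode` are hypothetical, and the merge table is assumed to be supplied by the caller. The real GPT-2 tokenizer also applies regex pre-tokenization, a byte-to-unicode mapping, and a vocabulary lookup to produce integer ids, all of which are omitted here.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type: maps an adjacent symbol pair to its merge rank.
// A lower rank means the pair is merged earlier.
using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

// Split a word into single-byte symbols, then repeatedly merge the
// adjacent pair with the lowest rank until no known pair remains.
// Simplification: one occurrence is merged per iteration.
std::vector<std::string> bpe_encode_word(const std::string& word,
                                         const MergeRanks& ranks) {
    std::vector<std::string> symbols;
    for (char c : word) {
        symbols.push_back(std::string(1, c));
    }

    while (symbols.size() > 1) {
        int best_rank = -1;        // -1 means "no mergeable pair found yet"
        std::size_t best_pos = 0;  // index of the left symbol of the best pair
        for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
            auto it = ranks.find(std::make_pair(symbols[i], symbols[i + 1]));
            if (it != ranks.end() &&
                (best_rank < 0 || it->second < best_rank)) {
                best_rank = it->second;
                best_pos = i;
            }
        }
        if (best_rank < 0) {
            break;  // no adjacent pair appears in the merge table
        }
        // Fuse the pair into one symbol and drop the right-hand element.
        symbols[best_pos] += symbols[best_pos + 1];
        symbols.erase(symbols.begin() +
                      static_cast<std::ptrdiff_t>(best_pos + 1));
    }
    return symbols;
}

// Converting subword symbols back to text is plain concatenation
// in this simplified setting.
std::string bpe_decode(const std::vector<std::string>& symbols) {
    std::string text;
    for (const std::string& s : symbols) {
        text += s;
    }
    return text;
}
```

With a merge table that ranks (l, o) first, then (lo, w), then (e, r), the word "lower" is reduced step by step to the two subwords "low" and "er", and `bpe_decode` restores the original string.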

Installation (ongoing)

To use GPT2Tokenizer in your C++ project, follow these steps:

  1. Clone the repository: git clone https://github.com/huangwei021230/tokenizer.git
  2. Build the library using your preferred C++ build system.
  3. Include the necessary header files in your project.
  4. Link against the GPT2Tokenizer library.
