A Python package that builds a corpus vocabulary using byte pair encoding (BPE) and provides a tokenizer that splits input text into subword units based on the built vocabulary.
nlp natural-language-processing tokenizer vocabulary nlp-library vocabulary-builder natural-language-understanding subword-units bpe bytepairencoding subwordtokenization subwordtokens
Updated May 21, 2020 - Python
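
To illustrate the byte pair approach named in the description, here is a minimal sketch of learning merges from a corpus and then tokenizing a word with them. It is not this package's API: the function names (`learn_bpe`, `tokenize`, `merge_symbols`), the `</w>` end-of-word marker, and whitespace splitting are all assumptions made for illustration, and the sketch works on characters rather than raw bytes.

```python
from collections import Counter


def merge_symbols(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-separated corpus."""
    # Each word starts as a sequence of characters plus a hypothetical end-of-word marker.
    vocab = Counter()
    for word, freq in Counter(corpus.split()).items():
        vocab[tuple(word) + ("</w>",)] += freq

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Take the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_symbols(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split one word into subword units by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols


if __name__ == "__main__":
    rules = learn_bpe("low low low lower", num_merges=3)
    print(rules)
    print(tokenize("lowest", rules))
```

Replaying merges in the order they were learned keeps tokenization consistent with the vocabulary that was built; a real implementation would typically also handle byte-level input, unknown symbols, and faster pair lookups.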