
Commit 211f5d8

feat: pretrained version 1.2.1
1 parent 7d82651 commit 211f5d8

File tree

2 files changed (+26 -4 lines)

README.md

Lines changed: 25 additions & 3 deletions
@@ -1,6 +1,6 @@
 # bpetokenizer

-A Byte Pair Encoding (BPE) tokenizer, which algorithmically follows along the GPT tokenizer. The tokenizer is capable of handling special tokens and uses a customizable regex pattern for tokenization(includes the gpt4 regex pattern). supports `save` and `load` tokenizers in the `json` and `file` format.
+A Byte Pair Encoding (BPE) tokenizer that algorithmically follows along the GPT tokenizer (tiktoken) and lets you train your own tokenizer. The tokenizer handles special tokens and uses a customizable regex pattern for tokenization (including the gpt4 regex pattern). It supports `save` and `load` of tokenizers in `json` and `file` format. The `bpetokenizer` also supports [pretrained](bpetokenizer/pretrained/) tokenizers.


 ### Overview
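
The new description above highlights training your own tokenizer and saving/loading it in `json` or `file` format. A minimal sketch of that flow, assuming a `train`/`save`/`load` API whose exact argument names are not confirmed by this diff (see `bpetokenizer/tokenizer.py` and the [sample](sample/) scripts for the real signatures):

```py
from bpetokenizer import BPETokenizer

# Hypothetical training flow: the argument names below (special_tokens, vocab_size, mode)
# are assumptions for illustration, not the library's documented API.
tokenizer = BPETokenizer(special_tokens={"<|endoftext|>": 1001})
text = open("training_corpus.txt", encoding="utf-8").read()  # hypothetical corpus file
tokenizer.train(text, vocab_size=500, verbose=True)

tokenizer.save("my_tokenizer", mode="json")       # persist vocab/merges/special_tokens
tokenizer.load("my_tokenizer.json", mode="json")  # restore the same tokenizer later
```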
@@ -31,7 +31,7 @@ Every LLM(LLama, Gemini, Mistral..) use their own Tokenizers trained on their ow

 2. [BPETokenizer](bpetokenizer/tokenizer.py): This class emphasizes the real power of the tokenizer (used in the gpt4 tokenizer.. [tiktoken](https://github.com/openai/tiktoken)). It uses the `GPT4_SPLIT_PATTERN` to split the text as mentioned in the gpt4 tokenizer, also handles the `special_tokens` (refer [sample_bpetokenizer](sample/bpetokenizer/sample_bpetokenizer.py)), and inherits the `save` and `load` functionalities to save and load the tokenizer respectively.

-3. [PreTrained Tokenizer](pretrained/wi17k_base.json): PreTrained Tokenizer wi17k_base, has a 17316 vocabulary. trained with the wikitext dataset (len: 1000000). with 6 special_tokens.
+3. [PreTrained Tokenizer](bpetokenizer/pretrained/wi17k_base): The pretrained tokenizer `wi17k_base` has a vocabulary of 17316 tokens, trained on the wikitext dataset (len: 1000000), with 6 special_tokens.


 ### Usage
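
The key idea in item 2 above is that text is first chunked by a regex pattern and BPE merges never cross chunk boundaries. A small illustration of that pre-splitting step, using a GPT-2-style pattern as a stand-in (the library's actual `GPT4_SPLIT_PATTERN` lives in `bpetokenizer/tokenizer.py` and differs in its details):

```py
import regex as re  # the GPT split patterns use \p{...} classes, which need `regex`, not `re`

# simplified GPT-2-style stand-in: contractions, letters, numbers, punctuation, whitespace
SIMPLE_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

text = "Hello world, it's 2024! BPE merges never cross these chunk boundaries."
chunks = re.findall(SIMPLE_SPLIT_PATTERN, text)
print(chunks)  # e.g. ['Hello', ' world', ',', ' it', "'s", ' 2024', '!', ...]
```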
@@ -121,6 +121,28 @@ print("tokens: ", tokens)
 ```
 refer to [load_json_vocab](sample/load_json_vocab/) and run the `bpetokenizer_json` to get an overview of `vocab`, `merges`, and `special_tokens`; to view the tokens that are split by the tokenizer using the pattern, look at [tokens](sample/load_json_vocab/tokens.py)

+
+#### To load the pretrained tokenizers
+
+```py
+from bpetokenizer import BPETokenizer
+
+tokenizer = BPETokenizer.from_pretrained("wi17k_base", verbose=True)
+
+texts = """
+def get_stats(tokens, counts=None) -> dict:
+    "Get statistics of the tokens. Includes the frequency of each consecutive pair of tokens"
+    counts = {} if counts is None else counts
+    for pair in zip(tokens, tokens[1:]):
+        counts[pair] = counts.get(pair, 0) + 1
+    return counts
+"""
+tokenizer.tokens(texts, verbose=True)
+
+```
+
+for now, we only have a single 17k vocab tokenizer `wi17k_base` at [pretrained](/bpetokenizer/pretrained/)
+
+
 ### Run Tests

 the tests folder `tests/` includes the tests of the tokenizer and uses pytest.
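
The `get_stats` function embedded in the example text above is the heart of BPE training: count adjacent token pairs, then merge the most frequent pair into a new token id and repeat. A self-contained sketch of one such iteration (illustrative only, not the code from `bpetokenizer/tokenizer.py`; the `merge` helper and the id 256 are chosen for the example):

```py
def get_stats(tokens, counts=None) -> dict:
    "Count the frequency of each consecutive pair of tokens."
    counts = {} if counts is None else counts
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(tokens, pair, new_id):
    "Replace every occurrence of `pair` in `tokens` with the single id `new_id`."
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("aaabdaaabac".encode("utf-8"))  # raw UTF-8 bytes as integer ids
stats = get_stats(tokens)
top_pair = max(stats, key=stats.get)   # most frequent adjacent pair, here (97, 97)
tokens = merge(tokens, top_pair, 256)  # mint a fresh id (>255) for the merged pair
print(top_pair, tokens)
```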
@@ -138,7 +160,7 @@ Contributions to the BPE Tokenizer are most welcomed! If you would like to contr

 - Star and Fork the repository.
 - Create a new branch (git checkout -b feature/your-feature).
-- Commit your changes (git commit -am 'Add some feature').
+- Commit your changes (git commit -m 'Add some feature').
 - Push to the branch (git push origin feature/your-feature).
 - Create a new Pull Request.

setup.py

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 setup(
     name="bpetokenizer",
     version=__version__,
-    description="Byte Pair Encoding Tokenizer with special tokens and regex pattern",
+    description="A Byte Pair Encoding (BPE) tokenizer, which algorithmically follows along the GPT tokenizer (tiktoken) and allows you to train your own tokenizer. The tokenizer handles special tokens and uses a customizable regex pattern for tokenization (includes the gpt4 regex pattern). Supports `save` and `load` of tokenizers in `json` and `file` format, as well as pretrained tokenizers.",
     long_description=long_description,
     long_description_content_type="text/markdown",
     url="https://github.com/Hk669/bpetokenizer",
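
With the metadata above, the pretrained tokenizer added in this release can be exercised end to end. A short usage sketch, assuming the package is published to PyPI under the `name` given in `setup()` and reusing only the `from_pretrained` and `tokens` calls shown in the README diff:

```py
# assumes: pip install bpetokenizer
from bpetokenizer import BPETokenizer

# wi17k_base is the pretrained vocab shipped in this release; verbose=True mirrors the README example
tokenizer = BPETokenizer.from_pretrained("wi17k_base", verbose=True)
tokenizer.tokens("hello, bpetokenizer!", verbose=True)
```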

0 commit comments
