
Support for gpt-turbo-3.5 cl100k_base encoding #6

Open
deepak-coding-art opened this issue Mar 19, 2023 · 5 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@deepak-coding-art

Does the package support the cl100k_base encoding which is used in ChatGPT?

@syonfox (Owner) commented Mar 20, 2023

hello world
Encoded: [31373,995]

https://github.com/openai/tiktoken/blob/main/tests/test_simple_public.py

This matches the gpt2 encoding schema, so it is probably not the same, and we would need the updated vocab.bpe

and encoder map to support the new version. This is probably also why the encoding length is off. I would say it's probably fine to use for estimation, but I would not rely on it for the more complicated models until we can find and implement the new version that is used for embeddings.

https://news.ycombinator.com/item?id=34008839

Some more info on tokenizers used by OpenAI.
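To illustrate why the IDs depend entirely on the vocab/merge data (so a gpt2-trained table cannot reproduce cl100k_base output for the same text), here is a minimal greedy BPE sketch. The merge table below is toy data for illustration, not taken from any real tokenizer file:

```javascript
// Minimal byte-pair-encoding sketch: the resulting tokens come from
// the merge table, so swapping in a different vocab (gpt2 vs
// cl100k_base) yields different tokens/IDs for the same input.
// The merges below are made-up toy data.
const merges = [["h", "e"], ["l", "l"], ["he", "ll"], ["hell", "o"]];

function bpe(word) {
  let parts = word.split("");
  for (const [a, b] of merges) {
    const merged = [];
    let i = 0;
    while (i < parts.length) {
      if (parts[i] === a && parts[i + 1] === b) {
        merged.push(a + b); // apply this merge rule
        i += 2;
      } else {
        merged.push(parts[i]);
        i += 1;
      }
    }
    parts = merged;
  }
  return parts;
}

console.log(bpe("hello")); // ["hello"] with this toy merge list
```

With a different merge list the same word would split into entirely different pieces, which is why the gpt2 data in this package cannot be reused for cl100k_base.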

@syonfox (Owner) commented Mar 20, 2023

TODO: extract the cl100k data from the tokenizer, and either compile the Rust lib to WebAssembly or reimplement it in C++.

I am going to put this on hold because it's a good enough approximation for front-end user input validation if we add a 5-10% buffer, but it would be nice to have a JS implementation of the new version. If anyone wants to help with that, I would appreciate it. Even just building the Python version and dumping the data to some JSON files ...
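For the front-end validation use case above, the buffer idea could be sketched like this. The 4-characters-per-token rule of thumb and the 10% default margin are just the rough estimates discussed here, not exact values:

```javascript
// Approximate token budget check with a safety buffer, since a
// gpt2-based count can drift from the real cl100k_base count.
// chars/4 is a common rule of thumb for English text; the 10%
// buffer matches the margin suggested in this thread.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function fitsBudget(text, maxTokens, buffer = 0.10) {
  const padded = estimateTokens(text) * (1 + buffer);
  return padded <= maxTokens;
}

console.log(fitsBudget("hello world", 10)); // true: ~3 tokens + 10% fits
```

This is only a guard rail for input validation; the server-side model still does the authoritative count.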

Thanks

@syonfox syonfox added enhancement New feature or request help wanted Extra attention is needed labels Mar 20, 2023
@syonfox (Owner) commented Apr 17, 2023

7013e40

I have added the data from

https://community.openai.com/t/how-do-you-make-a-bpe-file-for-tokenizer/94752/13

https://github.com/blinkdata/c-tokenizer

We still need to process this into JS and add a new Encoder.js.

@dbjpanda

Hi, any update on this please?

@syonfox (Owner) commented Sep 26, 2023

The linked repository has a TS script for loading the vocab files.

It's worth borrowing some of that implementation, as the JS project looks relatively clean; just the tooling is a bit bloated.

The encoding seems to be here:

https://github.com/dqbd/tiktoken/blob/072dd12962cabeca67c5088e3d8a8d006af19482/scripts/ranks.ts#L5
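A minimal sketch of what such a loader does, assuming the published `.tiktoken` rank-file format of one base64-encoded token followed by its rank per line. The two sample lines below are illustrative, not copied from the real cl100k_base file:

```javascript
// Sketch: parse a tiktoken-style rank file into a token -> rank Map.
// Assumed line format: "<base64 token bytes> <rank>".
// The sample data is made up for illustration.
const sample = "aGVsbG8= 0\nd29ybGQ= 1"; // "hello" and "world" in base64

function loadRanks(text) {
  const ranks = new Map();
  for (const line of text.split("\n")) {
    if (!line.trim()) continue;
    const [b64, rank] = line.split(" ");
    // Decode the token bytes; keyed as a binary string for lookup.
    const token = Buffer.from(b64, "base64").toString("binary");
    ranks.set(token, parseInt(rank, 10));
  }
  return ranks;
}

const ranks = loadRanks(sample);
console.log(ranks.get("hello")); // 0
```

Something like this, fed with the real cl100k_base data, would be the starting point for the new Encoder.js mentioned above.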
