add support for utf8 input and output #46

Open
abdulkareemnalband opened this issue Sep 13, 2024 · 2 comments

Comments

@abdulkareemnalband

Add support for UTF-8 input and output.
Proposed API:

public List<int> EncodeFromUtf8(ReadOnlySpan<byte> lineToEncode, ISet<ReadOnlySpan<byte>> allowedSpecial = null, ISet<ReadOnlySpan<byte>> disallowedSpecial = null);
public byte[] DecodeToUtf8(IEnumerable<int> inputTokensToDecode);
@dmitry-brazhenko
Owner

dmitry-brazhenko commented Sep 14, 2024

Hi!

Thanks for reaching out.
Why do you need this?

@abdulkareemnalband
Author

We replace certain tokens with alternative tokens in the OpenAI request and restore them in the response. We adopted this approach to reduce the total number of tokens used.

To do this we use the SharpToken library. However, we have run into an encoding issue: the OpenAI API accepts and returns UTF-8, whereas our replacements cause discrepancies when mapped onto C# UTF-16 strings.
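
The thread does not say exactly where the discrepancy shows up. One common cause, assumed here for illustration, is that a byte-level BPE token does not have to end on a UTF-8 character boundary, so turning each token's bytes into a C# string separately loses information. The sketch below uses only `System.Text.Encoding` and a hand-picked split point (not a real SharpToken tokenization) to show the effect:

```csharp
using System;
using System.Text;

// "é" is a single UTF-16 char in C#, but two bytes in UTF-8: 0xC3 0xA9.
byte[] utf8 = Encoding.UTF8.GetBytes("é");

// Assumption for illustration: a token boundary falls between the two bytes.
string left = Encoding.UTF8.GetString(utf8, 0, 1);
string right = Encoding.UTF8.GetString(utf8, 1, 1);

// Each half is invalid UTF-8 on its own, so each decodes to U+FFFD.
// Re-encoding the concatenated strings does not give back the original bytes.
byte[] roundTrip = Encoding.UTF8.GetBytes(left + right);

Console.WriteLine(BitConverter.ToString(utf8));      // C3-A9
Console.WriteLine(BitConverter.ToString(roundTrip)); // EF-BF-BD-EF-BF-BD
```

Working on `byte[]` or `ReadOnlySpan<byte>` end to end, as the proposed API would allow, avoids this lossy step.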

As a temporary workaround, we extract the BytePairEncodingCore from GptEncoding using reflection and invoke its DecodeNative function, which gives the expected results.
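
A minimal sketch of what such a reflection workaround might look like. Only the names `GptEncoding`, `BytePairEncodingCore`, and `DecodeNative` come from the comment above; the helper `DecodeToBytes` is hypothetical, and the field lookup by type name, the single `int[]`-compatible parameter, and the `byte[]` return type are all assumptions that may not match the library's actual internals:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical helper, not part of SharpToken: returns the raw bytes for a token
// sequence by calling DecodeNative on the encoding's internal BytePairEncodingCore.
static byte[] DecodeToBytes(object gptEncoding, IEnumerable<int> tokens)
{
    const BindingFlags flags =
        BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;

    // The thread does not name the field, so find it by the name of its type.
    FieldInfo coreField = gptEncoding.GetType().GetFields(flags)
        .FirstOrDefault(f => f.FieldType.Name == "BytePairEncodingCore")
        ?? throw new MissingFieldException("No BytePairEncodingCore field found.");

    object core = coreField.GetValue(gptEncoding)
        ?? throw new InvalidOperationException("BytePairEncodingCore field is null.");

    // Assumption: DecodeNative takes one argument to which an int[] can be passed.
    MethodInfo decodeNative = core.GetType().GetMethods(flags)
        .FirstOrDefault(m => m.Name == "DecodeNative" && m.GetParameters().Length == 1)
        ?? throw new MissingMethodException("No single-argument DecodeNative method found.");

    var result = decodeNative.Invoke(core, new object[] { tokens.ToArray() });

    // Assumption: the method returns the decoded bytes as a byte[].
    return result as byte[]
        ?? throw new InvalidOperationException("DecodeNative did not return a byte[].");
}
```

The helper would be called with a `GptEncoding` instance and the tokens to decode. Because it depends on non-public members, it can break on any library update, which is the motivation for a public `DecodeToUtf8`.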
