Providing byte level offsets for effective alignment in Cross-Tokenizer On-Policy Distillation #1880
base: main
Conversation
Pull Request Overview
This PR adds an offset_type parameter to the encode and encode_batch methods, allowing users to choose between character-based offsets ("char"), byte-based offsets ("byte"), or no offsets ("none") for faster encoding. The default is "char" to maintain backward compatibility.
- Adds an offset_type parameter to both the Rust and Python encoding methods
- Routes to the appropriate internal methods based on the selected offset type
- Provides input validation with helpful error messages
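The distinction between the two offset types only shows up once multi-byte characters appear: for pure ASCII text, character and byte offsets coincide. As a minimal illustration, independent of the tokenizers library itself, character-level offsets can be mapped to UTF-8 byte-level offsets like this (the helper name is ours, not part of any API):

```python
def char_to_byte_offsets(text, char_offsets):
    """Map (start, end) character offsets into UTF-8 byte offsets."""
    # Cumulative byte position at each character boundary.
    byte_pos = [0]
    for ch in text:
        byte_pos.append(byte_pos[-1] + len(ch.encode("utf-8")))
    return [(byte_pos[s], byte_pos[e]) for s, e in char_offsets]

text = "héllo"                    # 'é' occupies 2 bytes in UTF-8
char_offsets = [(0, 2), (2, 5)]   # e.g. hypothetical tokens "hé" and "llo"
print(char_to_byte_offsets(text, char_offsets))  # [(0, 3), (3, 6)]
```

This per-character pass is exactly the kind of extra work the PR avoids by surfacing the byte offsets the Rust core already computes.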
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| bindings/python/src/tokenizer.rs | Adds offset_type parameter to encode and encode_batch Rust methods with validation and routing logic |
| bindings/python/py_src/tokenizers/implementations/base_tokenizer.py | Adds offset_type parameter to Python wrapper methods with documentation |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
ArthurZucker left a comment:
Thanks! If you want to expose it, it would be better to just add encode_char_offsets to the bindings (it's less breaking, and there's no need for API changes).
If that works for you, `python stub.py` will update the inits.
    pair: Optional[InputSequence] = None,
    is_pretokenized: bool = False,
    add_special_tokens: bool = True,
    offset_type: str = "char",
I'd rather we make it optional!
btw this file is not really something I thought had a lot of usage 😄
WDYT about instead adding some docs in a md file or something, since this already exists?
Related issue: #1881
Our team tried several alignment approaches while implementing our self-developed on-policy distillation method, including trl's existing implementation, which repeatedly calls the decode() method but runs into correctness issues in special cases and carries high computational overhead.
We later found a better approach: a single call to the encode() method yields byte-level offsets for all tokens, which sidesteps BPE's complexity entirely, and byte-level offsets are also compatible with every other kind of tokenizer. Moreover, for distillation between two BPE tokenizers, skipping the string as an intermediate representation gives more accurate alignment.
We therefore hope to merge this small patch, which exposes the byte-level offset calculation already supported in the Rust code to the Python classes.
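The alignment step itself reduces to interval overlap once both tokenizers report byte spans over the same underlying bytes. A sketch of that idea (plain Python; the function name and the spans are illustrative, not the trl implementation):

```python
def align_by_byte_offsets(student_spans, teacher_spans):
    """For each student token span, return indices of teacher tokens
    whose byte spans overlap it."""
    mapping = []
    for s_start, s_end in student_spans:
        overlaps = [
            j for j, (t_start, t_end) in enumerate(teacher_spans)
            if t_start < s_end and s_start < t_end  # non-empty intersection
        ]
        mapping.append(overlaps)
    return mapping

# Two hypothetical tokenizations of the same 10-byte input:
student = [(0, 4), (4, 10)]
teacher = [(0, 2), (2, 4), (4, 7), (7, 10)]
print(align_by_byte_offsets(student, teacher))  # [[0, 1], [2, 3]]
```

Because the comparison happens in byte space, this works unchanged across BPE, WordPiece, or byte-level tokenizers, which is the compatibility argument made above.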
More description at:
huggingface/trl#4393