Description
Apology: Sorry for submitting a PR a few hours ago without opening an issue first. I am opening this issue now to give the context.
Motivation: For cross-tokenizer on-policy distillation, we need to align two different encodings (produced by two tokenizers) of the same text, which is rolled out by the student model. To implement this efficiently, we would like encode() to return byte-level offset information, so that the alignment method only needs to call encode() once per tokenizer.
This scenario requires byte-level offsets rather than character-level offsets, because byte positions give the most compatible basis for cross-tokenizer alignment.
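To make the use case concrete, here is a minimal sketch of how byte offsets could drive the alignment. It is only an illustration, not the implementation from the PR: the function name `align_by_byte_boundaries` and the span values are hypothetical, and it assumes both tokenizers produce contiguous `(start, end)` byte spans covering the same text.

```python
def align_by_byte_boundaries(spans_a, spans_b):
    """Group token indices of two tokenizations into chunks that end on a shared byte boundary.

    Assumption: both span lists are contiguous and cover the same byte range.
    """
    chunks = []
    i = j = 0              # current token index in each tokenization
    start_i = start_j = 0  # first token of the chunk being built
    while i < len(spans_a) and j < len(spans_b):
        end_a, end_b = spans_a[i][1], spans_b[j][1]
        if end_a == end_b:
            # Both tokenizations end a token at the same byte: close the chunk.
            chunks.append((list(range(start_i, i + 1)), list(range(start_j, j + 1))))
            i += 1
            j += 1
            start_i, start_j = i, j
        elif end_a < end_b:
            i += 1  # tokenizer A is behind, consume its next token
        else:
            j += 1  # tokenizer B is behind, consume its next token
    return chunks


# Hypothetical byte spans for the same 6-byte text from two tokenizers.
spans_a = [(0, 2), (2, 5), (5, 6)]
spans_b = [(0, 5), (5, 6)]
chunks = align_by_byte_boundaries(spans_a, spans_b)
```

Each chunk pairs the student-side token indices with the teacher-side token indices that cover exactly the same bytes, which is the granularity at which the two models' outputs can be compared.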
As far as I understand the current tokenizers project, the Rust code already supports this functionality, but it is not exposed through the Python encode() method:
tokenizers/bindings/python/src/tokenizer.rs
Lines 1070 to 1076 in 007fc76
```rust
fn encode(
    &self,
    sequence: &Bound<'_, PyAny>,
    pair: Option<&Bound<'_, PyAny>>,
    is_pretokenized: bool,
    add_special_tokens: bool,
) -> PyResult<PyEncoding> {
```
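For comparison, the workaround available without this feature is to convert character-level offsets to byte offsets on the Python side. The sketch below shows that conversion in plain Python. It assumes the offsets you have are character-level `(start, end)` pairs, and the helper name `char_to_byte_offsets` and the sample spans are made up for illustration.

```python
def char_to_byte_offsets(text, char_offsets):
    """Convert character-level (start, end) spans into UTF-8 byte-level spans."""
    # prefix[i] = number of UTF-8 bytes in text[:i]
    prefix = [0]
    for ch in text:
        prefix.append(prefix[-1] + len(ch.encode("utf-8")))
    return [(prefix[start], prefix[end]) for start, end in char_offsets]


text = "héllo 世界"
# Hypothetical character spans: "héllo", " ", "世界"
char_offsets = [(0, 5), (5, 6), (6, 8)]
byte_offsets = char_to_byte_offsets(text, char_offsets)
```

This costs an extra pass over the text for every rollout, which is the overhead that returning byte offsets directly from encode() would avoid.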
More description at:
huggingface/trl#4393
PR:
#1880