Provide byte-level offsets for effective alignment in Cross-Tokenizer On-Policy Distillation #1881

@JqzChandler

Description

Apologies for submitting a PR a few hours ago without opening an issue first. I'm coming back now to file this issue.

Motivation: During cross-tokenizer on-policy distillation, we need to align two different encodings (produced by two different tokenizers) of the same text, which is rolled out by the student model. To implement this efficiently, we would like the encode() function to expose byte-level offset information, so that the alignment method only needs to call encode() once.

This scenario requires byte-level offsets rather than character-level offsets to achieve the most robust cross-tokenizer alignment.
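To illustrate, here is a minimal sketch of the alignment idea. The offset tuples, helper names, and the two example tokenizations below are all hypothetical (they are not produced by the tokenizers library); the point is that once character-level offsets are converted to UTF-8 byte offsets, spans from two different tokenizers can be matched by byte overlap even when multibyte characters shift the positions:

```python
def char_to_byte_offsets(text, char_offsets):
    """Convert (start, end) character offsets into UTF-8 byte offsets."""
    # Cumulative UTF-8 byte length of each character prefix.
    prefix = [0]
    for ch in text:
        prefix.append(prefix[-1] + len(ch.encode("utf-8")))
    return [(prefix[s], prefix[e]) for s, e in char_offsets]

def align_by_byte_spans(spans_a, spans_b):
    """Pair up token indices from two tokenizations whose byte spans overlap."""
    pairs = []
    for i, (sa, ea) in enumerate(spans_a):
        for j, (sb, eb) in enumerate(spans_b):
            if max(sa, sb) < min(ea, eb):  # non-empty overlap
                pairs.append((i, j))
    return pairs

text = "héllo world"  # 'é' occupies 2 bytes in UTF-8
# Hypothetical character-level offsets from two different tokenizers:
offsets_a = [(0, 5), (5, 11)]          # ["héllo", " world"]
offsets_b = [(0, 2), (2, 5), (5, 11)]  # ["hé", "llo", " world"]

bytes_a = char_to_byte_offsets(text, offsets_a)
bytes_b = char_to_byte_offsets(text, offsets_b)
print(bytes_a)                           # [(0, 6), (6, 12)]
print(align_by_byte_spans(bytes_a, bytes_b))  # [(0, 0), (0, 1), (1, 2)]
```

If encode() returned byte offsets directly, the char_to_byte_offsets conversion (an extra pass over the text per call) would be unnecessary.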

According to my understanding of the current tokenizers project, the Rust code already supports this functionality but does not expose it through the Python encode() method:

```rust
fn encode(
    &self,
    sequence: &Bound<'_, PyAny>,
    pair: Option<&Bound<'_, PyAny>>,
    is_pretokenized: bool,
    add_special_tokens: bool,
) -> PyResult<PyEncoding> {
```

More description at:
huggingface/trl#4393

PR:
#1880
