Provide byte-level offsets for effective alignment in Cross-Tokenizer On-Policy Distillation #1881

@JqzChandler

Description

Apologies for submitting a PR a few hours ago without opening an issue first. I'm coming back now to file this issue.

Motivation: During cross-tokenizer on-policy distillation, we need to align two different encodings (produced by two different tokenizers) of the same text, which is rolled out by the student model. To implement this efficiently, we would like the encode() function to expose byte-level offset information, so that the alignment method only needs to call encode() once.

This scenario requires byte-level offsets rather than character-level offsets to achieve the most robust cross-tokenizer alignment.
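To illustrate, here is a minimal sketch of the alignment idea. The offset tuples, helper names, and the two example tokenizations below are all hypothetical (they are not produced by the tokenizers library); the point is that once character-level offsets are converted to UTF-8 byte offsets, spans from two different tokenizers can be matched by byte overlap even when multibyte characters shift the positions:

```python
def char_to_byte_offsets(text, char_offsets):
    """Convert (start, end) character offsets into UTF-8 byte offsets."""
    # Cumulative UTF-8 byte length of each character prefix.
    prefix = [0]
    for ch in text:
        prefix.append(prefix[-1] + len(ch.encode("utf-8")))
    return [(prefix[s], prefix[e]) for s, e in char_offsets]

def align_by_byte_spans(spans_a, spans_b):
    """Pair up token indices from two tokenizations whose byte spans overlap."""
    pairs = []
    for i, (sa, ea) in enumerate(spans_a):
        for j, (sb, eb) in enumerate(spans_b):
            if max(sa, sb) < min(ea, eb):  # non-empty overlap
                pairs.append((i, j))
    return pairs

text = "héllo world"  # 'é' occupies 2 bytes in UTF-8
# Hypothetical character-level offsets from two different tokenizers:
offsets_a = [(0, 5), (5, 11)]          # ["héllo", " world"]
offsets_b = [(0, 2), (2, 5), (5, 11)]  # ["hé", "llo", " world"]

bytes_a = char_to_byte_offsets(text, offsets_a)
bytes_b = char_to_byte_offsets(text, offsets_b)
print(bytes_a)                           # [(0, 6), (6, 12)]
print(align_by_byte_spans(bytes_a, bytes_b))  # [(0, 0), (0, 1), (1, 2)]
```

If encode() returned byte offsets directly, the char_to_byte_offsets conversion (an extra pass over the text per call) would be unnecessary.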

According to my understanding of the current tokenizers project, the Rust code already supports this functionality but does not expose it through the Python encode() method:

```rust
fn encode(
    &self,
    sequence: &Bound<'_, PyAny>,
    pair: Option<&Bound<'_, PyAny>>,
    is_pretokenized: bool,
    add_special_tokens: bool,
) -> PyResult<PyEncoding> {
```

More description at:
huggingface/trl#4393

PR:
#1880
