[Tokenizers] add max_lengh parametrisation to encode #1518

Draft
wants to merge 1 commit into
base: master

pavel-esir
Contributor

Works together with the tokenizer changes in openvinotoolkit/openvino_tokenizers#362
Ticket: CVS-157356

@pavel-esir added the "enhancement" label Jan 9, 2025
@pavel-esir added this to the 2025.0 milestone Jan 9, 2025
@github-actions bot added the "category: tokenizers", "category: Python API" and "category: GenAI C++ API" labels Jan 9, 2025
return {std::numeric_limits<int32_t>::max(), std::numeric_limits<int32_t>::max()};
} else if (padding_mode == "max_length") {
return {std::numeric_limits<int32_t>::max(), max_length};
} else if (padding_mode == "do_not_pad") {

Why is this a string and not an enum?
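
To make the question concrete, here is a minimal Python sketch of the same branch logic with an enumeration in place of free-form strings. `PaddingMode` and `resolve_lengths` are hypothetical names for illustration, not part of the PR or of the openvino_genai API. It assumes the first branch in the diff corresponds to a "longest" mode, and it returns the same pairs as the diff without assigning a meaning to each element.

```python
from enum import Enum
from typing import Tuple

INT32_MAX = 2**31 - 1  # value of std::numeric_limits<int32_t>::max()


class PaddingMode(Enum):
    """Hypothetical closed set of modes; values reuse the strings from the diff."""
    LONGEST = "longest"        # assumption: the mode handled by the first branch
    MAX_LENGTH = "max_length"
    DO_NOT_PAD = "do_not_pad"


def resolve_lengths(mode: PaddingMode, max_length: int) -> Tuple[int, int]:
    """Return the same pair of limits as the branches visible in the diff."""
    if mode is PaddingMode.MAX_LENGTH:
        return (INT32_MAX, max_length)
    # Per the diff comment, DO_NOT_PAD behaves exactly as LONGEST.
    return (INT32_MAX, INT32_MAX)
```

With an enum, a misspelled mode such as `PaddingMode("max_lenght")` raises `ValueError` when the value is constructed, whereas a misspelled string would only be noticed if the final `else` branch rejects it.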

} else if (padding_mode == "max_length") {
return {std::numeric_limits<int32_t>::max(), max_length};
} else if (padding_mode == "do_not_pad") {
// bahves exactly as longest

Suggested change
- // bahves exactly as longest
+ // behaves exactly as longest

# Calling encode with 'add_special_tokens' will set state flag.
ov_res = genai_tokenzier.encode(prompt, add_special_tokens=add_special_tokens, max_length=max_length, padding_mode=pad_mode).input_ids.data
hf_res = hf_tokenizer(prompt, return_tensors="np", add_special_tokens=add_special_tokens, max_length=max_length, padding=pad_mode)["input_ids"]
assert np.all(ov_res == hf_res)

Do we need to test the attention mask as well?
Is the tail of attention_mask filled with zeros when padding is applied?
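
As an illustration of the property being asked about, the sketch below pads a token list on the right and checks that the mask is ones over the real tokens and zeros over the padded tail. It uses plain Python lists. `pad_right` and `mask_tail_is_zero` are hypothetical helpers, and right-side padding with pad id 0 is an assumption, not something the PR states.

```python
from typing import List, Tuple


def pad_right(ids: List[int], max_length: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Right-pad token ids to max_length and build the matching attention mask."""
    n_pad = max(0, max_length - len(ids))
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask


def mask_tail_is_zero(attention_mask: List[int], n_real: int) -> bool:
    """True if the first n_real entries are 1 and every entry after them is 0."""
    head_ok = all(m == 1 for m in attention_mask[:n_real])
    tail_ok = all(m == 0 for m in attention_mask[n_real:])
    return head_ok and tail_ok


ids, mask = pad_right([5, 6, 7], max_length=6)
print(ids, mask, mask_tail_is_zero(mask, n_real=3))
```

In the test itself, the analogous check would presumably compare the attention mask returned by `encode` against `hf_tokenizer(...)["attention_mask"]` in the same way `input_ids` are compared. That assumes the encode result exposes an attention mask next to `input_ids`.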

@ilya-lavrenov ilya-lavrenov changed the title add max_lengh parametrisation to encode [Tokenizers] add max_lengh parametrisation to encode Jan 11, 2025
@andrei-kochin andrei-kochin modified the milestones: 2025.0, 2025.1 Jan 13, 2025