[Tokenizers] add max_length parametrisation to encode #1518
base: master
    return {std::numeric_limits<int32_t>::max(), std::numeric_limits<int32_t>::max()};
} else if (padding_mode == "max_length") {
    return {std::numeric_limits<int32_t>::max(), max_length};
} else if (padding_mode == "do_not_pad") {
Why is padding_mode a string rather than an enum?
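For illustration, an enum-based variant could look roughly like the sketch below. The names `PaddingMode`, `parse_padding_mode` and `resolve_lengths` are hypothetical and do not come from the PR. The string `"longest"` is an assumption, since the condition of the first branch is not visible in the hunk. The returned pair only mirrors the values shown in the diff, and the meaning of each element is not stated there.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical enum; the names are illustrative and not taken from the PR.
enum class PaddingMode { LONGEST, MAX_LENGTH, DO_NOT_PAD };

// Parse the string once at the API boundary so later code never compares strings.
// "longest" is an assumed spelling; the other two strings appear in the diff.
inline PaddingMode parse_padding_mode(const std::string& mode) {
    if (mode == "longest") return PaddingMode::LONGEST;
    if (mode == "max_length") return PaddingMode::MAX_LENGTH;
    if (mode == "do_not_pad") return PaddingMode::DO_NOT_PAD;
    throw std::invalid_argument("Unknown padding_mode: " + mode);
}

// Returns the same pairs as the branches shown in the diff.
inline std::pair<int32_t, int32_t> resolve_lengths(PaddingMode mode, int32_t max_length) {
    constexpr int32_t no_limit = std::numeric_limits<int32_t>::max();
    switch (mode) {
        case PaddingMode::MAX_LENGTH:
            return {no_limit, max_length};
        case PaddingMode::LONGEST:
        case PaddingMode::DO_NOT_PAD:  // behaves exactly as longest
            return {no_limit, no_limit};
    }
    throw std::logic_error("Unhandled PaddingMode");
}
```

With an enum, a misspelled mode fails at parse time (or at compile time for C++ callers) instead of silently falling through an `else if` chain, and the compiler can warn when a `switch` misses a case.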
} else if (padding_mode == "max_length") {
    return {std::numeric_limits<int32_t>::max(), max_length};
} else if (padding_mode == "do_not_pad") {
    // bahves exactly as longest
Suggested change:
- // bahves exactly as longest
+ // behaves exactly as longest
# Calling encode with 'add_special_tokens' will set state flag.
ov_res = genai_tokenzier.encode(prompt, add_special_tokens=add_special_tokens, max_length=max_length, padding_mode=pad_mode).input_ids.data
hf_res = hf_tokenizer(prompt, return_tensors="np", add_special_tokens=add_special_tokens, max_length=max_length, padding=pad_mode)["input_ids"]
assert np.all(ov_res == hf_res)
Do we need to test the attention mask as well?
Is the tail of attention_mask filled with zeros when padding is applied?
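To make the expected property concrete, here is a minimal standalone sketch of what "tail filled with zeros" would mean. It is not the tokenizer's implementation: `Padded`, `pad_right` and `mask_tail_is_zero` are hypothetical helpers, and right-side padding with a caller-supplied pad id is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one padded sequence and its mask.
struct Padded {
    std::vector<int64_t> input_ids;
    std::vector<int64_t> attention_mask;
};

// Right-pads `ids` up to `target_len`; real tokens get mask 1, padding gets mask 0.
// Sequences already at or above `target_len` are returned unchanged (no truncation).
inline Padded pad_right(const std::vector<int64_t>& ids, std::size_t target_len, int64_t pad_id) {
    Padded out{ids, std::vector<int64_t>(ids.size(), 1)};
    if (ids.size() < target_len) {
        out.input_ids.resize(target_len, pad_id);
        out.attention_mask.resize(target_len, 0);
    }
    return out;
}

// True when the mask is 1 over the first `real_len` positions and 0 afterwards.
inline bool mask_tail_is_zero(const std::vector<int64_t>& mask, std::size_t real_len) {
    for (std::size_t i = 0; i < mask.size(); ++i) {
        if (mask[i] != (i < real_len ? 1 : 0)) return false;
    }
    return true;
}
```

A test along these lines would compare the mask produced by the tokenizer against the reference mask in the same way the existing assertion compares `input_ids`.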
Works together with the tokenizer changes in openvinotoolkit/openvino_tokenizers#362
Ticket: CVS-157356