Description
To Reproduce
```python
from keras_hub import models

tokenizer = models.Tokenizer.from_preset(
    "clip_vit_h_14_laion2b_s32b_b79k",
    sequence_length=77,
    pad_with_end_token=True,
)
preprocessor = models.CLIPPreprocessor(tokenizer, sequence_length=77)
preprocessor(["a cat sitting on the table"])
```
which returns:
```
{'token_ids': <tf.Tensor: shape=(1, 77), dtype=int32, numpy=
array([[49406,   320,  2368,  4919,   525,   518,  2175,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0, 49407]], dtype=int32)>,
 'padding_mask': <tf.Tensor: shape=(1, 77), dtype=bool, numpy=
array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True]])>}
```
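For reference, the ids involved can be sanity-checked on the tokenizer itself. This is a quick sketch, assuming the tokenizer exposes the usual keras_hub `id_to_token` / `token_to_id` methods and that CLIP's end token string is `"<|endoftext|>"`:

```python
# Hypothetical sanity check of the ids in the output above (assumes
# id_to_token / token_to_id are available, as on other keras_hub BPE tokenizers).
print(tokenizer.id_to_token(0))                # "!"   -- the value used for padding
print(tokenizer.token_to_id("<|endoftext|>"))  # 49407 -- the end token id
```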
This is surprising for a few reasons. First, even though `pad_with_end_token=True`, the padding uses token id 0 (which corresponds to `!` in this vocabulary). Second, the end token is appended after the padding instead of right after the original sequence. Finally, `padding_mask` is all `True`, while I would expect it to be `False` at the padding positions.
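To make the expectation concrete, here is a sketch of the output I would expect for the same input (illustrative values built by hand, not produced by any API):

```python
import numpy as np

# Expected: padding filled with the end token id (49407) since
# pad_with_end_token=True, the end token placed right after the last real
# token, and padding_mask False at the padded positions.
expected_token_ids = np.full((1, 77), 49407, dtype="int32")
expected_token_ids[0, :8] = [49406, 320, 2368, 4919, 525, 518, 2175, 49407]

expected_padding_mask = np.zeros((1, 77), dtype=bool)
expected_padding_mask[0, :8] = True
```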
Additional context
Using `keras_hub==0.18.1` and `keras==3.7.0`.