CLIPTokenizer does not work as expected #2018

Open
@fdtomasi

Description

To Reproduce

from keras_hub import models

# Build the tokenizer with padding configured to use the end token.
tokenizer = models.Tokenizer.from_preset(
    "clip_vit_h_14_laion2b_s32b_b79k",
    sequence_length=77,
    pad_with_end_token=True,
)
# Wrap it in the CLIP preprocessor and tokenize a sample string.
preprocessor = models.CLIPPreprocessor(tokenizer, sequence_length=77)
preprocessor(["a cat sitting on the table"])

which returns

{'token_ids': <tf.Tensor: shape=(1, 77), dtype=int32, numpy=
 array([[49406,   320,  2368,  4919,   525,   518,  2175,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0, 49407]], dtype=int32)>,
 'padding_mask': <tf.Tensor: shape=(1, 77), dtype=bool, numpy=
 array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True]])>}

This is surprising for a few reasons. First, even though pad_with_end_token=True, padding uses token ID 0 (which corresponds to ! in this vocabulary). Second, the end token is appended after the padding rather than directly after the original sequence. Finally, padding_mask is all True, whereas I would expect it to be False at the padding positions.
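
For comparison, here is a minimal sketch of the packing I would expect, written by hand rather than taken from the library (it assumes token_to_id() resolves the two CLIP special tokens; the IDs 49406/49407 are taken from the output above):

import tensorflow as tf
from keras_hub import models

tokenizer = models.Tokenizer.from_preset("clip_vit_h_14_laion2b_s32b_b79k")
start_id = tokenizer.token_to_id("<|startoftext|>")  # 49406
end_id = tokenizer.token_to_id("<|endoftext|>")      # 49407

ids = tokenizer("a cat sitting on the table")  # unpadded token IDs
body = [start_id] + [int(i) for i in ids] + [end_id]
pad_len = 77 - len(body)

# End token directly after the sequence, padding done with the end token
# (since pad_with_end_token=True), and a mask that is False on padding.
expected_token_ids = tf.constant([body + [end_id] * pad_len])
expected_padding_mask = tf.constant([[True] * len(body) + [False] * pad_len])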

Additional context
Using keras_hub==0.18.1 and keras==3.7.0.

Labels

type:Bug (Something isn't working)
