When using the EyeCLIP model, I found that the text encoder in `eyeclip_visual.pt` has issues:

- The computed similarity between any two text embeddings is always 1, i.e. every input text maps to the same embedding.
- It looks as if the text-encoder weights in the checkpoint are incomplete or corrupted.
Load the model with the official EyeCLIP code:

```python
import eyeclip
import torch

device = "cuda"
eyeclip_model, eyeclip_preprocess = eyeclip.load("ViT-B/32", device=device, jit=False)

# Load the released weights
weights_path = "./eyeclip_visual.pt"
eyeclip_model.load_state_dict(torch.load(weights_path, map_location=device))
eyeclip_model.eval()
```
Test text similarity:

```python
# Tokenize first (mirroring clip.tokenize); encode_text does not accept raw strings
tokens = eyeclip.tokenize(["hello", "world"]).to(device)
with torch.no_grad():
    text_features = eyeclip_model.encode_text(tokens)
# Normalize so the matrix below is a cosine similarity
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = text_features @ text_features.T
print(similarity)
```
The output is always:

```
tensor([[1., 1.],
        [1., 1.]])
```
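
To check whether the checkpoint actually contains trained text-encoder weights, it may help to inspect the state dict directly. A minimal sketch, assuming the file holds a plain state dict (as the `load_state_dict` call above implies) with the standard OpenAI CLIP key layout:

```python
import torch

# Inspect which submodules the checkpoint covers; the key prefixes below
# follow the stock OpenAI CLIP state dict layout (an assumption for EyeCLIP).
sd = torch.load("./eyeclip_visual.pt", map_location="cpu")

visual_keys = [k for k in sd if k.startswith("visual.")]
text_keys = [k for k in sd if k.startswith(
    ("transformer.", "token_embedding.", "positional_embedding",
     "ln_final.", "text_projection"))]
print(f"visual keys: {len(visual_keys)}, text-encoder keys: {len(text_keys)}")

# Missing, all-zero, or identical text-encoder tensors would explain why
# every text collapses to the same embedding.
```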
**Expected behavior**
The text encoder should produce distinguishable embeddings for different texts.
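
For comparison, the stock OpenAI CLIP weights give clearly distinct embeddings for these two inputs. A quick baseline sketch using the upstream `clip` package (the exact similarity values will vary):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device, jit=False)

tokens = clip.tokenize(["hello", "world"]).to(device)
with torch.no_grad():
    feats = model.encode_text(tokens)
feats = feats / feats.norm(dim=-1, keepdim=True)

# Diagonal is 1 by construction; off-diagonal entries should be well below 1.
print(feats @ feats.T)
```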
Ideally, please provide the full weights (`./CLIP_ft_all_key_06-30-1427`) to replace the current `eyeclip_visual.pt`.