Paper accepted for CVPR 2026.
arXiv & code soon!
What is the "Intra-Modal Misalignment Hypothesis"?
Recent research has indicated that the embeddings produced by contrastive language-image pre-training methods such as CLIP may not be ideal for image-only tasks. The prevailing explanation is that the inter-modal (language-image) alignment loss neglects intra-modal (image-image) alignment, leading to poorly calibrated similarities between image embeddings. [1,2,3]
What is our "Reevaluation"?
In this study, we question this intra-modal misalignment hypothesis. We reexamine both the theoretical arguments and the empirical techniques that have been used to demonstrate the misalignment.
Our findings reveal that neither the distribution of cosine similarities nor few-shot and retrieval metrics serve as reliable indicators of misalignment. In fact, these metrics yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2), which suggests there is no intra-modal misalignment stemming from contrastive language-image training. We argue that the observed phenomena can be explained without assuming a fundamental flaw in the image embedding space. Experiments on the commonly studied intra-modal tasks, image retrieval and few-shot classification, confirm that addressing the supposed misalignment is unnecessary for achieving strong performance.
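One of the diagnostics discussed above is the distribution of pairwise cosine similarities within a set of image embeddings. A minimal sketch of how such statistics can be computed is shown below; the arrays here are random stand-ins for embeddings, and the encoder names in the comments are illustrative only, not the paper's actual setup.

```python
import numpy as np

def cosine_similarity_stats(embeddings: np.ndarray) -> tuple[float, float]:
    """Mean and std of the off-diagonal pairwise cosine similarities."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Exclude self-similarities (the diagonal is always 1.0).
    off_diag = sims[~np.eye(len(sims), dtype=bool)]
    return float(off_diag.mean()), float(off_diag.std())

# Random stand-ins for image embeddings from two hypothetical encoders
# (e.g. a language-image model vs. an image-only model); in practice
# these would be replaced by real model outputs on the same image set.
rng = np.random.default_rng(0)
model_a = rng.normal(size=(100, 512))
model_b = rng.normal(size=(100, 768))

print(cosine_similarity_stats(model_a))
print(cosine_similarity_stats(model_b))
```

Comparing these summary statistics (or the full histograms) across models is the kind of analysis the study argues does not, by itself, reveal intra-modal misalignment.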
References
[1] Mistretta et al.: "Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion", ICLR 2025.
[2] Yi et al.: "Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification", CVPR 2024.
[3] Udandarao et al.: "SuS-X: Training-Free Name-Only Transfer of Vision-Language Models", ICCV 2023.