Paper accepted for CVPR 2026.
arXiv & code soon!
What is the "Intra-Modal Misalignment Hypothesis"?
Recent research has indicated that the embeddings produced by contrastive language-image pre-training methods such as CLIP may not be ideal for image-only tasks. The prevailing explanation is that the inter-modal (language-image) alignment loss neglects intra-modal (image-image) alignment, leading to poorly calibrated similarities between image embeddings. [1,2,3]
What is our "Reevaluation"?
In this study, we question this intra-modal misalignment hypothesis. We reexamine both the theoretical arguments and the empirical techniques that have been used to demonstrate the misalignment.
Our findings reveal that neither the distribution of cosine similarities nor few-shot and retrieval metrics serve as reliable indicators of misalignment. In fact, these metrics yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2), which suggests there is no intra-modal misalignment stemming from contrastive language-image training. We argue that the observed phenomena can be explained without assuming a fundamental flaw in the image embedding space. Experiments on the commonly studied intra-modal tasks, image retrieval and few-shot classification, confirm that addressing the supposed misalignment is unnecessary for achieving strong performance.
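One of the diagnostics discussed above is the distribution of pairwise cosine similarities within a set of image embeddings. A minimal sketch of how such statistics can be computed is shown below; the arrays here are random stand-ins for embeddings, and the encoder names in the comments are illustrative only, not the paper's actual setup.

```python
import numpy as np

def cosine_similarity_stats(embeddings: np.ndarray) -> tuple[float, float]:
    """Mean and std of the off-diagonal pairwise cosine similarities."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Exclude self-similarities (the diagonal is always 1.0).
    off_diag = sims[~np.eye(len(sims), dtype=bool)]
    return float(off_diag.mean()), float(off_diag.std())

# Random stand-ins for image embeddings from two hypothetical encoders
# (e.g. a language-image model vs. an image-only model); in practice
# these would be replaced by real model outputs on the same image set.
rng = np.random.default_rng(0)
model_a = rng.normal(size=(100, 512))
model_b = rng.normal(size=(100, 768))

print(cosine_similarity_stats(model_a))
print(cosine_similarity_stats(model_b))
```

Comparing these summary statistics (or the full histograms) across models is the kind of analysis the study argues does not, by itself, reveal intra-modal misalignment.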
References
[1] Mistretta et al.: "Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion", ICLR 2025.
[2] Yi et al.: "Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification", CVPR 2024.
[3] Udandarao et al.: "SuS-X: Training-Free Name-Only Transfer of Vision-Language Models", ICCV 2023.