In your paper, Figure 11, "Visualization of Redundancy in the SigLIP Model," visualizes the attention maps of SigLIP. Why do SigLIP's attention maps focus on roughly the same positions across different images, while CLIP does not exhibit this problem?
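For reference, below is a minimal sketch of how such attention maps could be extracted and compared, assuming the HuggingFace `transformers` checkpoints "google/siglip-base-patch16-224" and "openai/clip-vit-base-patch32"; the choice of the last layer, head averaging, and the file name "example.jpg" are illustrative assumptions, not necessarily the exact procedure used for Figure 11.

```python
# Hedged sketch: compare per-token attention mass in the last self-attention layer
# of SigLIP vs. CLIP vision towers (assumed setup, not the paper's exact pipeline).
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipVisionModel, CLIPVisionModel

def mean_attention_received(model, processor, image):
    """Average last-layer attention over heads and query positions,
    giving the attention mass each token receives."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model(pixel_values=inputs["pixel_values"], output_attentions=True)
    attn = out.attentions[-1]                # (batch, heads, tokens, tokens)
    return attn.mean(dim=1)[0].mean(dim=0)   # average over heads, then over queries

image = Image.open("example.jpg")  # hypothetical input image

# SigLIP vision tower: patch tokens only (no [CLS] token).
siglip = SiglipVisionModel.from_pretrained(
    "google/siglip-base-patch16-224", attn_implementation="eager")
siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
siglip_attn = mean_attention_received(siglip, siglip_proc, image)

# CLIP vision tower: [CLS] token followed by patch tokens.
clip = CLIPVisionModel.from_pretrained(
    "openai/clip-vit-base-patch32", attn_implementation="eager")
clip_proc = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_attn = mean_attention_received(clip, clip_proc, image)

print(siglip_attn.shape, clip_attn.shape)
```

Repeating this over several images and plotting the per-patch attention as a heatmap would show whether the highly attended positions stay fixed across images (as observed for SigLIP in Figure 11) or move with the image content (as for CLIP).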