Minimal working example illustrating the use of CLIP (Contrastive Language-Image Pre-Training) embeddings.
The example uses (image, caption) pairs from Google's Conceptual Captions dataset. The data is available via the Hugging Face Hub, and CLIP is used via the official OpenAI implementation at https://github.com/openai/CLIP.
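A minimal sketch of the data-loading step, assuming the Hub dataset id `google-research-datasets/conceptual_captions` (which stores `image_url` and `caption` fields rather than the images themselves, so each image must be fetched over HTTP); the `fetch_image` helper and the sample size of 100 are illustrative choices, not part of the original example.

```python
# pip install datasets git+https://github.com/openai/CLIP.git umap-learn pillow requests
import io
from typing import Optional

import requests
from datasets import load_dataset
from PIL import Image

# Conceptual Captions on the Hugging Face Hub provides image URLs and captions;
# dataset id and split name are assumptions, adjust to the version you use.
dataset = load_dataset(
    "google-research-datasets/conceptual_captions", split="train", streaming=True
)

def fetch_image(url: str) -> Optional[Image.Image]:
    """Download a single image; return None if the URL is dead or unreadable."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return Image.open(io.BytesIO(response.content)).convert("RGB")
    except Exception:
        return None

# Collect a small sample of (image, caption) pairs for the demo.
pairs = []
for example in dataset:
    image = fetch_image(example["image_url"])
    if image is not None:
        pairs.append((image, example["caption"]))
    if len(pairs) >= 100:
        break
```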
In the example, both images and captions are embedded with CLIP, and the embeddings are then projected to a low-dimensional space via UMAP.
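A sketch of the embedding and projection step, continuing from the `pairs` list collected above; the model name (`ViT-B/32`), the cosine metric, and the 2D target dimension are assumed choices rather than ones prescribed by the example.

```python
import clip
import numpy as np
import torch
import umap

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

images, captions = zip(*pairs)

with torch.no_grad():
    # Encode images: apply CLIP's preprocessing, then run the image encoder.
    image_input = torch.stack([preprocess(img) for img in images]).to(device)
    image_features = model.encode_image(image_input)

    # Encode captions: tokenize, then run the text encoder.
    text_tokens = clip.tokenize(list(captions), truncate=True).to(device)
    text_features = model.encode_text(text_tokens)

# L2-normalize so image and text embeddings lie on the same unit sphere.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Project both modalities jointly into 2D with UMAP; the cosine metric
# matches the geometry CLIP embeddings are trained with.
embeddings = np.concatenate(
    [image_features.float().cpu().numpy(), text_features.float().cpu().numpy()]
)
reducer = umap.UMAP(n_components=2, metric="cosine")
projection = reducer.fit_transform(embeddings)

# The first len(images) rows are image points, the remainder are caption points.
image_points = projection[: len(images)]
caption_points = projection[len(images):]
```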