The image encoder is a pre-trained facebook/DINOv2-base model [1], which encodes an image into a sequence of 768-dimensional vectors.
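As a rough illustration of what "a sequence of 768-dimensional vectors" means, here is a minimal sketch. It assumes the Hugging Face checkpoint id `facebook/dinov2-base`, a patch size of 14, and a 224x224 input after preprocessing; the helper names `expected_sequence_shape` and `encode_image` are hypothetical and not taken from this repository.

```python
def expected_sequence_shape(image_size=224, patch_size=14, hidden_size=768):
    """Return (tokens, hidden_size) for one square image.

    Assumes a ViT-style encoder with one token per patch plus one [CLS] token.
    This sketch only handles image sizes that divide evenly by the patch size.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1, hidden_size


def encode_image(image, checkpoint="facebook/dinov2-base"):
    """Encode a PIL image into a token sequence.

    Needs `torch` and `transformers`, imported lazily so the shape helper
    above works without them. The checkpoint id is an assumption.
    """
    import torch
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint).eval()
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (batch, tokens, hidden_size)
    return outputs.last_hidden_state


if __name__ == "__main__":
    print(expected_sequence_shape())
```

Under these assumptions a single image yields 257 tokens (16 x 16 patches plus one [CLS] token), each of dimension 768.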
The overall architecture of the model is inspired by microsoft/GIT-base [2].
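For context, GIT as published feeds image tokens and text tokens into a single transformer and uses a sequence-to-sequence attention mask: image tokens attend to each other, while each text token attends to all image tokens and to the text tokens up to and including itself. The sketch below builds such a mask in plain Python. It illustrates the GIT scheme only; whether this model uses the same mask is an assumption, and the function name is hypothetical.

```python
def git_style_attention_mask(num_image_tokens, num_text_tokens):
    """Return mask[i][j] == True when position i may attend to position j.

    Positions 0..num_image_tokens-1 are image tokens; the rest are text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Every position (image or text) can see all image tokens.
                mask[i][j] = True
            elif i >= num_image_tokens:
                # Text-to-text attention is causal.
                mask[i][j] = j <= i
            # Image tokens never attend to text tokens (stays False).
    return mask


if __name__ == "__main__":
    for row in git_style_attention_mask(2, 3):
        print("".join("1" if allowed else "0" for allowed in row))
```

With two image tokens and three text tokens, the image rows see only the image columns, and each text row sees the image columns plus a causal prefix of the text columns.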
Refer to the demo notebook for a deeper explanation of the model and its components. You can find it here.
You can run the demo notebook in Google Colab by clicking the button below:
Here is a sample of the captions generated by the model with all the algorithms implemented:
Footnotes
1. DINOv2: Learning Robust Visual Features without Supervision, Oquab et al., 2024
2. GIT: A Generative Image-to-text Transformer for Vision and Language, Wang et al., 2022