CVProject

The image encoder is a pre-trained facebook/DINOv2-base¹ model, which encodes each image into a sequence of 768-dimensional vectors.
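To make the shape of that vector sequence concrete, here is a minimal sketch of the arithmetic. It assumes the ViT-B/14 layout of DINOv2-base (14x14 pixel patches plus one leading CLS token) and a 224x224 input. The helper name `dinov2_token_sequence_shape` is hypothetical and not part of this repository.

```python
def dinov2_token_sequence_shape(height, width, patch_size=14, hidden_size=768):
    """Return (sequence_length, hidden_size) for one encoded image.

    Assumption: the encoder splits the image into non-overlapping
    patch_size x patch_size patches and prepends a single CLS token,
    as the base DINOv2 ViT-B/14 does.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    num_patches = (height // patch_size) * (width // patch_size)
    # One extra position for the CLS token at the start of the sequence.
    return (num_patches + 1, hidden_size)


# A 224x224 image gives 16 * 16 = 256 patches, plus the CLS token.
print(dinov2_token_sequence_shape(224, 224))  # (257, 768)
```

Under these assumptions, each image reaches the rest of the model as 257 vectors of size 768.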

The overall architecture of the model is inspired by microsoft/GIT-base².
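As an illustration of the GIT design, the sketch below builds the attention mask described in the GIT paper: image tokens attend to each other, and each text token attends to all image tokens and to the text tokens up to and including itself. Whether this project uses exactly this mask is an assumption, and the function name `git_style_attention_mask` is hypothetical.

```python
def git_style_attention_mask(num_image_tokens, num_text_tokens):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Positions [0, num_image_tokens) are image tokens; the rest are text tokens.
    This follows the mask described in the GIT paper, not necessarily this project.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_image_tokens:
                # Every token, image or text, can see all image tokens.
                mask[q][k] = True
            elif q >= num_image_tokens:
                # Text tokens see earlier text tokens and themselves (causal).
                mask[q][k] = k <= q
            # Image queries never attend to text keys: entry stays False.
    return mask


# Render a small example: 2 image tokens followed by 3 text tokens.
for row in git_style_attention_mask(2, 3):
    print("".join("x" if allowed else "." for allowed in row))
```

With a mask like this, the caption can be generated one token at a time while the image features stay visible at every step.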

Refer to the demo notebook for a deeper explanation of the model and its components. You can find it here.

Demo

You can run the demo notebook in Google Colab by clicking the button below:

Here is a sample of the captions generated by the model with all the algorithms implemented:

¹ DINOv2: Learning Robust Visual Features without Supervision, Oquab et al., 2024
² GIT: A Generative Image-to-text Transformer for Vision and Language, Wang et al., 2022
