Etto48/CVProject

CVProject

Model graph

The image encoder is a pre-trained facebook/DINOv2-base [1] model, which encodes an image into a sequence of 768-dimensional vectors.
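
To make the "sequence of 768-dimensional vectors" concrete, here is a minimal sketch that computes the expected shape of the encoder output. It assumes the standard ViT-B/14 layout of DINOv2-base (14x14 pixel patches, one extra class token, hidden size 768); the function name `dinov2_sequence_shape` and its parameters are hypothetical and are not taken from this repository.

```python
def dinov2_sequence_shape(height, width, patch_size=14, hidden_size=768, batch_size=1):
    """Return the (batch, tokens, features) shape a ViT-style encoder would produce.

    Assumption: one token per non-overlapping patch plus a single class token,
    as in the base DINOv2 checkpoint without register tokens.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    num_patches = (height // patch_size) * (width // patch_size)
    # +1 accounts for the class token prepended to the patch tokens.
    return (batch_size, num_patches + 1, hidden_size)


# A 224x224 input gives 16 * 16 = 256 patches plus the class token.
print(dinov2_sequence_shape(224, 224))
```

Under these assumptions, a 224x224 image yields 257 vectors of 768 features each. If the model is loaded through the Hugging Face `transformers` library, the encoder's `last_hidden_state` should have this shape; the demo notebook shows how the project actually uses the encoder.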

The overall architecture of the model is inspired by microsoft/GIT-base [2].
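
For orientation, the GIT paper feeds image tokens and text tokens to a single transformer and uses a mask in which image tokens attend to each other, while each text token attends to all image tokens and to the text tokens up to and including itself. The sketch below builds such a mask in plain Python. It illustrates the GIT idea only: whether this project uses the same masking is an assumption, and `git_style_attention_mask` is a hypothetical helper, not part of the repository.

```python
def git_style_attention_mask(num_image_tokens, num_text_tokens):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Layout assumption: image tokens come first, followed by text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_image_tokens:
                # Every token (image or text) may look at every image token.
                mask[q][k] = True
            elif q >= num_image_tokens and k <= q:
                # Text tokens are causal: they see earlier text and themselves.
                mask[q][k] = True
    return mask


# Two image tokens followed by three text tokens.
for row in git_style_attention_mask(2, 3):
    print(["x" if allowed else "." for allowed in row])
```

Image tokens never attend to text tokens in this scheme, so the image representation does not depend on the caption being generated.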

Refer to the demo notebook for a deeper explanation of the model and its components. You can find it here.

Demo

You can run the demo notebook in Google Colab by clicking the button below:

Open In Colab

Here is a sample of the captions the model generates with each of the implemented algorithms:

Sample captions

Footnotes

  1. Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision", 2024.

  2. Wang et al., "GIT: A Generative Image-to-text Transformer for Vision and Language", 2022.
