This repository presents my approach to building a family of vision-language foundation models. Such models are among the most capable systems in modern artificial intelligence and can be applied to a variety of tasks, including generation, retrieval, and classification. At their core, these models combine pre-trained computer vision and natural language processing encoders.
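As a rough sketch of this dual-encoder design (the encoder names, projection size, and pooling choice below are illustrative assumptions, not the exact configuration used in this repository):

```python
import timm
import torch
import torch.nn as nn
from transformers import AutoModel


class DualEncoder(nn.Module):
    """A minimal CLIP-style dual encoder: a timm image backbone and a
    Hugging Face text backbone, each followed by a linear projection
    into a shared embedding space."""

    def __init__(self, image_model="convnext_base", text_model="roberta-base", embed_dim=256):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of classification logits
        self.image_encoder = timm.create_model(image_model, pretrained=True, num_classes=0)
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.image_proj = nn.Linear(self.image_encoder.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def encode_image(self, pixel_values):
        feats = self.image_encoder(pixel_values)
        return nn.functional.normalize(self.image_proj(feats), dim=-1)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # use the first token as a sentence summary
        return nn.functional.normalize(self.text_proj(cls), dim=-1)
```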
The models were trained and tested with Python 3.7 and PyTorch 1.7.1. The timm and transformers packages are also required and can be installed as follows:
```bash
$ pip install timm==0.6.7
$ pip install transformers
```

Training and evaluation are then started with:

```bash
python main.py
```
During this research, I trained my models with several image encoders (deit3, efficientnet_b8, convnext, swinv2, and fbnetv3) and text encoders (roberta, xlm-roberta, xlnet, albert, electra, and bert) on a small-scale dataset. I found that the choice of image and text encoders has a clear effect on model accuracy: encoders such as ConvNeXt and RoBERTa produced better results than more popular choices such as ViT and BERT. I also found that training on a small-scale dataset still produced reasonably accurate results.
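Swapping backbones in such a setup amounts to changing the model names passed to timm and transformers. A hedged sketch of the sweep, reusing the DualEncoder sketch above (the exact model identifiers used in the experiments may differ):

```python
# Candidate backbones, named as exposed by timm / Hugging Face (illustrative variants).
image_backbones = ["deit3_base_patch16_224", "tf_efficientnet_b8", "convnext_base",
                   "swinv2_base_window8_256", "fbnetv3_b"]
text_backbones = ["roberta-base", "xlm-roberta-base", "xlnet-base-cased",
                  "albert-base-v2", "google/electra-base-discriminator", "bert-base-uncased"]

for img_name in image_backbones:
    for txt_name in text_backbones:
        model = DualEncoder(image_model=img_name, text_model=txt_name)
        # ... train for 20 epochs and record Top-1 / Top-5 retrieval accuracy ...
```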
The table below shows the best results obtained after testing more than 80 pairs of image and text encoders, each trained for 20 epochs.
| Image and Text Encoders | Top-1 accuracy (%) | Top-5 accuracy (%) |
|---|---|---|
| Swin + BERT | 13.99 | 31.65 |
| Swin + ALBERT | 13.28 | 29.04 |
| Swin + RoBERTa | 14.83 | 32.64 |
| ConvNeXt + BERT | 16.63 | 35.31 |
| ConvNeXt + ALBERT | 13.46 | 29.46 |
| ConvNeXt + RoBERTa | 17.31 | 35.31 |
The model composed of the ConvNeXt and RoBERTa encoders achieved satisfactory results on several tasks:

- Image-text retrieval on datasets such as ImageNetV2 and Unsplash, using textual queries, visual queries, and combined visual + textual queries, as shown in the notebooks Search_In_Unsplash.ipynb and Image_To_Text_Search.ipynb.
- Video-text retrieval on videos from YouTube, as shown in Test_video.ipynb.
- Image classification, where it outperformed many other models on datasets such as Food-101, CIFAR-10, CIFAR-100, Describable Textures (DTD), Oxford-IIIT Pets (Pets), MNIST, STL-10, the German Traffic Sign Recognition Benchmark (GTSRB), and Rendered SST2 (SST), as shown in CR_classification.ipynb.
- Zero-shot image classification, where it proved effective on datasets such as CIFAR-100 classes mixed with random images, puppy vs. bagel, and chihuahua vs. muffin, as shown in Zero-Shot Image Classification.ipynb.
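As a rough illustration of how zero-shot classification with this kind of dual encoder is typically done (the prompt template, class names, and tokenizer choice below are illustrative assumptions; the actual code is in the notebooks above):

```python
import torch
from transformers import AutoTokenizer


@torch.no_grad()
def zero_shot_classify(model, tokenizer, pixel_values, class_names):
    """Score a batch of images against textual class prompts via cosine similarity."""
    prompts = [f"a photo of a {name}" for name in class_names]  # assumed prompt template
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    text_emb = model.encode_text(tokens["input_ids"], tokens["attention_mask"])  # (C, D)
    image_emb = model.encode_image(pixel_values)                                 # (B, D)
    # Embeddings are L2-normalized, so the dot product is the cosine similarity
    logits = image_emb @ text_emb.t()
    return logits.softmax(dim=-1)


# Example usage with hypothetical class names:
# tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# probs = zero_shot_classify(model, tokenizer, images, ["chihuahua", "muffin"])
```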
These results were obtained by training the models with a contrastive learning objective on an image-text pair dataset and then applying them in a zero-shot manner to retrieve images or texts. The model achieved all of these results despite the limited size of the Flickr30k training dataset and the small number of training epochs.
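For reference, the following is a minimal sketch of the symmetric contrastive (InfoNCE-style) objective commonly used for this kind of image-text training; the temperature value and other details are assumptions rather than the exact loss implemented in this repository:

```python
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_emb, text_emb: L2-normalized tensors of shape (batch, dim),
    where row i of each tensor comes from the same image-text pair.
    """
    logits = image_emb @ text_emb.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)    # match each caption to its image
    return (loss_i2t + loss_t2i) / 2
```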