When and why vision-language models behave like bags-of-words, and what to do about it? (ICLR 2023 Oral)
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?".
This paper got an Oral (notable-top-5%) at ICLR 2023! You can find our camera-ready version here.
Important Note: Thank you for your interest. I apologize for the delay in releasing the code and the camera-ready version; I will do my best to make up for the missing bits as soon as possible. I am currently in Turkey after the devastating Turkey-Syria earthquake. Not only I, but also tens of thousands of people, lost their families and homes. Please consider donating, and at the very least please ask your friends with connections to the region how they are doing.
Below we give details about how to easily use our dataset and models, and reproduce our experiments.
It's very easy to use the VG-Relation and VG-Attribution datasets. Here's an example:
import clip
from dataset_zoo import VG_Relation, VG_Attribution
model, image_preprocess = clip.load("ViT-B/32", device="cuda")
root_dir="/path/to/aro/datasets"
# Setting download=True will download the dataset to `root_dir` if it's not already there.
# For VG-R and VG-A, this is a 1GB zip file that is a subset of GQA.
# Pass the preprocessing function returned by clip.load above.
vgr_dataset = VG_Relation(image_preprocess=image_preprocess, download=True, root_dir=root_dir)
vga_dataset = VG_Attribution(image_preprocess=image_preprocess, download=True, root_dir=root_dir)
# Do anything with the dataset. Each item will look like this:
# item = {"image_options": [image], "caption_options": [false_caption, true_caption]}
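Since each item lists its caption options as [false_caption, true_caption], a natural check is whether a model scores the true caption (index 1) higher than the false one. The snippet below is only a minimal sketch of that bookkeeping: the helper name caption_choice_accuracy and the toy scores are hypothetical, not part of this repo, and the evaluation code used for the paper may differ.

```python
from typing import Sequence


def caption_choice_accuracy(
    scores: Sequence[Sequence[float]], true_index: int = 1
) -> float:
    """Fraction of items whose highest-scoring caption is the true one.

    `scores[i][j]` is the image-caption similarity of item i for caption
    option j. With caption_options = [false_caption, true_caption], the
    true caption sits at index 1 (hypothetical helper, not a repo API).
    """
    if not scores:
        raise ValueError("scores must contain at least one item")
    correct = 0
    for item_scores in scores:
        # Argmax over the caption options; ties resolve to the lowest index.
        predicted = max(range(len(item_scores)), key=lambda j: item_scores[j])
        if predicted == true_index:
            correct += 1
    return correct / len(scores)


# Made-up similarity scores for three items: [false_caption, true_caption].
toy_scores = [[0.21, 0.34], [0.40, 0.38], [0.10, 0.55]]
accuracy = caption_choice_accuracy(toy_scores)
print(f"accuracy = {accuracy:.3f}")
```

In practice the scores would come from comparing the image embedding with each caption embedding; here they are placeholders so the snippet runs on its own.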
The COCO-Order and Flickr30k-Order datasets require the COCO and Flickr30k retrieval datasets. We provide an interface to download COCO (set download=True in the constructor); for Flickr30k, however, you need to sign up and download it yourself. You can find the Flickr30k retrieval dataset here.
from dataset_zoo import COCO_Order, Flickr30k_Order
coco_order_dataset = COCO_Order(image_preprocess=image_preprocess, download=True, root_dir=root_dir)
# Flickr30k must be downloaded manually into root_dir beforehand.
flickr_order_dataset = Flickr30k_Order(image_preprocess=image_preprocess, root_dir=root_dir)
See the notebook in notebooks/ for a quick way to reproduce some of the results in the paper. We provide a notebook to reproduce the VG-Relation and VG-Attribution datasets here.
We experiment with several models; let us know if there are others you would like us to add. Currently you can find BLIP, CLIP, Flava, and XVLM. Please see the model_zoo/ folder for more details. This work is heavily inspired by, and would not be possible without, the awesome repos for BLIP, CLIP, Flava, OpenCLIP, and XVLM. A huge, huge thanks to them for open-sourcing their models / implementations! Here's a summary of what we have now:
Model Name | Model File in this Repo | Repo |
---|---|---|
BLIP | BLIP implementation | https://github.com/salesforce/BLIP |
CLIP | CLIP implementation | https://github.com/openai/CLIP |
Flava | Flava implementation | https://huggingface.co/facebook/flava-full |
XVLM | XVLM implementation | https://github.com/zengyan-97/X-VLM |
NegCLIP | NegCLIP was trained with a fork of the open_clip repo. Find the ckpt info here | https://github.com/vinid/open_clip |
COCA & CLIP on LAION | We added the usage of the other models in the open_clip repo. | https://github.com/mlfoundations/open_clip |
We trained NegCLIP with a fork of the open_clip repo. You can find the fork here. Our modifications are minor, and you will find a detailed description of the main edits here.
We trained the model using a single GPU, which is quite a limitation; we plan to add support for the distributed setting in the future. Here's the command to reproduce the results:
CUDA_VISIBLE_DEVICES=0 python -m training.main \
--train-data="./mscoco_with_negatives_training.csv" \
--batch-size=256 \
--epochs=5 \
--name="negclip_256_1e-6" \
--lr=1e-6 \
--val-data="./mscoco_with_negatives_valid.csv" \
--logs="./logs/negCLIP/" \
--pretrained="openai" \
--model="ViT-B-32" \
--workers 14 \
--warmup 50
Note that batch_size=256 would result in a matrix of size 512x1024 with negatives.
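One way to read these numbers is sketched below. It rests on assumptions that this README does not spell out: that each of the 256 sampled images brings along one hard-negative image (512 images in total), and that every one of those images contributes its true caption plus one hard-negative caption (1024 texts in total). The function name negclip_matrix_shape is hypothetical; please check the fork for the exact batching.

```python
from typing import Tuple


def negclip_matrix_shape(batch_size: int) -> Tuple[int, int]:
    """Shape (num_images, num_texts) of the image-text matrix for one batch.

    Hypothetical helper based on the assumptions stated above; it is not
    taken from the training code.
    """
    # Assumption: one hard-negative image accompanies each sampled image.
    num_images = 2 * batch_size
    # Assumption: each image has a true caption and one hard-negative caption.
    num_texts = 2 * num_images
    return num_images, num_texts


shape = negclip_matrix_shape(256)
print(shape)  # (512, 1024)
```

Under these assumptions, memory use grows with the product of both dimensions, which is worth keeping in mind when raising the batch size on a single GPU.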
If you use this code or data, please consider citing our paper:
@inproceedings{
yuksekgonul2023when,
title={When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?},
author={Mert Yuksekgonul and Federico Bianchi and Pratyusha Kalluri and Dan Jurafsky and James Zou},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=KRLUvxh8uaX}
}
Current TODO List.
Name | Description | Status |
---|---|---|
Add support for distributed training | We trained NegCLIP with a single GPU, and we plan to add support for distributed training in the future. | ✅ |
Add negative generation | How to generate negatives for NegCLIP. This could also be on the forked repo. | ✅ |
Please let us know if you have further questions or comments. You can reach out to me at merty@stanford.edu.