Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention [WACV 2023 bib]
Zineng Tang*, Jaemin Cho*, Jie Lei, Mohit Bansal
Learning vision-and-language representations with iterative latent attention that scales linearly with long inputs.
Perceiver-VL Architecture Overview
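Concretely, a small, fixed set of learned latent vectors repeatedly cross-attends to the long multimodal input and then self-attends among themselves, so per-layer cost grows linearly with the input length rather than quadratically. Below is a minimal PyTorch sketch of this pattern; it is illustrative only, not the repository's implementation, and the module name and hyperparameters are assumptions.

# Minimal sketch of Perceiver-style iterative latent attention
# (illustrative; not the code in model/modules/perceiver_vl.py).
import torch
import torch.nn as nn

class IterativeLatentAttention(nn.Module):
    def __init__(self, dim=768, num_latents=128, num_iters=4, num_heads=12):
        super().__init__()
        # A small set of learned latents; cross-attention cost is
        # O(num_latents * seq_len), i.e. linear in the input length.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.num_iters = num_iters

    def forward(self, inputs):  # inputs: (B, N, dim) concatenated vision/text tokens
        b = inputs.size(0)
        z = self.latents.unsqueeze(0).expand(b, -1, -1)
        for _ in range(self.num_iters):
            # Latents query the long multimodal input (cross-attention) ...
            z = z + self.cross_attn(z, inputs, inputs, need_weights=False)[0]
            # ... then refine among themselves (self-attention over latents only).
            z = self.self_attn(z)
        return z  # (B, num_latents, dim) compact multimodal representation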
conda create -n Perceiver-VL python=3.8 # You can also use another environment manager.
conda activate Perceiver-VL
pip install -r requirements.txt
TODO: Finish datasets/tasks instructions and scripts
# Pretrain on WebVid + GCC
bash scripts/co_pretrain.sh
# Pretrain on WebVid
bash scripts/webvid_pretrain.sh
# Pretrain on GCC
bash scripts/gcc_pretrain.sh
# Pretrain on ImageNet
bash scripts/imagenet_pretrain.sh
Download Checkpoint [link]
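To sanity-check the downloaded weights before finetuning, here is a minimal sketch; the file name is a placeholder, and the checkpoint is assumed to be a standard PyTorch Lightning .ckpt storing weights under "state_dict".

# Minimal sketch: inspect the downloaded checkpoint (file name is a placeholder;
# assumes a PyTorch Lightning .ckpt with weights under "state_dict").
import torch

ckpt = torch.load("perceiver_vl_pretrained.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # fall back to a raw state dict
print(f"{len(state_dict)} parameter tensors, e.g. {next(iter(state_dict))}")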
# Finetune on MSRVTT Retrieval
bash scripts/msrvtt_vrtr_finetune.sh
# Finetune on VQA
bash scripts/vqa_finetune.sh
Perceiver_VL
│
├── assets # illustrations
│ └── architecture.png
│
├── model                        # main source
│   ├── datamodules              # pytorch-lightning data wrappers
│   │   ├── datamodule_base.py
│   │   └── ...
│   ├── datasets                 # datasets
│   │   ├── vqa_dataset.py
│   │   └── ...
│   ├── gadgets
│   │   └── my_metrics.py        # metric utils
│   ├── modules
│   │   ├── heads.py             # model heads
│   │   ├── model_module.py      # pytorch-lightning wrapper for the model
│   │   ├── model_utils.py       # training metric utilities
│   │   ├── objectives.py        # pretraining/finetuning objectives
│   │   └── perceiver_vl.py      # main model
│   ├── transforms               # image transformation utils
│   │   └── ...
│   └── config.py                # all configurations
│
├── scripts # all scripts
│ ├── vqa_finetune.sh
│ ├── co_pretrain.sh
│ └── ...
│
├── run.py # main
└── requirements.txt
@inproceedings{tang2023wacv,
  title     = {Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention},
  author    = {Zineng Tang and Jaemin Cho and Jie Lei and Mohit Bansal},
  booktitle = {WACV},
  year      = {2023}
}
Our codebase is based on ViLT. We thank the authors for their open-source contributions.
Zineng Tang (zn.tang.terran@gmail.com)