Official Implementation for Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning (CVPR 2024).
Generative vision-language models (VLMs) have shown strong performance on vision-language tasks such as image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage instruction tuning, which relies heavily on human-labeled or large language model-generated data and is therefore costly. To address this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby enhancing instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct ICCC training samples from image-text datasets at low labeling and computation cost. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based VL tasks through ICCC instruction tuning.
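For a concrete sense of the task format, here is a toy ICCC sample (purely illustrative; the prompt wording and field names below are assumptions, not taken from our data pipeline): the model receives an image together with a corrupted caption and must generate the corrected caption.

```python
# Toy ICCC sample (illustrative only; prompt wording and field names are assumptions).
iccc_sample = {
    "image": "example.jpg",                                       # placeholder image path
    "input": "correct the caption: a dog sitting on a red car",   # concept "cat" corrupted to "dog"
    "target": "a cat sitting on a red car",                       # original, image-consistent caption
}
```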
- Create a conda environment and install PyTorch:
conda create -n lavis python=3.8
conda activate lavis
# CUDA 11.8
conda install pytorch==2.0.0 torchvision==0.15.0 pytorch-cuda=11.8 -c pytorch -c nvidia
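Optionally, sanity-check the installation before proceeding (a minimal snippet, assuming a CUDA-capable machine):

```python
# Optional sanity check for the PyTorch/CUDA installation.
import torch

print(torch.__version__)           # expected: 2.0.0
print(torch.cuda.is_available())   # expected: True on a machine with CUDA 11.8 drivers
```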
- Install other dependencies:
pip install -r requirements_iccc.txt
# Hugging Face transformers version: 4.29.2
Our work is built upon LAVIS, sharing the majority of its requirements.
- Build from source
pip install -e .
- Download pre-trained parameters of LLMs and VLMs:
We use the same pre-trained parameters as the Large Language Models (LLMs) and Vision-Language Models (VLMs) implemented in LAVIS. To speed up loading, we download the checkpoints manually and replace the model parameter URLs in the configs with local paths.
For the Vicuna LLM, first prepare the Vicuna 7B v1.1 weights available here. Then set the llm_model field in the model config to point to the folder containing the Vicuna weights.
We leverage the Visual Genome and COCO image-text datasets, consistent with BLIP in LAVIS. To prepare the data, please refer to the instructions provided by LAVIS.
Before ICCC training, the data generation step parses the image-text data with a dependency parser to extract dependency structures and a concept base used to construct the training data.
python iccc_data_generation.py path/to/data/dir data_filename_without_suffix
# Example for the COCO dataset:
# python iccc_data_generation.py data/coco/annotations coco_karpathy_val
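As a rough sketch of what this step does (a minimal illustration, not the actual iccc_data_generation.py; the choice of spaCy and the output format below are assumptions), the dependency parser extracts noun/verb/adjective concepts and dependency arcs from each caption and aggregates the concepts into a counter that serves as the concept base:

```python
# Minimal sketch of caption parsing with a dependency parser (spaCy assumed here;
# the repository's iccc_data_generation.py may differ in parser and output format).
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def parse_caption(caption: str):
    """Return (lemma, POS) concepts for nouns/verbs/adjectives and dependency arcs."""
    doc = nlp(caption)
    concepts = [(tok.lemma_, tok.pos_) for tok in doc if tok.pos_ in {"NOUN", "VERB", "ADJ"}]
    arcs = [(tok.head.text, tok.dep_, tok.text) for tok in doc]
    return concepts, arcs

concept_counter = Counter()
for caption in ["a black cat sits on a wooden table"]:
    concepts, arcs = parse_caption(caption)
    concept_counter.update(concepts)

print(concept_counter)  # e.g. Counter({('cat', 'NOUN'): 1, ('sit', 'VERB'): 1, ...})
```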
After parsing, update the data file paths in the ICCC dataset configurations (lavis/configs/datasets/coco/defaults_cap_iccc.yaml for COCO and lavis/configs/datasets/vg/defaults_cap_iccc.yaml for Visual Genome).
You can also download the processed data from https://huggingface.co/rj979797/ICCC_CVPR2024/tree/main.
- COCO dataset: coco_karpathy_train_cnter_nva.json, coco_karpathy_train_w_parsing_nva.json
- VG dataset: vg_caption_cnter_nva.json, vg_caption_w_parsing_nva.json
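To quickly check a downloaded file, you can load it and inspect its top-level structure (the path below assumes the COCO annotation directory from the example above; the exact schema is not documented here):

```python
# Peek at a downloaded ICCC annotation file (path and schema are assumptions).
import json

path = "data/coco/annotations/coco_karpathy_train_w_parsing_nva.json"
with open(path) as f:
    data = json.load(f)

print(type(data), len(data))
# Show one entry (if a list) or a few items (if a dict) to see the structure.
print(data[0] if isinstance(data, list) else list(data.items())[:3])
```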
The model weights can be downloaded from https://huggingface.co/rj979797/ICCC_CVPR2024/tree/main
| Model | GQA | OK-VQA | VQAv2 | VSR | NoCaps BLEU@4 | NoCaps SPICE | NoCaps CIDEr | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT2.7B | 33.5 | 26.6 | 51.9 | 48.3 | 43.6 | 13.8 | 105.7 | |
| OPT2.7B w/ ICCC | 38.2 | 29.5 | 54.3 | 47.6 | 46.0 | 14.3 | 111.9 | blip2-iccc-2.7B.pth |
| OPT6.7B | 35.5 | 30.7 | 52.6 | 48.5 | 41.5 | 13.0 | 101.4 | |
| OPT6.7B w/ ICCC | 38.3 | 31.7 | 58.8 | 51.5 | 44.1 | 13.5 | 106.9 | blip2-iccc-6.7B.pth |
- Run ICCC instruction tuning for BLIP-2 (OPT2.7B and OPT6.7B):
python -m torch.distributed.run --master_port 23619 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage2_opt2.7_iccc_iter.yaml --job-name blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt --swap-ratio 0.15 --aug-ratio 0.30
python -m torch.distributed.run --master_port 23619 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage2_opt6.7_iccc_iter.yaml --job-name blip2-new_aug_opt6.7-0.3-aug-0.15_swap --swap-ratio 0.15 --aug-ratio 0.30
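The --swap-ratio and --aug-ratio flags presumably control the fraction of captions corrupted by concept swapping and the overall share of ICCC-augmented samples; see the training configs for their exact semantics. A minimal sketch of the core swap corruption (not the repository's implementation) looks like this:

```python
# Minimal sketch (not the repository's implementation) of a swap-based ICCC sample:
# two same-category concepts in a caption trade places, and the model is trained
# to recover the original caption conditioned on the image.
import random

def swap_concepts(tokens, concept_positions):
    """Swap two tokens drawn from positions holding same-category concepts."""
    if len(concept_positions) < 2:
        return tokens
    i, j = random.sample(concept_positions, 2)
    corrupted = tokens.copy()
    corrupted[i], corrupted[j] = corrupted[j], corrupted[i]
    return corrupted

caption = "a cat sits on a wooden table".split()
noun_positions = [1, 6]                      # "cat" and "table", as found by the parser
corrupted = swap_concepts(caption, noun_positions)
print(" ".join(corrupted))                   # e.g. "a table sits on a wooden cat"
# Training pair: input = corrupted caption (+ image), target = original caption.
```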
Use eval_script_gen.py to generate evaluation scripts for all downstream tasks. The script writes the evaluation launch commands into the run_scripts/batch_eval directory.
python eval_script_gen.py /path/to/experiment/dir
Executing the generated bash scripts evaluates all intermediate fine-tuning checkpoints on each downstream task:
bash run_scripts/batch_eval/2024041915535-blip2-new_aug_opt2.7-0.3-aug-0.15_swap_lr5e6-train-gqa.sh ;
bash run_scripts/batch_eval/2024041915535-blip2-new_aug_opt2.7-0.3-aug-0.15_swap_lr5e6-train-okvqa.sh ;
bash run_scripts/batch_eval/2024041915535-blip2-new_aug_opt2.7-0.3-aug-0.15_swap_lr5e6-train-coco_cap.sh ;
bash run_scripts/batch_eval/2024041915535-blip2-new_aug_opt2.7-0.3-aug-0.15_swap_lr5e6-train-nocap.sh ;
bash run_scripts/batch_eval/2024041915535-blip2-new_aug_opt2.7-0.3-aug-0.15_swap_lr5e6-train-vqav2.sh;
You can also use the default evaluate.py from the LAVIS evaluation toolkit. Provide --cfg-path and --ckpt-path to specify the target downstream task and the model checkpoint to evaluate.
Below are examples illustrating the usage:
# GQA
python -m torch.distributed.run --master_port 23199 --nproc_per_node=4 evaluate.py --cfg-path "lavis/projects/blip2/eval/gqa_zeroshot_opt2.7b_eval.yaml" --job-name "Pretrain_stage2-2023103113105-blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt-role-all-train-checkpoint_7500" --ckpt-path "lavis/output/BLIP2/Pretrain_stage2/2023103113105-blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt-role-all-train/checkpoint_7500.pth"
# VQAv2
python -m torch.distributed.run --master_port 23199 --nproc_per_node=4 evaluate.py --cfg-path "lavis/projects/blip2/eval/vqav2_zeroshot_opt2.7b_eval.yaml" --job-name "Pretrain_stage2-2023103113105-blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt-role-all-train-checkpoint_7500" --ckpt-path "lavis/output/BLIP2/Pretrain_stage2/2023103113105-blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt-role-all-train/checkpoint_7500.pth"
# OKVQA
python -m torch.distributed.run --master_port 23199 --nproc_per_node=4 evaluate.py --cfg-path "lavis/projects/blip2/eval/okvqa_zeroshot_opt2.7b_eval.yaml" --job-name "Pretrain_stage2-2023103113105-blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt-role-all-train-checkpoint_7500" --ckpt-path "lavis/output/BLIP2/Pretrain_stage2/2023103113105-blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt-role-all-train/checkpoint_7500.pth"
# NoCaps
python -m torch.distributed.run --master_port 23199 --nproc_per_node=4 evaluate.py --cfg-path "lavis/projects/blip2/eval/caption_nocap_opt2.7b_eval.yaml" --job-name "Pretrain_stage2-2023103113105-blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt-role-all-train-checkpoint_7500" --ckpt-path "lavis/output/BLIP2/Pretrain_stage2/2023103113105-blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt-role-all-train/checkpoint_7500.pth"
# COCO Caption
python -m torch.distributed.run --master_port 23199 --nproc_per_node=4 evaluate.py --cfg-path "lavis/projects/blip2/eval/caption_coco_opt2.7b_eval.yaml" --job-name "Pretrain_stage2-2023103113105-blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt-role-all-train-checkpoint_7500" --ckpt-path "lavis/output/BLIP2/Pretrain_stage2/2023103113105-blip2-new_aug_opt2.7-0.3-aug-0.15_swap-simple_prompt-role-all-train/checkpoint_7500.pth"
If you find this project helpful for your research, please consider citing our paper:
@misc{li2024learning,
title={Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning},
author={Rongjie Li and Yu Wu and Xuming He},
year={2024},
eprint={2404.00909},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This repository is built on LAVIS.