Contrastive Video Question Answering via Video Graph Transformer

Abstract

This repo holds the code for our paper CoVGT, accepted to IEEE T-PAMI'23. The work extends our preliminary publication at ECCV'22. Compared with the conference version, we highlight the following differences:
  • Joint supervised and self-supervised contrastive objectives to optimize VGT.
  • A stronger language model (e.g., RoBERTa) in place of BERT for QA embedding.
  • Extended results on Causal-VidQA and STAR-QA, and more comprehensive ablation studies.
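
For readers unfamiliar with contrastive objectives, the following is a minimal, generic sketch of an InfoNCE-style loss in plain Python. It is not the CoVGT implementation: the function names, the dot-product similarity, the `temperature` parameter, and the toy vectors are all assumptions made for illustration. How CoVGT actually forms positives and negatives is defined in the paper and the training code.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(anchor, candidates, positive_index, temperature=1.0):
    """Softmax cross-entropy over anchor-candidate similarities.

    The loss is small when the positive candidate is more similar to the
    anchor than the negatives are. All names here are illustrative.
    """
    logits = [dot(anchor, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[positive_index]


# Toy example (made-up numbers): one video embedding, three candidate embeddings.
video = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
loss_when_aligned = contrastive_loss(video, candidates, positive_index=0)
loss_when_misaligned = contrastive_loss(video, candidates, positive_index=2)
```

In this toy setup the loss is lower when the positive is the candidate closest to the video embedding, which is the behaviour a contrastive objective rewards.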

The code is based on VGT.

Illustration of contrastive learning strategy

Todo

  1. Release features of other datasets. Please email the first author and state your reason for the request, as the data is strictly for research purposes.

Environment

Assuming you have installed Anaconda3 and CUDA (version > 11.0) on a machine with GPU memory >= 24 GB, set up the environment as follows:

>conda create -n videoqa python==3.8.16
>conda activate videoqa
>git clone https://github.com/doc-doc/CoVGT.git
>cd CoVGT
>pip install -r requirements.txt
>conda install pytorch==1.8.1 torchvision==0.9.1 cudatoolkit=11.1 -c pytorch -c nvidia
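
After running the commands above, a quick sanity check can confirm that the key packages are importable in the activated environment. The helper below is a hypothetical sketch that is not part of this repo; it uses only the standard library and reports whether `torch` and `torchvision` can be found, without importing them or checking their versions.

```python
import importlib.util
import sys


def check_environment(required=("torch", "torchvision")):
    """Return the Python version and whether each required package is importable.

    Hypothetical helper for illustration; it is not shipped with the repo.
    """
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in required:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, value)
```

If either package is reported as `False`, re-run the `conda install` step inside the `videoqa` environment.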

Preparation

Please create a data folder outside this repo, so that your workspace contains two folders: 'workspace/data/' and 'workspace/CoVGT/'.

Below we use NExT-QA as an example to get you familiar with the code. Please download the related video features and QA annotations from the links provided in the Results and Resources section. Note that, after you clone this repo, the QA annotations are saved into workspace/CoVGT/datasets/nextqa/, the video features into workspace/data/nextqa/, and the checkpoint files into workspace/data/save_models/nextqa/. To use your own datasets, change the default paths in global_parameters.py and args.py.
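
The folder layout described above can be checked with a short script before running anything. This is a sketch only: the helper names `missing_dirs` and `create_data_dirs` are hypothetical and not part of the repo, and the relative paths are taken from the paragraph above, assuming the workspace root is the folder that contains both 'data/' and 'CoVGT/'.

```python
from pathlib import Path

# Relative paths taken from the Preparation text, under the workspace root.
EXPECTED_DIRS = (
    "CoVGT/datasets/nextqa",    # QA annotations
    "data/nextqa",              # video features
    "data/save_models/nextqa",  # checkpoint files
)


def missing_dirs(workspace):
    """Return the expected directories that do not exist yet."""
    root = Path(workspace)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


def create_data_dirs(workspace):
    """Create only the folders under data/; the repo folders come from git clone."""
    root = Path(workspace)
    for rel in EXPECTED_DIRS:
        if rel.startswith("data/"):
            (root / rel).mkdir(parents=True, exist_ok=True)
```

For example, `create_data_dirs("workspace")` followed by `missing_dirs("workspace")` should return an empty list once the repo has been cloned into 'workspace/CoVGT/'.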

Inference

./shell/next_test.sh 0

Evaluation

python eval_next.py --folder CoVGT_FTCoWV --mode test
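
As a rough illustration of what a multi-choice QA evaluation computes, the sketch below calculates accuracy from a dictionary of predictions. The input format (question ID mapped to a `prediction` and an `answer` index) is an assumption made for this example and may differ from what eval_next.py actually reads; the function name `accuracy` is likewise hypothetical.

```python
def accuracy(results):
    """Percentage of questions whose predicted choice matches the answer.

    `results` is assumed to map a question ID to a dict with integer
    'prediction' and 'answer' fields (a hypothetical format).
    """
    if not results:
        return 0.0
    correct = sum(1 for r in results.values() if r["prediction"] == r["answer"])
    return 100.0 * correct / len(results)


# Made-up example with four questions, three answered correctly.
example = {
    "q1": {"prediction": 0, "answer": 0},
    "q2": {"prediction": 3, "answer": 3},
    "q3": {"prediction": 1, "answer": 4},
    "q4": {"prediction": 2, "answer": 2},
}
example_accuracy = accuracy(example)
```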

Results and Resources

Table 1. VideoQA Accuracy (%) on Test Set.

| Cross-Modal Pretrain | NExT-QA | Causal-VidQA | STAR | TGIF-QA (Action) | TGIF-QA (Trans) | TGIF-QA (FrameQA) | TGIF-QA-R* (Action) | TGIF-QA-R* (Trans) | MSRVTT-QA |
|---|---|---|---|---|---|---|---|---|---|
| - | 59.4 | 59.1 | 44.0 | 94.7 | 97.6 | 61.6 | 60.8 | 73.8 | 38.3 |
| WebVid0.18M | 59.7 | 60.8 | 46.2 | 91.3 | 96.2 | 61.7 | 61.0 | 73.2 | 40.0 |
| - | feats | feats | feats | feats | feats | feats | feats | feats | feats |
| - | videos | videos | videos | videos | videos | videos | videos | videos | videos |
| - | Q&A | Q&A | Q&A | Q&A | Q&A | Q&A | Q&A | Q&A | Q&A |
(The feature files are identical to VGT. We have merged some files of the same dataset to avoid too many links.)
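
To make the effect of cross-modal pretraining in Table 1 easier to read, the snippet below computes the per-dataset difference between the WebVid0.18M row and the row without pretraining. The numbers are copied from Table 1; the variable and function names are introduced here for illustration only.

```python
DATASETS = [
    "NExT-QA", "Causal-VidQA", "STAR",
    "TGIF-QA (Action)", "TGIF-QA (Trans)", "TGIF-QA (FrameQA)",
    "TGIF-QA-R* (Action)", "TGIF-QA-R* (Trans)", "MSRVTT-QA",
]
# Test accuracy (%) from Table 1.
NO_PRETRAIN = [59.4, 59.1, 44.0, 94.7, 97.6, 61.6, 60.8, 73.8, 38.3]
WEBVID_PRETRAIN = [59.7, 60.8, 46.2, 91.3, 96.2, 61.7, 61.0, 73.2, 40.0]


def pretraining_gains():
    """Accuracy change (percentage points) from adding WebVid0.18M pretraining."""
    return {
        name: round(pre - base, 1)
        for name, base, pre in zip(DATASETS, NO_PRETRAIN, WEBVID_PRETRAIN)
    }


if __name__ == "__main__":
    for name, gain in pretraining_gains().items():
        print(f"{name}: {gain:+.1f}")
```

By these numbers, pretraining helps on six of the nine columns and lowers accuracy on TGIF-QA (Action), TGIF-QA (Trans), and TGIF-QA-R* (Trans).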

Train

We have provided all the scripts in the folder 'shells'. You can start training by specifying the GPU IDs after the script name. (If you have multiple GPUs, separate the IDs with commas: ./shell/nextqa_train.sh 0,1)

./shell/nextqa_train.sh 0

It will train the model and save it to the folder 'save_models/nextqa/CoVGT/'. You should get accuracies of around 60.1% and 59.4% on the val and test sets, respectively.

Result Visualization (NExT-QA)

VGT vs VGT without DGT

Citations

@article{xiao2023contrastive,
  author={Junbin Xiao and Pan Zhou and Angela Yao and Yicong Li and Richang Hong and Shuicheng Yan and Tat Seng Chua},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title={Contrastive Video Question Answering via Video Graph Transformer},
  year={2023},
  volume={45},
  number={11},
  issn={1939-3539},
  pages={13265--13280},
  doi={10.1109/TPAMI.2023.3292266},
  publisher={IEEE Computer Society},
  address={Los Alamitos, CA, USA},
  month={nov}
}
@inproceedings{xiao2022video,
  title={Video Graph Transformer for Video Question Answering},
  author={Xiao, Junbin and Zhou, Pan and Chua, Tat-Seng and Yan, Shuicheng},
  booktitle={European Conference on Computer Vision},
  pages={39--58},
  year={2022},
  organization={Springer}
}

Notes

If you use any resources from this repo, please kindly cite our paper and acknowledge the source.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.