Contrastive Video Question Answering via Video Graph Transformer

Abstract

This repo holds the code for our paper CoVGT accepted to IEEE T-PAMI'23. The work extends our preliminary publication at ECCV'22. We highlight the following differences compared to the conference version:

Joint supervised and self-supervised contrastive objectives to optimize VGT.
A stronger language model (e.g., RoBERTa) in place of BERT for QA embedding.
Extended results on Causal-VidQA and STAR-QA and more comprehensive ablation studies.

The code is based on VGT.

Illustration of contrastive learning strategy

Todo

Release features of other datasets. In the meantime, please email the first author and state your reason for the request, as the data is strictly for research purposes.

Environment

The steps below assume you have installed Anaconda3 and CUDA (version > 11.0) and have a GPU with >= 24G of memory. Set up the environment as follows:

>conda create -n videoqa python==3.8.16
>conda activate videoqa
>git clone https://github.com/doc-doc/CoVGT.git
>cd CoVGT
>pip install -r requirements.txt
>conda install pytorch==1.8.1 torchvision==0.9.1 cudatoolkit=11.1 -c pytorch -c nvidia
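
Once the commands above have finished, a quick sanity check can confirm that the environment matches the stated requirements. The snippet below is a minimal sketch and not part of the repo: the function name describe_environment is made up for illustration, and it only reports what it finds rather than enforcing the version or the 24G memory requirement.

```python
import importlib.util
import sys


def describe_environment():
    """Collect basic facts about the Python / PyTorch / CUDA setup.

    Hypothetical helper for illustration; it is not shipped with CoVGT.
    """
    info = {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "torch": None,
        "cuda_available": False,
        "gpus": [],
    }
    # Import torch only if it is installed, so the check itself never crashes.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        for idx in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(idx)
            # total_memory is reported in bytes; convert to decimal gigabytes.
            info["gpus"].append((props.name, round(props.total_memory / 1e9, 1)))
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If "torch" prints as None or "cuda_available" as False, the conda install step most likely did not complete inside the videoqa environment.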

Preparation

Please create a data folder outside this repo, so that your workspace contains two folders: 'workspace/data/' and 'workspace/CoVGT/'.

Below we use NExT-QA as an example to get you familiar with the code. Please download the related video features and QA annotations from the links provided in the Results and Resources section. Note that the QA annotations are saved into workspace/CoVGT/datasets/nextqa/ after you clone this repo; place the video features into workspace/data/nextqa/ and the checkpoint files into workspace/data/save_models/nextqa/. Change the default paths in global_parameters.py and args.py for your own datasets.
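
The folder layout described above can be checked before running anything. The following is a small sketch under the assumption that the three NExT-QA paths named in this section are the only ones that matter; the helper names expected_layout and missing_dirs are hypothetical and do not exist in the repo.

```python
from pathlib import Path


def expected_layout(workspace, dataset="nextqa"):
    """Return the three folders this README mentions for one dataset."""
    root = Path(workspace)
    return {
        "annotations": root / "CoVGT" / "datasets" / dataset,
        "features": root / "data" / dataset,
        "checkpoints": root / "data" / "save_models" / dataset,
    }


def missing_dirs(workspace, dataset="nextqa"):
    """List the layout entries that do not exist yet as directories."""
    layout = expected_layout(workspace, dataset)
    return [name for name, path in layout.items() if not path.is_dir()]


if __name__ == "__main__":
    # "workspace" is a placeholder; point it at your own workspace folder.
    for name in missing_dirs("workspace"):
        print(f"missing: {name} -> {expected_layout('workspace')[name]}")
```

An empty result means all three folders are in place; it does not verify that the downloaded files inside them are complete.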

Inference

./shell/next_test.sh 0

Evaluation

python eval_next.py --folder CoVGT_FTCoWV --mode test
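
For orientation, the accuracy figures reported in this README are percentages of correctly answered questions. The sketch below shows that computation on a toy example. The result format (a dictionary mapping a question ID to its "prediction" and "answer") is an assumption made for illustration; the actual files read by eval_next.py may be organized differently.

```python
def accuracy_percent(results):
    """Percentage of entries whose prediction equals the ground-truth answer.

    `results` maps a question ID to {"prediction": ..., "answer": ...}.
    This layout is hypothetical and chosen only for the example.
    """
    if not results:
        raise ValueError("no results to score")
    correct = sum(1 for r in results.values() if r["prediction"] == r["answer"])
    return 100.0 * correct / len(results)


if __name__ == "__main__":
    toy = {
        "q1": {"prediction": 0, "answer": 0},
        "q2": {"prediction": 3, "answer": 1},
        "q3": {"prediction": 2, "answer": 2},
        "q4": {"prediction": 4, "answer": 4},
    }
    print(f"{accuracy_percent(toy):.1f}")  # 3 of 4 correct
```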

Results and Resources

Table 1. VideoQA Accuracy (%) on Test Set.

Cross-Modal Pretrain	NExT-QA	Causal-VidQA	STAR	TGIF-QA (Action)	TGIF-QA (Trans)	TGIF-QA (FrameQA)	TGIF-QA-R* (Action)	TGIF-QA-R* (Trans)	MSRVTT-QA
-	59.4	59.1	44.0	94.7	97.6	61.6	60.8	73.8	38.3
WebVid0.18M	59.7	60.8	46.2	91.3	96.2	61.7	61.0	73.2	40.0
-	feats	feats	feats	feats	feats	feats	feats	feats	feats
-	videos	videos	videos	videos	videos	videos	videos	videos	videos
-	Q&A	Q&A	Q&A	Q&A	Q&A	Q&A	Q&A	Q&A	Q&A

(The feature files are identical to VGT. We have merged some files of the same dataset to avoid too many links.)
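
To make the two accuracy rows of Table 1 easier to compare, the change from WebVid0.18M pretraining can be computed per column. The numbers below are copied from the table; the variable and function names are only for this example.

```python
DATASETS = [
    "NExT-QA", "Causal-VidQA", "STAR",
    "TGIF-QA (Action)", "TGIF-QA (Trans)", "TGIF-QA (FrameQA)",
    "TGIF-QA-R* (Action)", "TGIF-QA-R* (Trans)", "MSRVTT-QA",
]
# Accuracy (%) rows from Table 1.
NO_PRETRAIN = [59.4, 59.1, 44.0, 94.7, 97.6, 61.6, 60.8, 73.8, 38.3]
WEBVID = [59.7, 60.8, 46.2, 91.3, 96.2, 61.7, 61.0, 73.2, 40.0]


def pretrain_deltas():
    """Accuracy change (percentage points) when pretraining is added."""
    return {
        name: round(with_pt - without_pt, 1)
        for name, without_pt, with_pt in zip(DATASETS, NO_PRETRAIN, WEBVID)
    }


if __name__ == "__main__":
    for name, delta in pretrain_deltas().items():
        print(f"{name}: {delta:+.1f}")
```

In this table, pretraining raises accuracy on six of the nine columns and lowers it on TGIF-QA (Action), TGIF-QA (Trans), and TGIF-QA-R* (Trans).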

Train

We have provided all the scripts in the folder 'shells'. Start training by specifying the GPU IDs after the script name. (If you have multiple GPUs, separate them with commas: ./shell/nextqa_train.sh 0,1)

./shell/nextqa_train.sh 0

This trains the model and saves it to the folder 'save_models/nextqa/CoVGT/'. You should get around 60.1% and 59.4% accuracy on the val and test sets, respectively.
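
The comma-separated GPU argument passed to the training script (for example 0,1) is commonly turned into a CUDA_VISIBLE_DEVICES setting. The sketch below illustrates that convention in Python. It is an assumption about how such an argument is typically handled, not a description of what the repo's shell scripts do, and the function names are hypothetical.

```python
import os


def parse_gpu_ids(arg):
    """Turn a string such as "0,1" into a list of integer GPU IDs."""
    parts = [part.strip() for part in arg.split(",") if part.strip()]
    if not parts or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected comma-separated GPU IDs, got {arg!r}")
    return [int(part) for part in parts]


def select_gpus(arg, env=None):
    """Write the chosen GPU IDs to CUDA_VISIBLE_DEVICES and return them.

    `env` defaults to os.environ; pass a plain dict to try it without
    touching the real environment.
    """
    env = os.environ if env is None else env
    ids = parse_gpu_ids(arg)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in ids)
    return ids


if __name__ == "__main__":
    scratch = {}
    print(select_gpus("0,1", scratch), scratch)
```

CUDA_VISIBLE_DEVICES only takes effect if it is set before the process initializes CUDA, which is why it is usually set in the launching shell or at the very top of a script.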

Result Visualization (NExT-QA)

Citations

@ARTICLE {xiao2023contrastive,
author = {Junbin Xiao and Pan Zhou and Angela Yao and Yicong Li and Richang Hong and Shuicheng Yan and Tat Seng Chua},
journal = {IEEE Transactions on Pattern Analysis \& Machine Intelligence},
title = {Contrastive Video Question Answering via Video Graph Transformer},
year = {2023},
volume = {45},
number = {11},
issn = {1939-3539},
pages = {13265-13280},
doi = {10.1109/TPAMI.2023.3292266},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {nov}
}

@inproceedings{xiao2022video,
  title={Video Graph Transformer for Video Question Answering},
  author={Xiao, Junbin and Zhou, Pan and Chua, Tat-Seng and Yan, Shuicheng},
  booktitle={European Conference on Computer Vision},
  pages={39--58},
  year={2022},
  organization={Springer}
}

Notes

If you use any resources from this repo, please kindly cite our paper and acknowledge the source.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.
