This repo is the official implementation of the paper Scene-Text Grounding for Text-Based Video Question Answering.
In this work, we propose a novel Grounded TextVideoQA task that requires models to answer questions and spatio-temporally localize the relevant scene texts, thus promoting a research trend towards interpretable QA. The task not only encourages visual evidence for answer predictions, but also isolates the challenges inherent in QA and scene-text recognition, enabling diagnosis of the root cause of a failed prediction, e.g., wrong QA or wrong scene-text recognition. To achieve grounded TextVideoQA, we propose a baseline model, T2S-QA, which features a disentangled temporal- and spatial-contrastive learning strategy for weakly-supervised grounding and grounded QA. Finally, to evaluate grounded TextVideoQA, we construct a new dataset, ViTXT-GQA, by extending the existing largest TextVideoQA dataset with answer grounding (spatio-temporal location) labels.
This repository provides the code for our paper, including:
- T2S-QA baseline model and ViTXT-GQA benchmark dataset.
- Data preprocessing and feature extraction scripts, as well as preprocessed features.
- Training and evaluation scripts and checkpoints.
Clone this repository and set up the environment with the following commands.
```bash
conda create -n vitxtgqa python==3.8
conda activate vitxtgqa
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=12.1 -c pytorch -c nvidia
git clone https://github.com/zhousheng97/vitxtgqa.git
cd ViTXT-GQA
pip install -r requirement.txt
python setup.py build develop
```
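Optionally, a quick sanity check (not part of the repo) to confirm that PyTorch was installed with working CUDA support:

```python
# Optional environment check; assumes the conda env created above is active.
import torch

print(torch.__version__)           # expect 2.1.1
print(torch.cuda.is_available())   # expect True on a machine with CUDA 12.1 drivers
print(torch.cuda.device_count())   # number of visible GPUs
```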
Please create a data folder root/data/ outside this repo folder root/ViTXT-GQA/, so that the two folders are in the same directory.
- Raw Video and Video Feature. You can directly download the provided video features (video feature path) or apply here to download the raw videos and extract the features yourself. If you download the raw videos, decode each video at 10 fps and then extract ViT frame features via the script ViTXT-GQA/tools/video_feat/obtain_vit_feat.py (see the sketch after this list). Place the extracted video features in data/fps10_video_vit_feat.
- OCR Detection and Recognition. Based on the OCR detector TransVTSpotter, we provide the recognition results of the OCR recognizers ABINet and CLIPOCR; the download links are vitxtgqa_abinet and vitxtgqa_clip.
- Dataset Annotation. We provide the dataset files here, including grounding annotation files, QA files, and vocabulary files. (Note: the extracted video frame ids start from 1, while the video frame and bounding box annotation ids in the grounding file start from 0.)
- Other. The fixed vocabulary is obtained with ViTXTGQA/pythia/scripts/extract_vocabulary.py.
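The repo's own extraction script is ViTXT-GQA/tools/video_feat/obtain_vit_feat.py; the following is only a minimal sketch of the 10 fps decoding and ViT feature extraction step, assuming OpenCV for decoding and a Hugging Face ViT backbone. The checkpoint name, batching, and .npy output layout are assumptions and may differ from the repo's script.

```python
# Minimal sketch of 10 fps frame decoding + ViT feature extraction.
# NOT the repo's obtain_vit_feat.py: the ViT checkpoint, batch size, and
# output layout below are illustrative assumptions.
import cv2
import numpy as np
import torch
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval().cuda()

def extract_vit_features(video_path, out_path, target_fps=10, batch_size=32):
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(src_fps / target_fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                       # keep ~10 frames per second
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()

    feats = []
    with torch.no_grad():
        for i in range(0, len(frames), batch_size):
            inputs = processor(images=frames[i:i + batch_size], return_tensors="pt").to("cuda")
            out = model(**inputs)
            feats.append(out.last_hidden_state[:, 0].cpu())   # per-frame [CLS] feature
    np.save(out_path, torch.cat(feats).numpy())               # (num_frames, hidden_dim)

# e.g. extract_vit_features("02669.mp4", "data/fps10_video_vit_feat/02669.npy")
```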
The repo structure is as below:
```
root
├── ViTXT-GQA
└── data
    ├── fps10_ocr_detection
    ├── fps10_ocr_detection_ClipOCR
    ├── fps10_video_vit_feat
    └── vitxtgqa
        ├── ground_annotation
        ├── qa_annotation
        └── vocabulary
```
The following is an example of a spatio-temporal grounding label file:
```json
[{
    "question_id": 12393,
    "video_id": "02669",
    "fps": 10.0,
    "frames": 61,
    "duration": 6.1,
    "height": 1080,
    "width": 1920,
    "spatial_temporal_gt": [
        {
            "temporal_gt": [5.1, 5.2],
            "bbox_gt": {
                "51": [799.2452830188679, 271.6981132075472, 858.1132075471698, 326.0377358490566],
                "52": [881.5686274509803, 295.1162790697674, 928.6274509803922, 357.906976744186]
            }
        }
    ]
}]
```
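For reference, here is a minimal sketch of iterating over such an entry. The field names follow the example above, but the file path and the [x1, y1, x2, y2] pixel box format are assumptions; the +1 frame offset follows the note in the Dataset Annotation item.

```python
# Minimal sketch of reading a grounding annotation file.
# The path below is hypothetical; field names follow the example above,
# and boxes are assumed to be [x1, y1, x2, y2] in pixels.
import json

with open("data/vitxtgqa/ground_annotation/val.json") as f:   # hypothetical file name
    annotations = json.load(f)

for ann in annotations:
    for gt in ann["spatial_temporal_gt"]:
        start_s, end_s = gt["temporal_gt"]                    # temporal span in seconds
        for frame_id, (x1, y1, x2, y2) in gt["bbox_gt"].items():
            # Annotation frame ids are 0-based while extracted frame ids are 1-based,
            # so annotation frame "51" corresponds to extracted frame 52.
            extracted_frame = int(frame_id) + 1
            print(ann["question_id"], extracted_frame, (x1, y1, x2, y2))
```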
The training and evaluation commands can be found in T2S/scripts, and the config files can be found in T2S/configs.
- Train the model on the training set:
```bash
# bash scripts/<train.sh> <GPU_ids> <save_dir>
bash scripts/train_t2s_abinet.sh 0,1 vitxtgqa_debug_abinet
```
- Evaluate the pretrained model on the validation/test sets:
```bash
# bash scripts/<val.sh> <GPU_ids> <save_dir> <checkpoint> <run_type>
bash scripts/val_t2s_abinet.sh 0,1 vitxtgqa_debug save/vitxtgqa_debug_abinet/vitxtgqa_t2s_13/best.ckpt val
bash scripts/val_t2s_abinet.sh 0,1 vitxtgqa_debug save/vitxtgqa_debug_abinet/vitxtgqa_t2s_13/best.ckpt inference
```
Note: you can access the checkpoints T2S_abinet and T2S_clipocr.
The model implementation of our T2S-QA is inspired by MMF. The dataset of our ViTXT-GQA is inspired by M4-ViteVQA.
If you find this work useful, please consider giving this repository a star and citing our paper as follows:
```bibtex
@article{zhou2024scene,
  title={Scene-Text Grounding for Text-Based Video Question Answering},
  author={Zhou, Sheng and Xiao, Junbin and Yang, Xun and Song, Peipei and Guo, Dan and Yao, Angela and Wang, Meng and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2409.14319},
  year={2024}
}
```