This repo is the official implementation of the paper Scene-Text Grounding for Text-Based Video Question Answering.
In this work, we propose a novel Grounded TextVideoQA task that requires models to answer questions and spatio-temporally localize the relevant scene texts, thus promoting a research trend towards interpretable QA. The task not only encourages visual evidence for answer predictions, but also isolates the challenges inherent in QA and scene-text recognition, enabling diagnosis of the root cause of a failed prediction, e.g., wrong QA or wrong scene-text recognition. To achieve grounded TextVideoQA, we propose a baseline model, T2S-QA, which features a disentangled temporal- and spatial-contrastive learning strategy for weakly-supervised grounding and grounded QA. Finally, to evaluate grounded TextVideoQA, we construct a new dataset, ViTXT-GQA, by extending the existing largest TextVideoQA dataset with answer grounding (spatio-temporal location) labels.
This repository provides the code for our paper, including:
- T2S-QA baseline model and ViTXT-GQA benchmark dataset.
- Data preprocessing and feature extraction scripts, as well as preprocessed features.
- Training and evaluation scripts and checkpoints.
Clone this repository and set up the environment with the following commands.
```bash
conda create -n vitxtgqa python==3.8
conda activate vitxtgqa
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=12.1 -c pytorch -c nvidia
git clone https://github.com/zhousheng97/vitxtgqa.git
cd ViTXT-GQA
pip install -r requirement.txt
python setup.py build develop
```
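Optionally, a quick sanity check (not part of the repo) to confirm that PyTorch was installed with working CUDA support:

```python
# Optional environment check; assumes the conda env created above is active.
import torch

print(torch.__version__)           # expect 2.1.1
print(torch.cuda.is_available())   # expect True on a machine with CUDA 12.1 drivers
print(torch.cuda.device_count())   # number of visible GPUs
```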
Please create a data folder root/data/ outside this repo folder root/ViTXT-GQA/, so that the two folders are in the same directory.
- Raw Video and Video Feature. You can directly download the provided video features (video feature path) or apply here to download the raw videos and extract the features yourself. If you download the raw videos, decode each video at 10 fps and then extract ViT frame features via the script ViTXT-GQA/tools/video_feat/obtain_vit_feat.py (see the sketch after this list). Place the extracted video features in data/fps10_video_vit_feat.
- OCR Detection and Recognition. Based on the OCR detector TransVTSpotter, we provide the recognition results of the OCR recognizers ABINet and CLIPOCR; the download links are vitxtgqa_abinet and vitxtgqa_clip.
- Dataset Annotation. We provide the dataset files here, including grounding annotation files, QA files, and vocabulary files. (Note: the extracted video frame ids start from 1, while the video frame and bounding box annotation ids in the grounding file start from 0.)
- Other. The fixed vocabulary is obtained with ViTXTGQA/pythia/scripts/extract_vocabulary.py.
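The repo's own extraction script is ViTXT-GQA/tools/video_feat/obtain_vit_feat.py; the following is only a minimal sketch of the 10 fps decoding and ViT feature extraction step, assuming OpenCV for decoding and a Hugging Face ViT backbone. The checkpoint name, batching, and .npy output layout are assumptions and may differ from the repo's script.

```python
# Minimal sketch of 10 fps frame decoding + ViT feature extraction.
# NOT the repo's obtain_vit_feat.py: the ViT checkpoint, batch size, and
# output layout below are illustrative assumptions.
import cv2
import numpy as np
import torch
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval().cuda()

def extract_vit_features(video_path, out_path, target_fps=10, batch_size=32):
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(src_fps / target_fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                       # keep ~10 frames per second
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()

    feats = []
    with torch.no_grad():
        for i in range(0, len(frames), batch_size):
            inputs = processor(images=frames[i:i + batch_size], return_tensors="pt").to("cuda")
            out = model(**inputs)
            feats.append(out.last_hidden_state[:, 0].cpu())   # per-frame [CLS] feature
    np.save(out_path, torch.cat(feats).numpy())               # (num_frames, hidden_dim)

# e.g. extract_vit_features("02669.mp4", "data/fps10_video_vit_feat/02669.npy")
```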
The repo structure is as below:
```
root
├── ViTXT-GQA
└── data
    ├── fps10_ocr_detection
    ├── fps10_ocr_detection_ClipOCR
    ├── fps10_video_vit_feat
    └── vitxtgqa
        ├── ground_annotation
        ├── qa_annotation
        └── vocabulary
```
The following is an example of a spatio-temporal grounding label file:
```json
[{
    "question_id": 12393,
    "video_id": "02669",
    "fps": 10.0,
    "frames": 61,
    "duration": 6.1,
    "height": 1080,
    "width": 1920,
    "spatial_temporal_gt": [
        {
            "temporal_gt": [5.1, 5.2],
            "bbox_gt": {
                "51": [799.2452830188679, 271.6981132075472, 858.1132075471698, 326.0377358490566],
                "52": [881.5686274509803, 295.1162790697674, 928.6274509803922, 357.906976744186]
            }
        }
    ]
}]
```
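For reference, here is a minimal sketch of iterating over such an entry. The field names follow the example above, but the file path and the [x1, y1, x2, y2] pixel box format are assumptions; the +1 frame offset follows the note in the Dataset Annotation item.

```python
# Minimal sketch of reading a grounding annotation file.
# The path below is hypothetical; field names follow the example above,
# and boxes are assumed to be [x1, y1, x2, y2] in pixels.
import json

with open("data/vitxtgqa/ground_annotation/val.json") as f:   # hypothetical file name
    annotations = json.load(f)

for ann in annotations:
    for gt in ann["spatial_temporal_gt"]:
        start_s, end_s = gt["temporal_gt"]                    # temporal span in seconds
        for frame_id, (x1, y1, x2, y2) in gt["bbox_gt"].items():
            # Annotation frame ids are 0-based while extracted frame ids are 1-based,
            # so annotation frame "51" corresponds to extracted frame 52.
            extracted_frame = int(frame_id) + 1
            print(ann["question_id"], extracted_frame, (x1, y1, x2, y2))
```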
The training and evaluation commands can be found in T2S/scripts, and the config files can be found in T2S/configs.
- Train the model on the training set:
```bash
# bash scripts/<train.sh> <GPU_ids> <save_dir>
bash scripts/train_t2s_abinet.sh 0,1 vitxtgqa_debug_abinet
```
- Evaluate the pretrained model on the validation/test sets:
```bash
# bash scripts/<val.sh> <GPU_ids> <save_dir> <checkpoint> <run_type>
bash scripts/val_t2s_abinet.sh 0,1 vitxtgqa_debug save/vitxtgqa_debug_abinet/vitxtgqa_t2s_13/best.ckpt val
bash scripts/val_t2s_abinet.sh 0,1 vitxtgqa_debug save/vitxtgqa_debug_abinet/vitxtgqa_t2s_13/best.ckpt inference
```
Note: you can access the checkpoints T2S_abinet and T2S_clipocr.
The model implementation of our T2S-QA is inspired by MMF. The dataset of our ViTXT-GQA is inspired by M4-ViteVQA.
If you find this work useful, please consider giving this repository a star and citing our paper as follows:
```bibtex
@article{zhou2024scene,
  title={Scene-Text Grounding for Text-Based Video Question Answering},
  author={Zhou, Sheng and Xiao, Junbin and Yang, Xun and Song, Peipei and Guo, Dan and Yao, Angela and Wang, Meng and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2409.14319},
  year={2024}
}
```