ViTXT-GQA: Scene-Text Grounding for Text-Based Video Question Answering


This repo is the official implementation of the paper Scene-Text Grounding for Text-Based Video Question Answering.

Introduction

In this work, we propose a novel Grounded TextVideoQA task that requires models to answer questions and spatio-temporally localize the relevant scene texts, promoting a research trend towards interpretable QA. The task not only encourages visual evidence for answer predictions, but also isolates the challenges inherent in QA and scene-text recognition, enabling diagnosis of the root cause of failed predictions, e.g., wrong QA or wrong scene-text recognition. To achieve grounded TextVideoQA, we propose a baseline model, T2S-QA, which features a disentangled temporal- and spatial-contrastive learning strategy for weakly-supervised grounding and grounded QA. Finally, to evaluate grounded TextVideoQA, we construct a new dataset, ViTXT-GQA, by extending the largest existing TextVideoQA dataset with answer grounding (spatio-temporal location) labels.

This repository provides the code for our paper, including:

  • T2S-QA baseline model and ViTXT-GQA benchmark dataset.
  • Data preparation instructions, including preprocessing and feature extraction scripts, as well as preprocessed features.
  • Training and evaluation scripts and checkpoints.

Installation

Clone this repository and set up the environment with the following commands.

conda create -n vitxtgqa python==3.8
conda activate vitxtgqa
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=12.1 -c pytorch -c nvidia

git clone https://github.com/zhousheng97/vitxtgqa.git
cd ViTXT-GQA
pip install -r requirement.txt
python setup.py build develop

Data Preparation

Please create a data folder root/data/ alongside the repo folder root/ViTXT-GQA/ so that the two folders share the same parent directory.

  • Raw Video and Video Feature. You can directly download the provided video features (video feature path) or apply here to download the raw videos and extract the features yourself. If you download the raw videos, decode each video at 10 fps and extract per-frame ViT features via the script provided in ViTXT-GQA/tools/video_feat/obtain_vit_feat.py (see the illustrative sketch after this list). Place the extracted video features in data/fps10_video_vit_feat.

  • OCR Detection and Recognition. Based on the OCR detector TransVTSpotter, we provide the recognition results of the OCR recognition systems ABINet and CLIPOCR; the download links are vitxtgqa_abinet and vitxtgqa_clip.

  • Dataset Annotation. We provide the dataset files here, including grounding annotation files, QA files, and vocabulary files. (Note: the extracted video frame ids start from 1, while the frame and bounding-box ids in the grounding annotation files start from 0.)

  • Other. The fixed vocabulary is obtained with ViTXT-GQA/pythia/scripts/extract_vocabulary.py.
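
If you extract the ViT features yourself, the sketch below illustrates one possible pipeline: decode frames at 10 fps with OpenCV and pool a per-frame feature with a ViT from timm. This is a minimal sketch, not the repo's official script (that is ViTXT-GQA/tools/video_feat/obtain_vit_feat.py); the model name, sampling logic, and output layout are assumptions.

# Illustrative 10-fps frame decoding + ViT feature extraction (assumptions noted above).
import cv2
import numpy as np
import timm
import torch
from PIL import Image

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0).eval().cuda()
config = timm.data.resolve_data_config({}, model=model)
preprocess = timm.data.create_transform(**config)

def extract_vit_features(video_path, target_fps=10):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)  # keep every `step`-th frame
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            batch = preprocess(rgb).unsqueeze(0).cuda()
            with torch.no_grad():
                feats.append(model(batch).squeeze(0).cpu().numpy())  # pooled ViT feature per frame
        idx += 1
    cap.release()
    return np.stack(feats)  # shape: (num_frames_at_10fps, feature_dim)

# Example usage (hypothetical paths):
# np.save("../data/fps10_video_vit_feat/02669.npy", extract_vit_features("raw_videos/02669.mp4"))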

The repo structure is as follows:

root
├── ViTXT-GQA
└── data
    ├── fps10_ocr_detection
    ├── fps10_ocr_detection_ClipOCR
    ├── fps10_video_vit_feat
    └── vitxtgqa
        ├── ground_annotation
        ├── qa_annotation
        └── vocabulary

The following is an example of a spatio-temporal grounding label file:

[{
      "question_id": 12393,
      "video_id": "02669",
      "fps": 10.0,
      "frames": 61,
      "duration": 6.1,
      "height": 1080,
      "width": 1920,
      "spatial_temporal_gt": [
      {
          "temporal_gt": [
              5.1,
              5.2
          ],
          "bbox_gt": {
              "51": [
                  799.2452830188679,
                  271.6981132075472,
                  858.1132075471698,
                  326.0377358490566
              ],
              "52": [
                  881.5686274509803,
                  295.1162790697674,
                  928.6274509803922,
                  357.906976744186
              ]
          }
      }]
  }]
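
As a reading aid, the snippet below parses this annotation format and aligns the 0-based frame keys in bbox_gt with the 1-based ids of the extracted frames, per the note in Data Preparation. It is a minimal sketch; the file path and the box coordinate convention (assumed to be x1, y1, x2, y2 in pixels) are assumptions, not guaranteed by the repo.

# Illustrative parser for the grounding annotation format shown above.
import json

# Path is hypothetical; point it at your actual grounding annotation file.
with open("../data/vitxtgqa/ground_annotation/val_grounding.json") as f:
    annotations = json.load(f)

for ann in annotations:
    fps = ann["fps"]  # e.g. 10.0
    for gt in ann["spatial_temporal_gt"]:
        start_sec, end_sec = gt["temporal_gt"]      # e.g. [5.1, 5.2] seconds
        start_frame = int(round(start_sec * fps))   # 5.1 s * 10 fps -> frame key "51" (0-based)
        for frame_key, box in gt["bbox_gt"].items():
            x1, y1, x2, y2 = box                    # coordinate convention assumed, see lead-in
            extracted_frame_id = int(frame_key) + 1  # annotation keys are 0-based, extracted frames 1-based
            print(ann["question_id"], extracted_frame_id, (x1, y1, x2, y2))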

Training and Evaluation

The training and evaluation commands can be found in T2S/scripts, and the config files in T2S/configs.

  • Train the model on the training set:
# bash scripts/<train.sh> <GPU_ids> <save_dir>

bash scripts/train_t2s_abinet.sh 0,1 vitxtgqa_debug_abinet
  • Evaluate the pretrained model on the validation/test sets:
# bash scripts/<val.sh> <GPU_ids> <save_dir> <checkpoint> <run_type>

bash scripts/val_t2s_abinet.sh 0,1 vitxtgqa_debug save/vitxtgqa_debug_abinet/vitxtgqa_t2s_13/best.ckpt val

bash scripts/val_t2s_abinet.sh 0,1 vitxtgqa_debug save/vitxtgqa_debug_abinet/vitxtgqa_t2s_13/best.ckpt inference

Note: you can access the checkpoints T2S_abinet and T2S_clipocr.

Visualization (ViTXT-GQA)


Acknowledgements

The model implementation of our T2S-QA is inspired by MMF. The dataset of our ViTXT-GQA is inspired by M4-ViteVQA.

Citation

If you find this work useful, consider giving this repository a star and citing our paper as follows:

@article{zhou2024scene,
  title={Scene-Text Grounding for Text-Based Video Question Answering},
  author={Zhou, Sheng and Xiao, Junbin and Yang, Xun and Song, Peipei and Guo, Dan and Yao, Angela and Wang, Meng and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2409.14319},
  year={2024}
}
