GSRFormer is an approach for grounded situation recognition (GSR) that aims to mimic human-like understanding of visual scenes. While machines can detect objects and classify images well, interpreting the narrative and semantics conveyed in an image remains challenging.
GSRFormer seeks to advance GSR by modeling not just primary actions, but the associated entities and roles that form a cohesive visual situation.
- Alternating Learning Scheme: Uses an innovative bidirectional learning process between verbs and nouns to ensure a holistic semantic understanding beyond unidirectional interpretations.
- Pseudo Labeling: Initially assumes pseudo labels for semantic roles to focus directly on learning intermediate representations from images, avoiding verb ambiguity issues in conventional GSR.
- Support Images: Leverages supplementary images during training to refine verbs using corresponding nouns and vice versa, enhancing generalization.
For more details, please refer to our ACM Multimedia 2022 paper.
GSRFormer achieves state-of-the-art performance on two benchmark GSR datasets, advancing scene understanding and narrative interpretation capabilities.
- Conda
- PyTorch
- Start by cloning the repository:
git clone https://github.com/zhiqic/GSRFormer.git
cd GSRFormer
- Create and activate the Conda environment:
conda create --name GSRFormer python=3.9
conda activate GSRFormer
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge
- Install the packages required for the project:
pip install -r requirements.txt
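After installation, it can help to confirm that the pinned versions actually landed in the active environment. The following is a minimal sketch using only the standard library. The helper name `check_versions` is hypothetical and not part of the repository. The expected versions are copied from the `conda install` command above, and the sketch assumes the PyTorch package exposes its metadata under the distribution name `torch`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the conda install command in this README.
# Assumption: the packages register metadata as "torch" and "torchvision".
EXPECTED = {"torch": "1.8.0", "torchvision": "0.9.0"}


def check_versions(expected):
    """Return {package: problem} for packages that are missing or mismatched."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Builds may carry a local suffix such as "+cu111"; compare the base version only.
        if found.split("+")[0] != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_versions(EXPECTED)
    if not issues:
        print("Environment matches the pinned versions.")
    for name, problem in issues.items():
        print(f"{name}: {problem}")
```

An empty result means both packages were found at the expected versions. Anything else is listed with the reason.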
GSRFormer is trained and evaluated on the SWiG dataset:
- Annotations: Found in "SWiG/SWiG_jsons/".
- Images: Download them here and place them in "SWiG/images_512/".
- Images: "SWiG/images_512/"
- Training Set: train.json
- Development Set: dev.json
- Testing Set: test.json
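Before starting a long training run, a quick check that the dataset is laid out as described can save time. The sketch below is illustrative only. The function name `missing_dataset_paths` is hypothetical, and it assumes the three split files sit directly inside "SWiG/SWiG_jsons/" relative to the repository root.

```python
from pathlib import Path

# Paths as listed in this README, relative to the repository root.
# Assumption: the split files live directly inside the annotation directory.
ANNOTATION_DIR = Path("SWiG/SWiG_jsons")
IMAGE_DIR = Path("SWiG/images_512")
SPLIT_FILES = ("train.json", "dev.json", "test.json")


def missing_dataset_paths(root="."):
    """Return the expected SWiG paths that do not exist under root."""
    root = Path(root)
    expected = [root / IMAGE_DIR]
    expected += [root / ANNOTATION_DIR / name for name in SPLIT_FILES]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("SWiG layout looks complete.")
```

Run it from the repository root. An empty list means the image directory and all three annotation files were found.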
Kickstart the training with:
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py \
--backbone resnet50 --dataset_file swig \
--encoder_epochs 20 --decoder_epochs 25 \
--preprocess True \
--num_workers 4 --num_enc_layers 6 --num_dec_layers 5 \
--dropout 0.15 --hidden_dim 512 --output_dir GSRFormer
Assess your model using:
python main.py --output_dir GSRFormer --dev
python main.py --output_dir GSRFormer --test
For real-time application on custom images:
python inference.py --image_path inference/filename.jpg \
--output_dir inference
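To process a whole folder instead of a single file, the command above can be wrapped in a small loop. This is a sketch under assumptions, not part of the repository. The names `build_inference_commands` and `run_batch` are hypothetical, the set of accepted file suffixes is a guess, and only the two flags shown above (`--image_path`, `--output_dir`) are used.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: these suffixes are a guess at what inference.py can read.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_inference_commands(image_dir, output_dir="inference"):
    """Build one inference.py command per image file found in image_dir."""
    images = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return [
        [sys.executable, "inference.py",
         "--image_path", str(p),
         "--output_dir", output_dir]
        for p in images
    ]


def run_batch(image_dir, output_dir="inference"):
    """Run the commands one by one; stop at the first failure."""
    for command in build_inference_commands(image_dir, output_dir):
        subprocess.run(command, check=True)
```

Calling `run_batch("inference")` from the repository root would run the script once per image in that folder. Each call reloads the model, so this is convenient rather than fast.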
Thank you to the authors of the CoFormer repository for providing an excellent codebase that enabled our advancements. We sincerely appreciate the support of Microsoft Research throughout this project.
Enriching the AI community is our goal. If you build upon this work, please cite:
@inproceedings{cheng2022gsrformer,
title={GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement},
author={Cheng, Zhi-Qi and Dai, Qi and Li, Siyao and Mitamura, Teruko and Hauptmann, Alexander},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
pages={3272--3281},
year={2022}
}
Refer to the Apache 2.0 license provided in LICENSE for usage details.