
[NeurIPS 2022 Spotlight] RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection


Updates

  • News💥! The follow-up work RLIPv2: Fast Scaling of Relational Language-Image Pre-training has been accepted to ICCV 2023. Its code has been released in the RLIPv2 repo.
  • Update on Jan. 19th, 2023: I am uploading the code. Note that I changed all the paths to prevent possible information leakage. In order to run the code, you will need to configure the paths to match your own system: search for the "/PATH/TO" placeholder in the code and replace it with the appropriate file path on your system (see the snippet after this list). ⭐⭐⭐Consider starring the repo! ⭐⭐⭐
  • Update on Jan. 16th, 2023: I have uploaded the annotations and checkpoints.
  • Update on Dec. 12th, 2022: The code is under pre-release review at Alibaba Group and will be made public as soon as possible.
  • News💥! RLIP: Relational Language-Image Pre-training has been accepted to NeurIPS 2022 as a Spotlight presentation (top 5%)! Hope you will enjoy reading it.
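
For reference, a minimal sketch for locating the placeholder paths that need configuring (assuming a Unix-like environment with grep available; adjust the file patterns as needed):

cd /PATH/TO/RLIP
# List every occurrence of the "/PATH/TO" placeholder in Python and shell files
grep -rn "/PATH/TO" --include="*.py" --include="*.sh" .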

Todo List

Note that if you cannot access the links provided below, try another browser or contact me by e-mail.

  • 🎉 Release annotations for VG pre-training, HICO-DET few-shot, zero-shot and relation label noise.
  • 🎉 Release checkpoints for pre-training, few-shot, zero-shot and fine-tuning.
  • 🎉 Release code for pre-training, fine-tuning and inference.
  • 🎉 Include support for inference on custom images.
  • 🕘 Include support for Scene Graph Generation. (It has been supported in RLIPv2.)

Model Outline

This repo contains implementations of various methods for HOI detection (not limited to RLIP), aiming to serve as a benchmark for HOI detection. The following methods are included in this repo:

  • RLIP-ParSe (model name in the repo: RLIP-ParSe);
  • ParSe (model name in the repo: ParSe);
  • RLIP-ParSeD (model name in the repo: RLIP-ParSeD);
  • ParSeD (model name in the repo: ParSeD);
  • OCN (model name in the repo: OCN), which is a prior work of RLIP;
  • QPIC (model name in the repo: DETRHOI);
  • QAHOI (model name in the repo: DDETRHOI);
  • CDN (model name in the repo: CDN).

Citation

If you find our work inspiring or our code/annotations useful to your research, please cite:

@inproceedings{Yuan2022RLIP,
  title={RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection},
  author={Yuan, Hangjie and Jiang, Jianwen and Albanie, Samuel and Feng, Tao and Huang, Ziyuan and Ni, Dong and Tang, Mingqian},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022}
}

@inproceedings{Yuan2023RLIPv2,
  title={RLIPv2: Fast Scaling of Relational Language-Image Pre-training},
  author={Yuan, Hangjie and Zhang, Shiwei and Wang, Xiang and Albanie, Samuel and Pan, Yining and Feng, Tao and Jiang, Jianwen and Ni, Dong and Zhang, Yingya and Zhao, Deli},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}

@inproceedings{Yuan2022OCN,
  title={Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics},
  author={Hangjie Yuan and Mang Wang and Dong Ni and Liangpeng Xu},
  booktitle={AAAI},
  year={2022}
}

Inference on Custom Images

To facilitate inference on custom images without annotations, I have implemented code that supports this. For the pre-trained model, I use the best-performing RLIP-ParSe. To begin, place your images in the folder custom_imgs. Then run the code below:

cd /PATH/TO/RLIP
# RLIP-ParSe
bash scripts/Inference_on_custom_imgs.sh

After successfully running the code, the generated results will be available in the folder custom_imgs/result. I have tested the code on a single Tesla A100 with batch sizes of 1, 2, 3 and 4. Note that, by default, we save all detection results (64 pairs). In most cases, you will want to set a threshold on the verb_scores (the product of relation scores and object scores) in the saved results before using them in your own work, and tune that threshold for your application. As a recommendation, 0.25 (0.5*0.5) is a good starting point.

Annotation Preparation

Dataset Setting Download
VG Pre-training Link
HICO-DET Few-shot 1%, 10% Link
HICO-DET Zero-shot (UC-NF, UC-RF)* Link
HICO-DET Relation Label Noise (10%, 30%, 50%) Link

Note: ① * The zero-shot NF setting does not need any HICO-DET annotations for fine-tuning, so we only provide training annotations for the UC-NF and UC-RF settings.

Pre-training Dataset (Visual Genome) preparation

First, download the VG dataset from the official link, including images Part I and Part II. (Note: if the official website is not working, you can use the links that I provide: Images and Images2.) The annotations after pre-processing can be downloaded from the link above and are used for pre-training. Note that they are generated from the scene_graphs.json file by several pre-processing steps that remove redundant triplets. Several of the settings mentioned below also need the annotations that we provide. The VG dataset and its corresponding annotations should be organized as follows (a sketch of the preparation commands is given after the directory layout):

VG
 |─ annotations
 |   |─ scene_graphs_after_preprocessing.json
 |   :
 |─ images
 |   |─ 2409818.jpg
 |   |─ n102412.jpg
 :   :
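
A minimal sketch of the preparation steps, assuming the two image archives are named images.zip and images2.zip (the actual archive names may differ; if the archives extract into subfolders such as VG_100K, move the images so they sit directly under images/ as shown above):

cd /PATH/TO/VG
mkdir -p annotations images
# Extract both image parts into the images folder
unzip /PATH/TO/images.zip -d images
unzip /PATH/TO/images2.zip -d images
# Place the pre-processed annotations downloaded from the link above
mv /PATH/TO/scene_graphs_after_preprocessing.json annotations/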

Downstream Dataset preparation

1. HICO-DET

The HICO-DET dataset can be downloaded here. After downloading, unpack the tarball (hico_20160224_det.tar.gz) into the data directory.
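
A minimal sketch of the unpacking step, assuming the tarball has been downloaded into the repository root (adjust paths to your setup):

cd /PATH/TO/RLIP
mkdir -p data
# Extract HICO-DET into the data directory
tar -xzf hico_20160224_det.tar.gz -C data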

Instead of using the original annotation files, we use the annotation files provided by the PPDM authors. These can be downloaded from here. The downloaded annotation files have to be placed as follows.

qpic
 |─ data
 │   └─ hico_20160224_det
 |       |─ annotations
 |       |   |─ trainval_hico.json
 |       |   |─ test_hico.json
 |       |   └─ corre_hico.npy
 :       :

2. V-COCO

First clone the V-COCO repository from here, and then follow its instructions to generate the file instances_vcoco_all_2014.json (a rough sketch of these steps is given after the directory layout below). Next, download the prior file prior.pickle from here. Place the files and make directories as follows.

qpic
 |─ data
 │   └─ v-coco
 |       |─ data
 |       |   |─ instances_vcoco_all_2014.json
 |       |   :
 |       |─ prior.pickle
 |       |─ images
 |       |   |─ train2014
 |       |   |   |─ COCO_train2014_000000000009.jpg
 |       |   |   :
 |       |   └─ val2014
 |       |       |─ COCO_val2014_000000000042.jpg
 |       |       :
 |       |─ annotations
 :       :
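
A rough sketch of the V-COCO preparation steps (the clone URL points to the official V-COCO repository; the generation of instances_vcoco_all_2014.json and the download of prior.pickle follow that repository's instructions and the links above):

cd /PATH/TO/RLIP
# Clone V-COCO into the data directory
git clone https://github.com/s-gupta/v-coco.git data/v-coco
cd data/v-coco
# Follow the V-COCO repository's instructions here to generate data/instances_vcoco_all_2014.json,
# then place prior.pickle and the COCO train2014/val2014 images as shown in the layout above.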

The annotation file has to be converted to the HOIA format. The conversion can be conducted as follows.

PYTHONPATH=data/v-coco \
        python convert_vcoco_annotations.py \
        --load_path data/v-coco/data \
        --prior_path data/v-coco/prior.pickle \
        --save_path data/v-coco/annotations

Note that only Python 2 can be used for this conversion, because vsrl_utils.py in the v-coco repository raises an error with Python 3.

V-COCO annotations in the HOIA format (corre_vcoco.npy, test_vcoco.json, and trainval_vcoco.json) will be generated in the annotations directory.

RLIP Pre-training

Since RLIP is pre-trained on the VG and COCO datasets, we provide a series of pre-trained weights for you to use. The weights in the table below are used to initialize ParSe/ParSeD/RLIP-ParSe/RLIP-ParSeD for pre-training or fine-tuning.

Model Pre-training Paradigm Pre-training Dataset Backbone Base Detector Download
MDETR-ParSe Modulated Detection GoldG+ ResNet-101 DETR Link
ParSeD Object Detection VG ResNet-50 DDETR Link
ParSeD Object Detection COCO ResNet-50 DDETR Link
ParSe Object Detection COCO ResNet-50 DETR Link (Query128), Link (Query200)
ParSe Object Detection COCO ResNet-101 DETR Link (Query128)
RLIP-ParSeD RLIP VG ResNet-50 DDETR Link
RLIP-ParSeD RLIP COCO + VG ResNet-50 DDETR Link
RLIP-ParSe RLIP COCO + VG ResNet-50 DETR Link

The weights in the first, third and fourth rows are produced from and converted out of the original codebases; for further reference, you can visit DDETR and MDETR. For the last three models' weights, you can optionally pre-train the models yourself by running the corresponding script:

cd /PATH/TO/RLIP
# RLIP-ParSe
bash scripts/Pre-train_RLIP-ParSe_VG.sh
# RLIP-ParSeD
bash scripts/Pre-train_RLIP-ParSeD_VG.sh

Note that the above scripts contain the installation of dependencies, which can be done independently. For the --pretrained parameter in the scripts, you can omit it to pre-train from scratch, or set it to the ParSeD parameters pre-trained on COCO (see the sketch below).
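
A minimal sketch of the two options for --pretrained inside the pre-training scripts (the checkpoint file name below is a placeholder for the COCO-pre-trained ParSeD weights from the table above):

# Inside scripts/Pre-train_RLIP-ParSeD_VG.sh (sketch):
# Option 1: pre-train from scratch by removing the --pretrained line from the argument list.
# Option 2: initialize from COCO-pre-trained ParSeD weights (placeholder file name):
#     --pretrained /PATH/TO/ParSeD_COCO_checkpoint.pth \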

1. Fully-finetuning

Weights in the table below are fully fine-tuned weights of ParSe/ParSeD/RLIP-ParSe/RLIP-ParSeD, initialized with the pre-trained weights from the table above.

Model Pre-training Paradigm Pre-training Dataset Backbone Base Detector Full / Rare / Non-Rare (mAP) Download
ParSeD RLIP COCO ResNet-50 DDETR 29.12 / 22.23 / 31.17 Link
ParSe RLIP COCO ResNet-50 DETR 31.79 / 26.36 / 33.41 Link
ParSe RLIP COCO ResNet-101 DETR 32.76 / 28.59 / 34.01 Link
RLIP-ParSeD RLIP VG ResNet-50 DDETR 29.21 / 24.45 / 30.63 Link
RLIP-ParSeD RLIP COCO + VG ResNet-50 DDETR 30.70 / 24.67 / 32.50 Link
RLIP-ParSe RLIP COCO + VG ResNet-50 DETR 32.84 / 26.85 / 34.63 Link

2. Few-shot (0, 1%, 10%)

The scripts are identical to those for full fine-tuning. The major difference is that we add --few_shot_transfer 10 \ for the 10% few-shot transfer setting or --few_shot_transfer 1 \ for the 1% setting. Note that we only fine-tune for 10 epochs, with the lr dropping at the 7th epoch; thus, you need to change --lr_drop and --epochs in the script accordingly (see the sketch after the commands below).

cd /PATH/TO/RLIP
# RLIP-ParSeD on HICO
bash scripts/Fine-tune_RLIP-ParSeD_HICO.sh
# RLIP-ParSe on HICO
bash scripts/Fine-tune_RLIP-ParSe_HICO.sh
# ParSe on HICO
bash scripts/Fine-tune_ParSe_HICO.sh
# ParSeD on HICO
bash scripts/Fine-tune_ParSeD_HICO.sh
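
A minimal sketch of the lines to adjust inside the chosen fine-tuning script for few-shot transfer (flag names and the schedule are taken from the text above; keep the remaining arguments as they are in the script):

# Inside, e.g., scripts/Fine-tune_RLIP-ParSe_HICO.sh (sketch):
# For the 10% setting (use 1 for the 1% setting):
#     --few_shot_transfer 10 \
#     --epochs 10 \
#     --lr_drop 7 \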

When no extra data is provided (0% few-shot transfer), please refer to the zero-shot NF setting; its performance is presented here for completeness.

Model Pre-training Paradigm Pre-training Dataset Backbone Base Detector Data Full / Rare / Non-Rare (mAP) Download
RLIP-ParSeD RLIP COCO + VG ResNet-50 DDETR 0 13.92 / 11.20 / 14.73 Link*
RLIP-ParSeD RLIP COCO + VG ResNet-50 DDETR 1% 18.30 / 16.22 / 18.92 Link
RLIP-ParSeD RLIP COCO + VG ResNet-50 DDETR 10% 22.09 / 15.89 / 23.94 Link
RLIP-ParSe RLIP COCO + VG ResNet-50 DETR 0 15.40 / 15.08 / 15.50 Link*
RLIP-ParSe RLIP COCO + VG ResNet-50 DETR 1% 18.46 / 17.47 / 18.76 Link
RLIP-ParSe RLIP COCO + VG ResNet-50 DETR 10% 22.59 / 20.16 / 23.32 Link

Note: ① * means that the checkpoints are the same as the ones in the RLIP Pre-training table, since they do not involve any fine-tuning.

3. Zero-shot (NF, UC-RF, UC-NF)

The NF setting is simply a testing procedure after loading the pre-trained weights. We can run the script below.

cd /PATH/TO/RLIP
# Zero-shot NF setting with RLIP-ParSe/RLIP-ParSeD
bash scripts/NF_Zero_shot.sh

For the UC-RF and UC-NF settings, training is required. We can run the scripts below after adding --zero_shot_setting UC-RF \ or --zero_shot_setting UC-NF \. Note that for the UC-NF setting, we only fine-tune for 40 epochs (with the lr dropping at the 30th epoch) to avoid overfitting; thus, you need to change --lr_drop 30 \ and --epochs 40 \ in the script accordingly (see the sketch after the commands below).

cd /PATH/TO/RLIP
# RLIP-ParSeD on HICO
bash scripts/Fine-tune_RLIP-ParSeD_HICO.sh
# RLIP-ParSe on HICO
bash scripts/Fine-tune_RLIP-ParSe_HICO.sh
# ParSe on HICO
bash scripts/Fine-tune_ParSe_HICO.sh
# ParSeD on HICO
bash scripts/Fine-tune_ParSeD_HICO.sh
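
A minimal sketch of the lines to adjust inside the chosen fine-tuning script for the UC-NF setting (flag names and the schedule are taken from the text above; for UC-RF, only the --zero_shot_setting line changes and the default schedule is kept):

# Inside, e.g., scripts/Fine-tune_RLIP-ParSe_HICO.sh (sketch):
#     --zero_shot_setting UC-NF \
#     --epochs 40 \
#     --lr_drop 30 \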

Model Pre-training Paradigm Pre-training Dataset Backbone Base Detector Setting Full / Rare / Non-Rare (mAP) Download
RLIP-ParSe RLIP COCO + VG ResNet-50 DETR UC-RF 30.52 / 19.19 / 33.35 Link
RLIP-ParSe RLIP COCO + VG ResNet-50 DETR UC-NF 26.19 / 20.27 / 27.67 Link

Evaluation

The mAP on HICO-DET under the Full set, Rare set and Non-Rare set is reported during the training process.

The results for the official V-COCO evaluation must be obtained from a pickle file of detection results, which can be generated as follows.

cd /PATH/TO/RLIP
python generate_vcoco_official.py \
        --param_path /PATH/TO/CHECKPOINT \
        --save_path vcoco.pickle \
        --hoi_path /PATH/TO/VCOCO/DATA

Then run the following command after modifying the relevant paths, to get the final performance:

cd /PATH/TO/RLIP
python datasets/vsrl_eval.py

Acknowledgement

Part of this work's implementation refers to several prior works, including OCN, QPIC, CDN, DETR, DDETR and MDETR.
