This repository implements the boundary head proposed in the paper:
Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett, Centre Stage: Centricity-based Audio-Visual Temporal Action Detection, in The 1st Workshop in Video Understanding and its Applications (VUA), 2023
This repository is based on ActionFormer.
If you use this code, please cite:
@INPROCEEDINGS{Hanaudiovisual,
  author={Wang, Hanyuan and Mirmehdi, Majid and Damen, Dima and Perrett, Toby},
  booktitle={The 1st Workshop in Video Understanding and its Applications (VUA 2023)},
  title={Centre Stage: Centricity-based Audio-Visual Temporal Action Detection},
  year={2023}}
- Python 3.5+
- PyTorch 1.11
- CUDA 11.0+
- GCC 4.9+
- TensorBoard
- NumPy 1.11+
- PyYAML
- Pandas
- h5py
- joblib
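If you want a quick sanity check that the environment matches the list above, a minimal sketch (import names only; version pins are not enforced here):

```python
# Minimal environment check; package names follow the requirement list above.
import torch
import numpy
import yaml      # provided by PyYAML
import pandas
import h5py
import joblib

print("PyTorch:", torch.__version__)                  # expect 1.11.x
print("CUDA available:", torch.cuda.is_available())   # expect True with CUDA 11.0+
print("NumPy:", numpy.__version__)                    # expect 1.11+
```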
Compile the NMS code:
cd ./libs/utils
python setup.py install --user
cd ../..
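After installation, you can check that the compiled extension is importable. The module name `nms_1d_cpu` below is an assumption carried over from ActionFormer; check `./libs/utils/setup.py` for the actual extension name:

```python
# Sanity check for the compiled NMS extension.
# The module name "nms_1d_cpu" is an assumption (ActionFormer convention);
# see ./libs/utils/setup.py for the real name.
import importlib

try:
    importlib.import_module("nms_1d_cpu")
    print("NMS extension built and importable")
except ImportError as err:
    print("NMS extension not found:", err)
```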
You can download the annotation repository of EPIC-KITCHENS-100 here. Place it in ./data/visual_feature/epic_kitchens/annotations.
You can download the videos of EPIC-KITCHENS-100 here.
You can download the visual features for EPIC-KITCHENS-100 here. Place them in ./data/visual_feature/epic_kitchens/features.
You can extract the audio features for EPIC-KITCHENS-100 by following this repository: here. Place the extracted features in ./data/audio_feature/extracted_features_retrain_small_win.
If everything goes well, the folder structure of ./data should look like this:
data
├── audio_feature
│   └── extracted_features_retrain_small_win
└── visual_feature
    └── epic_kitchens
        ├── features
        └── annotations
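Before training, you can verify this layout with a short check (paths taken from the tree above):

```python
# Verify the expected ./data layout; paths mirror the tree above.
import os

expected = [
    "./data/audio_feature/extracted_features_retrain_small_win",
    "./data/visual_feature/epic_kitchens/features",
    "./data/visual_feature/epic_kitchens/annotations",
]
for path in expected:
    print("ok     " if os.path.isdir(path) else "MISSING", path)
```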
You can download our pretrained models on EPIC-KITCHENS-100 here.
To train the model, run:
python ./train.py ./configs/epic_slowfast.yaml --output reproduce --loss_act_weight 1.7 --cen_gau_sigma 1.7 --loss_weight_boundary_conf 0.5
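If you prefer to launch training from Python, e.g. to sweep the centricity Gaussian sigma, a sketch along these lines works (flag names are taken verbatim from the command above; the sweep values are illustrative, not from the paper):

```python
# Launch the documented training command for several sigma values.
# Flags come verbatim from the README command; sweep values are illustrative.
import subprocess

for sigma in (1.5, 1.7, 1.9):
    subprocess.run(
        [
            "python", "./train.py", "./configs/epic_slowfast.yaml",
            "--output", f"sweep_sigma{sigma}",
            "--loss_act_weight", "1.7",
            "--cen_gau_sigma", str(sigma),
            "--loss_weight_boundary_conf", "0.5",
        ],
        check=True,
    )
```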
To validate the model, run:
python ./eval.py ./configs/epic_slowfast.yaml ./ckpt/epic_slowfast_reproduce/name_of_the_best_model
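Since the best checkpoint's file name depends on your run, a small helper can pick the newest one (the `*.pth.tar` pattern is an assumption inherited from ActionFormer-style checkpointing; adjust it if your checkpoints are named differently):

```python
# Find the newest checkpoint under the experiment folder to pass to eval.py.
# The "*.pth.tar" pattern is an assumption (ActionFormer-style checkpoints).
import glob
import os

ckpts = sorted(
    glob.glob("./ckpt/epic_slowfast_reproduce/*.pth.tar"),
    key=os.path.getmtime,
)
print(ckpts[-1] if ckpts else "no checkpoints found")
```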
The results (mAP at the tIoU thresholds) should be:

[RESULTS] Action detection results (action)
|tIoU = 0.10: mAP = 20.88 (%)
|tIoU = 0.20: mAP = 20.13 (%)
|tIoU = 0.30: mAP = 18.92 (%)
|tIoU = 0.40: mAP = 17.51 (%)
|tIoU = 0.50: mAP = 15.03 (%)
Average mAP: 18.50 (%)
[RESULTS] Action detection results (noun)
|tIoU = 0.10: mAP = 26.78 (%)
|tIoU = 0.20: mAP = 25.58 (%)
|tIoU = 0.30: mAP = 23.91 (%)
|tIoU = 0.40: mAP = 21.45 (%)
|tIoU = 0.50: mAP = 17.68 (%)
Average mAP: 23.08 (%)
[RESULTS] Action detection results (verb)
|tIoU = 0.10: mAP = 24.11 (%)
|tIoU = 0.20: mAP = 23.00 (%)
|tIoU = 0.30: mAP = 21.66 (%)
|tIoU = 0.40: mAP = 20.16 (%)
|tIoU = 0.50: mAP = 16.57 (%)
Average mAP: 21.10 (%)
This implementation is based on ActionFormer.