Pritam Sarkar · Ali Etemad
We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled videos. XKD is trained with two pseudo-objectives. First, masked data reconstruction is performed to learn modality-specific representations from audio and visual streams. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through a teacher-student setup to learn complementary information. We introduce a novel domain alignment strategy to tackle the domain discrepancy between the audio and visual modalities, enabling effective cross-modal knowledge distillation. Additionally, to develop a general-purpose network capable of handling both audio and visual streams, modality-agnostic variants of XKD are introduced, which use the same pretrained backbone for different audio and visual tasks. Our proposed cross-modal knowledge distillation improves video action classification by 8% to 14% on UCF101, HMDB51, and Kinetics400. Additionally, XKD improves multimodal action classification by 5.5% on Kinetics-Sound. XKD shows state-of-the-art performance in sound classification on ESC50, achieving a top-1 accuracy of 96.5%.
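To make the second pseudo-objective concrete, below is a minimal sketch of cross-modal distillation with a simple feature-statistics alignment standing in for the paper's domain alignment strategy. It assumes PyTorch, uses random tensors in place of real audio/video encoder outputs, and uses a cosine-similarity distillation loss; it is an illustration, not the repository's implementation.

import torch
import torch.nn.functional as F

def standardize(feats, eps=1e-6):
    # Stand-in for the domain alignment step: match first- and second-order
    # feature statistics of the two modalities before distillation.
    return (feats - feats.mean(dim=0, keepdim=True)) / (feats.std(dim=0, keepdim=True) + eps)

def cross_modal_kd_loss(student_feats, teacher_feats):
    # The student of one modality (e.g., video) is pushed toward the teacher
    # of the other modality (e.g., audio); no gradient flows into the teacher.
    s = standardize(student_feats)
    t = standardize(teacher_feats.detach())
    return 1.0 - F.cosine_similarity(s, t, dim=-1).mean()

# Toy usage with random tensors standing in for encoder outputs.
video_student = torch.randn(8, 768, requires_grad=True)  # hypothetical feature size
audio_teacher = torch.randn(8, 768)
loss = cross_modal_kd_loss(video_student, audio_teacher)
loss.backward()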
To set up the environment, run:

conda create --name xkd --file requirements.txt
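Once the environment is created, activate it with conda activate xkd before running any of the scripts below.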
The sources of all public datasets used in this study are listed below.
- AudioSet: Please check this repository to download AudioSet.
- Kinetics400: You can either use a crawler (similar to the one available for AudioSet) to download Kinetics400, or simply download the copy hosted on Amazon AWS, prepared by the CVD Foundation.
- UCF101: Download from the official website.
- HMDB51: Download from the official website.
- ESC50: Download from the official website.
- FSD50K: Download from the official website.
- Kinetics-Sound: This is a subset of Kinetics400.
We provide a sample script to train XKD on Kinetics400.
cd codes/train/src
# w/ modality-specific student and teacher encoders (default)
sbatch xkd.sh 'xkd.yaml'
# w/ modality-agnostic student and modality-specific teacher encoders
sbatch xkd.sh 'xkd_mas.yaml'
# w/ modality-agnostic student and modality-agnostic teacher encoders
sbatch xkd.sh 'xkd_mats.yaml'
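The default configuration trains with both pseudo-objectives. The first one, masked data reconstruction, can be pictured roughly as in the toy sketch below; the tiny transformer, 75% mask ratio, and MSE loss on masked tokens are illustrative assumptions, not the settings used in the repository.

import torch
import torch.nn as nn

class TinyMaskedAutoencoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, dim)  # stand-in for a real decoder

    def forward(self, tokens, mask_ratio=0.75):
        # tokens: (batch, num_tokens, dim), e.g., video patches or spectrogram patches
        b, n, d = tokens.shape
        num_masked = int(n * mask_ratio)
        masked_idx = torch.rand(b, n).argsort(dim=1)[:, :num_masked]
        corrupted = tokens.clone()
        corrupted.scatter_(1, masked_idx.unsqueeze(-1).expand(-1, -1, d), 0.0)  # zero out masked tokens
        recon = self.decoder(self.encoder(corrupted))
        pred = recon.gather(1, masked_idx.unsqueeze(-1).expand(-1, -1, d))
        target = tokens.gather(1, masked_idx.unsqueeze(-1).expand(-1, -1, d))
        return nn.functional.mse_loss(pred, target)  # reconstruct only the masked tokens

# Toy usage: 2 clips, 64 tokens each, 128-dim token embeddings.
loss = TinyMaskedAutoencoder()(torch.randn(2, 64, 128))
loss.backward()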
The bash scripts are available here. Please make sure to update the paths in the scripts before running evaluation.
cd codes/eval/src/
Fine-tuning

UCF101
# for eval on 3 folds
sbatch finetune_ucf101.sh config_f1
sbatch finetune_ucf101.sh config_f2
sbatch finetune_ucf101.sh config_f3
HMDB51
# for eval on 3 folds
sbatch finetune_hmdb51.sh config_f1
sbatch finetune_hmdb51.sh config_f2
sbatch finetune_hmdb51.sh config_f3
Kinetics400
sbatch finetune_k400.sh config
FSD50K
sbatch finetune_fsd50k.sh config
ESC50
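# for eval on 5 folds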
sbatch finetune_esc50.sh config_f1
sbatch finetune_esc50.sh config_f2
sbatch finetune_esc50.sh config_f3
sbatch finetune_esc50.sh config_f4
sbatch finetune_esc50.sh config_f5
Linear evaluation

UCF101
# for eval on 3 folds
sbatch svm_ucf101.sh config_f1
sbatch svm_ucf101.sh config_f2
sbatch svm_ucf101.sh config_f3
HMDB51
# for eval on 3 folds
sbatch svm_hmdb51.sh config_f1
sbatch svm_hmdb51.sh config_f2
sbatch svm_hmdb51.sh config_f3
Kinetics400
sbatch linear_k400.sh config
FSD50K
sbatch linear_fsd50k.sh config
ESC50
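# for eval on 5 folds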
sbatch svm_esc50.sh config_f1
sbatch svm_esc50.sh config_f2
sbatch svm_esc50.sh config_f3
sbatch svm_esc50.sh config_f4
sbatch svm_esc50.sh config_f5
Kinetics-Sound
sbatch svm_ks_audvid.sh config
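Conceptually, the svm_* and linear_* scripts above evaluate frozen XKD features with a simple classifier on top. The sketch below illustrates that protocol with scikit-learn; extract_features is a hypothetical placeholder (random features here) for running the frozen pretrained encoder over a dataset, and the LinearSVC settings are illustrative rather than the values used in the paper.

import numpy as np
from sklearn.svm import LinearSVC

def extract_features(split, rng):
    # Hypothetical placeholder: in the real protocol this would run the frozen
    # pretrained encoder over `split` and pool token features into one vector per clip.
    feats = rng.normal(size=(200, 768)).astype(np.float32)
    labels = rng.integers(0, 10, size=200)
    return feats, labels

rng = np.random.default_rng(0)
train_x, train_y = extract_features("train", rng)
test_x, test_y = extract_features("test", rng)

clf = LinearSVC(C=1.0, max_iter=10000)  # hyperparameters are illustrative
clf.fit(train_x, train_y)
print("top-1 accuracy:", (clf.predict(test_x) == test_y).mean())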
If you find this repository useful, please consider giving it a star ⭐ and citing it using the BibTeX entry below:
@misc{sarkar2022xkd,
title={XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning},
author={Pritam Sarkar and Ali Etemad},
year={2022},
eprint={2211.13929},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
We are grateful to the Bank of Montreal and Mitacs for funding this research. We are also thankful to the SciNet HPC Consortium for helping with the computational resources.
You may directly contact me at pritam.sarkar@queensu.ca or connect with me on LinkedIn.