Multimodal-Emotion-Recognition-using-AVTCA

This repository implements a multimodal network for emotion recognition using the Audio-Video Transformer Fusion with Cross Attention (AVT-CA) model, as described in the paper Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention. The implementation supports the RAVDESS dataset, which provides speech audio and frontal-face video spanning 8 distinct emotions: 01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, and 08 = surprised.
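For reference, RAVDESS encodes the emotion as the third field of each seven-field, hyphen-separated filename (e.g., 01-01-05-01-02-01-12.mp4 is an "angry" utterance). Below is a minimal Python sketch of that mapping; the helper function name is ours for illustration, not part of this repo:

```python
# Illustrative sketch (not part of this repo): mapping RAVDESS emotion
# codes to labels. RAVDESS filenames consist of seven hyphen-separated
# numeric fields, the third of which is the emotion code.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(filename: str) -> str:
    """Extract the emotion label from a RAVDESS filename,
    e.g. '01-01-05-01-02-01-12.mp4' -> 'angry'."""
    stem = filename.rsplit(".", 1)[0]
    code = stem.split("-")[2]  # third field is the emotion code
    return RAVDESS_EMOTIONS[code]

print(emotion_from_filename("01-01-05-01-02-01-12.mp4"))  # angry
```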

[Figure: AVT-CA Model Diagram]
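As a rough intuition for the fusion stage, the sketch below shows generic bidirectional cross attention between audio and video token sequences in PyTorch. This is an illustrative assumption, not the AVT-CA implementation: the embedding dimension, head count, and module layout are placeholders, and the paper should be consulted for the actual architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal sketch of bidirectional audio-video cross attention.

    Each modality's token sequence attends to the other's. All
    dimensions and layer choices here are illustrative assumptions,
    not the paper's exact configuration.
    """
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio: (B, T_audio, dim), video: (B, T_video, dim)
        # Video tokens query the audio sequence, and vice versa.
        video_attended, _ = self.audio_to_video(query=video, key=audio, value=audio)
        audio_attended, _ = self.video_to_audio(query=audio, key=video, value=video)
        return audio_attended, video_attended

fusion = CrossAttentionFusion()
audio = torch.randn(2, 50, 256)  # batch of 2, 50 audio tokens
video = torch.randn(2, 16, 256)  # batch of 2, 16 video-frame tokens
a_out, v_out = fusion(audio, video)
print(a_out.shape, v_out.shape)  # torch.Size([2, 50, 256]) torch.Size([2, 16, 256])
```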

Feel free to play around with the code, and let us know if you have any questions or face any issues!

Citation

If you use our work, please cite as:

@misc{AVTCA,
      title={Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention}, 
      author={Joe Dhanith P R and Shravan Venkatraman and Modigari Narendra and Vigya Sharma and Santhosh Malarvannan and Amir H. Gandomi},
      year={2024},
      eprint={2407.18552},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2407.18552}, 
}

If you are referencing our work, please also cite the following related paper:

Chumachenko, K., Iosifidis, A., & Gabbouj, M. (2022). Self-attention fusion for audiovisual emotion recognition with incomplete data. arXiv preprint arXiv:2201.11095. https://arxiv.org/abs/2201.11095

References

This work incorporates EfficientFace, available in the EfficientFace GitHub repository. If you use EfficientFace, please cite the paper "Robust Lightweight Facial Expression Recognition Network with Label Distribution Training." We thank @zengqunzhao for providing both the implementation and the pretrained EfficientFace model!

The training pipeline is adapted from the Efficient-3DCNNs GitHub repository (MIT license), and parts of the fusion implementation are based on the timm library (Apache 2.0 license). For data preprocessing, we used facenet-pytorch.
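A minimal sketch of that preprocessing step, assuming faces are cropped from individual video frames with facenet-pytorch's MTCNN detector (the image size, margin, and file name below are placeholder values, not necessarily this repo's settings):

```python
from facenet_pytorch import MTCNN
from PIL import Image

# Illustrative preprocessing sketch: detect and crop the face from a
# single frame. image_size and margin are assumed values for this
# example, not the repo's exact configuration.
mtcnn = MTCNN(image_size=224, margin=20)

frame = Image.open("frame.jpg")  # hypothetical frame extracted from a RAVDESS clip
face = mtcnn(frame)              # aligned face crop as a (3, 224, 224) tensor, or None
if face is not None:
    print(face.shape)            # torch.Size([3, 224, 224])
```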
