PyTorch Implementation of CTCNet (TPAMI 2024): An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits.
- Greatly improves the performance of multimodal (audio-visual) speech separation.
- Incorporates brain-inspired ideas into the network design to improve model performance.
- Still achieves better results on real-world scenes.
This method uses the LRS2, LRS3, and Vox2 datasets to create multimodal speech separation datasets. The Datasets/ folder in the GitHub repository contains the files needed to build them, and the code in the repository can be used to construct the multimodal datasets.
The generated datasets (LRS2-2Mix, LRS3-2Mix, and VoxCeleb2-2Mix) can be downloaded at the links below.
| Datasets | Links | Pretrained Models |
| --- | --- | --- |
| LRS2-2Mix | Removed for copyright | Google Drive |
| LRS3-2Mix | Removed for copyright | Google Drive |
| VoxCeleb2-2Mix | Removed for copyright | Google Drive |
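For readers who want to see what building a two-speaker mixture involves, here is a minimal sketch. It is not the repository's script. It assumes a common recipe: truncate both utterances to the shorter length, scale the second speaker to a chosen signal-to-noise ratio relative to the first, and sum. The helper name `mix_two_speakers` and the `snr_db` parameter are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np


def mix_two_speakers(s1, s2, snr_db=0.0):
    """Hypothetical helper: mix two mono utterances at a target SNR.

    Returns (mixture, target1, target2), all with the same length.
    """
    # Truncate both signals to the shorter one ("min" length mode, an assumption).
    n = min(len(s1), len(s2))
    s1 = np.asarray(s1[:n], dtype=np.float64)
    s2 = np.asarray(s2[:n], dtype=np.float64)

    # Average power of each utterance; the epsilon guards against silence.
    p1 = np.mean(s1 ** 2) + 1e-12
    p2 = np.mean(s2 ** 2) + 1e-12

    # Choose a gain for s2 so that 10 * log10(p1 / (gain^2 * p2)) == snr_db.
    gain = np.sqrt(p1 / (p2 * 10.0 ** (snr_db / 10.0)))
    s2 = gain * s2

    mixture = s1 + s2

    # Rescale everything by the same factor if the mixture would clip.
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture, s1, s2 = mixture / peak, s1 / peak, s2 / peak

    return mixture, s1, s2
```

Because the clipping guard applies the same factor to the mixture and both targets, the targets still sum exactly to the mixture and the requested SNR is preserved.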
The pre-trained video model is a lip-reading model trained only on videos; it achieves 84% accuracy on the LRW dataset.
| Datasets | Links | Pretrained Models |
| --- | --- | --- |
| LRS2-2Mix | Removed for copyright | Google Drive |
- torch 1.13.1+cu116
- torchaudio 0.13.1+cu116
- torchvision 0.14.1+cu116
- pytorch-lightning 1.8.4.post0
- torch-mir-eval 0.4
- torch-optimizer 0.3.0
- fast-bss-eval 0.1.4
- pandas 1.5.1
- rich 10.16.2
- opencv-python
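To confirm that an environment matches the versions listed above, a small check along the following lines may help. This is a sketch, not part of the repository: the helper `check_versions` and the `PINNED` table are hypothetical names, and it assumes each package is installed under the distribution name shown in the list. `opencv-python` has no pinned version, so only its presence is checked.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the requirements list; None means "any version".
PINNED = {
    "torch": "1.13.1+cu116",
    "torchaudio": "0.13.1+cu116",
    "torchvision": "0.14.1+cu116",
    "pytorch-lightning": "1.8.4.post0",
    "torch-mir-eval": "0.4",
    "torch-optimizer": "0.3.0",
    "fast-bss-eval": "0.1.4",
    "pandas": "1.5.1",
    "rich": "10.16.2",
    "opencv-python": None,
}


def check_versions(pinned=PINNED, get_version=version):
    """Return {package: (expected, installed)} for missing or mismatched packages.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Flag missing packages, and pinned packages whose version differs.
        if installed is None or (expected is not None and installed != expected):
            mismatches[name] = (expected, installed)
    return mismatches
```

Calling `check_versions()` with no arguments inspects the current environment; an empty dictionary means every listed package is present at the pinned version.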
python --in_audio_dir audio/wav16k/min --in_mouth_dir mouths --out_dir data
python -c local/lrs2_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml
python -c local/lrs3_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml
python -c local/vox2_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml
python --test=local/data/tt --conf_dir=exp/lrs2_64_64_3_adamw_1e-1_blocks8_pretrain/conf.yml
ffmpeg -i ./test_videos/interview.mp4 -filter:v fps=fps=25 ./test_videos/interview25fps.mp4
mv ./test_videos/interview25fps.mp4 ./test_videos/interview.mp4
python ./utils/ --video_input_path ./test_videos/interview.mp4 --output_path ./test_videos/interview/ --number_of_speakers 2 --scalar_face_detection 1.5 --detect_every_N_frame 8
ffmpeg -i ./test_videos/interview.mp4 -vn -ar 16000 -ac 1 -ab 192k -f wav ./test_videos/interview/interview.wav
python ./utils/ --video-direc ./test_videos/interview/faces/ --landmark-direc ./test_videos/interview/landmark/ --save-direc ./test_videos/interview/mouthroi/ --convert-gray --filename-path ./test_videos/interview/filename_input/interview.csv
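The two ffmpeg steps above (resampling the video to 25 fps and extracting 16 kHz mono audio) can be wrapped for an arbitrary input video. The following is a sketch under stated assumptions: the function names `ffmpeg_commands` and `preprocess` are hypothetical, the ffmpeg flags are copied from the commands above, and `ffmpeg` is assumed to be on the `PATH`. The face-detection and mouth-cropping scripts are left out because they must be run from the repository.

```python
import subprocess
from pathlib import Path


def ffmpeg_commands(video: Path, out_dir: Path, fps: int = 25, sample_rate: int = 16000):
    """Build (temporary video path, resample command, audio-extraction command)."""
    # e.g. interview.mp4 -> interview25fps.mp4, next to the input file
    tmp = video.with_name(f"{video.stem}{fps}fps{video.suffix}")
    wav = out_dir / f"{video.stem}.wav"
    resample = ["ffmpeg", "-i", str(video), "-filter:v", f"fps=fps={fps}", str(tmp)]
    extract = [
        "ffmpeg", "-i", str(video), "-vn",
        "-ar", str(sample_rate), "-ac", "1", "-ab", "192k",
        "-f", "wav", str(wav),
    ]
    return tmp, resample, extract


def preprocess(video: Path, out_dir: Path) -> None:
    """Run both steps; the resampled file replaces the original, like `mv`."""
    tmp, resample, extract = ffmpeg_commands(video, out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(resample, check=True)   # raises CalledProcessError on failure
    tmp.replace(video)                     # overwrite the input with the 25 fps copy
    subprocess.run(extract, check=True)
```

For example, `preprocess(Path("./test_videos/interview.mp4"), Path("./test_videos/interview"))` would reproduce the two ffmpeg commands and the `mv` step shown above.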
This implementation uses parts of the code from the following GitHub repository, as noted in our code: Asteroid.
If you find this code useful in your research, please cite our work:
title={An audio-visual speech separation model inspired by cortico-thalamo-cortical circuits},
author={Li, Kai and Xie, Fenghua and Chen, Hang and Yuan, Kexin and Hu, Xiaolin},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},