This repository comprises source code for two main research objectives:
- Combining various attention mechanisms to obtain a better model for two-speaker overlapping speech speaker diarization than the current state-of-the-art approaches.
The following combined attention mechanisms are employed in this work. Either a combined or a single attention mechanism can be selected by commenting out the respective lines of code in eend/pytorch_backend/models.py.
- Self Attention + Local Dense Synthesizer Attention (HA-EEND)
- External Attention + Local Dense Synthesizer Attention
- Relative Attention + Local Dense Synthesizer Attention
- Experiments on the language dependency of EEND-based speaker diarization, and testing on combined datasets in both English and Sinhala languages
The repository largely references code from the following sources:
- EEND by the Research & Development Group, Hitachi, Ltd., which holds the copyright
- EEND_PyTorch licensed under MIT License
- External-Attention-pytorch licensed under MIT License
- multihead-LDSA
- attentions licensed under MIT License
- ASR Recipes licensed under an Apache License, Version 2.0.
├── egs : middle tier files
│   ├── asr-sinhala/v1 : Modelling on Sinhala ASR and CALLSINHALA
│   │   ├── conf : configuration files
│   │   ├── local : locally used scripts and other files
│   │   ├── cmd.sh : file that specifies job scheduling system
│   │   ├── path.sh : path file
│   │   ├── run.sh : train/infer/score model
│   │   └── run_prepare_shared.sh : prepare data
│   ├── callhome/v1 : CALLHOME test set
│   ├── combined/v1 : Combined modelling on Sinhala ASR/LibriSpeech and test on CALLHOME
│   └── librispeech/v1 : Modelling on LibriSpeech and CALLHOME
├── eend : backend files
│   └── pytorch_backend/models.py : specify different models to be trained on
└── tools : Kaldi setup
The research was conducted in the following environment
- OS : Ubuntu 18.04 LTS
- CPU and memory:
- For single multi-head layered encoder blocks: 8 CPUs, 32 GB RAM
- For double multi-head layered encoder blocks: 16 CPUs, 64 GB RAM
- Storage : 150-200 GB
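Before training, it may help to confirm that a machine matches the figures above. The following is a minimal sketch, not part of the repository: `has_resources` is a hypothetical helper, and it assumes a Linux host where `nproc` and `/proc/meminfo` are available. Total memory is rounded up to the next GB because the kernel reports slightly less than the installed RAM.

```shell
# Hypothetical helper (not part of the repository).
# Usage: has_resources NEED_CPUS NEED_MEM_GB [HAVE_CPUS HAVE_MEM_GB]
# Succeeds only if the machine has at least the requested CPUs and RAM.
# The last two arguments override detection, which is useful for dry runs.
has_resources() {
    need_cpus=$1
    need_mem_gb=$2
    # Detect the CPU count unless a value was passed explicitly.
    have_cpus=${3:-$(nproc)}
    # Detect total RAM in GB, rounded up, unless a value was passed explicitly.
    have_mem_gb=${4:-$(awk '/^MemTotal:/ {gb = int($2 / 1048576); if (gb * 1048576 < $2) gb++; print gb}' /proc/meminfo)}
    [ "$have_cpus" -ge "$need_cpus" ] && [ "$have_mem_gb" -ge "$need_mem_gb" ]
}
```

For example, `has_resources 8 32 || echo "below the single-block configuration"` checks the current machine against the smaller of the two configurations.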
The following requirements are to be installed
- Anaconda
- CUDA Toolkit
- SoX tool
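Once these requirements are installed, a quick check can confirm that their commands are reachable. This is a minimal sketch under stated assumptions: `missing_tools` is a hypothetical helper, and the command names `conda`, `nvcc` and `sox` are assumed to be the entry points of Anaconda, the CUDA Toolkit and SoX on your system.

```shell
# Hypothetical helper (not part of the repository).
# Prints every named command that cannot be found on PATH and
# returns non-zero if at least one is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `missing_tools conda nvcc sox` prints nothing when all three are available, and lists the absent ones otherwise.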
Follow the steps below to install all the requirements and get started with the project.
sudo apt-get update
sudo apt-get install bzip2 libxml2-dev -y
wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh  # or substitute the latest Anaconda installer, adjusting the file name in the next two commands
bash Anaconda3-2020.11-Linux-x86_64.sh
rm Anaconda3-2020.11-Linux-x86_64.sh
source ~/.bashrc
sudo apt install nvidia-cuda-toolkit -y
sudo apt-get install unzip gfortran python2.7 -y
sudo apt-get install automake autoconf sox libtool subversion -y
sudo apt-get update -y
sudo apt-get install -y flac
git clone https://github.com/Sachini-Dissanayaka/HA-EEND.git
cd HA-EEND/tools/
make
~/HA-EEND/tools/miniconda3/envs/eend/bin/pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
export PYTHONPATH="${PYTHONPATH}:$HOME/HA-EEND/"
export PATH=~/HA-EEND/tools/miniconda3/envs/eend/bin/:$PATH
export PATH=~/HA-EEND/eend/bin:~/HA-EEND/utils:$PATH
export KALDI_ROOT=~/HA-EEND/tools/kaldi
export PATH=~/HA-EEND/utils/:$KALDI_ROOT/tools/openfst/bin:$KALDI_ROOT/tools/sph2pipe_v2.5:$KALDI_ROOT/tools/sctk/bin:~/HA-EEND:$PATH
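The export lines above assume the repository was cloned into the home directory, and they only affect the current shell session. As a hedged sketch, the same settings can be wrapped in a function that takes the clone location as an argument. The names `setup_ha_eend_env` and `HA_EEND_ROOT` are hypothetical and do not exist in the repository; the directories are the ones used in the export lines, prepended in the same order.

```shell
# Hypothetical helper (not part of the repository).
# Usage: setup_ha_eend_env [REPO_ROOT]   (defaults to $HOME/HA-EEND)
setup_ha_eend_env() {
    HA_EEND_ROOT=${1:-$HOME/HA-EEND}
    export KALDI_ROOT="$HA_EEND_ROOT/tools/kaldi"
    # Append the repository to PYTHONPATH without leaving a leading colon
    # when PYTHONPATH was previously empty.
    export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$HA_EEND_ROOT/"
    # Prepend the same directories, in the same order, as the export lines.
    PATH="$HA_EEND_ROOT/tools/miniconda3/envs/eend/bin:$PATH"
    PATH="$HA_EEND_ROOT/eend/bin:$PATH"
    PATH="$HA_EEND_ROOT/utils:$KALDI_ROOT/tools/openfst/bin:$KALDI_ROOT/tools/sph2pipe_v2.5:$KALDI_ROOT/tools/sctk/bin:$HA_EEND_ROOT:$PATH"
    export PATH
}
```

Defining the function in `~/.bashrc` and calling `setup_ha_eend_env` (optionally with a different clone path) would restore the settings in each new session.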
Modify egs/librispeech/v1/cmd.sh according to your job scheduler.
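As a rough illustration of what such a change might look like, the sketch below maps a scheduler name to the Kaldi job wrapper that drives it (`run.pl` for a local machine, `queue.pl` for Grid Engine, `slurm.pl` for SLURM). The helper `kaldi_wrapper_for` and the variable `HA_EEND_SCHEDULER` are hypothetical, and the variable names `train_cmd`, `infer_cmd` and `simu_cmd` are an assumption based on the upstream EEND recipes; confirm them against the cmd.sh shipped in this repository before editing.

```shell
# Hypothetical helper (not part of the repository).
# Maps a scheduler name to the Kaldi wrapper script that submits jobs to it.
kaldi_wrapper_for() {
    case "$1" in
        local) echo "run.pl" ;;     # run jobs on the local machine
        sge)   echo "queue.pl" ;;   # Grid Engine
        slurm) echo "slurm.pl" ;;   # SLURM
        *)     echo "unknown scheduler: $1" >&2; return 1 ;;
    esac
}

# HA_EEND_SCHEDULER is a hypothetical variable; it defaults to local execution.
wrapper=$(kaldi_wrapper_for "${HA_EEND_SCHEDULER:-local}")

# Assumed variable names; check the recipe's cmd.sh for the ones it defines.
export train_cmd="$wrapper"
export infer_cmd="$wrapper"
export simu_cmd="$wrapper"
```

Cluster wrappers usually also need site-specific options such as memory limits or queue names, which are deliberately left out here.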
The following datasets were used in the experiments.
- Training
- Testing
- CALLHOME portion of the 2000 NIST Speaker Recognition Evaluation Corpus
- CALLSINHALA dataset (collected by the authors)
For tests with English data:
Move the datasets (LibriSpeech and CALLHOME) into the folder egs/librispeech/v1/data/local.
Then run the following commands:
cd egs/librispeech/v1
./run_prepare_shared.sh
./run.sh
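Because both scripts depend on the datasets being in place, a small guard can prevent a long run from failing early. This is a minimal sketch: `dir_has_data` is a hypothetical helper that only checks that a directory exists and is not empty, and it makes no assumption about how the datasets are named inside data/local.

```shell
# Hypothetical helper (not part of the repository).
# Succeeds only if the given path is a directory containing at least one entry.
dir_has_data() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}
```

From egs/librispeech/v1, the commands can then be chained as `dir_has_data data/local && ./run_prepare_shared.sh && ./run.sh`, so that nothing runs when the data folder is missing or empty.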
- Yoshani Ranaweera : yoshani.ranaweera.17@cse.mrt.ac.lk
- Sachini Dissanayaka : sachinidissanayaka.17@cse.mrt.ac.lk
- Anjalee Sudasinghe : anjaleeps.17@cse.mrt.ac.lk