This project contains:
- Text-independent Speaker recognition module based on VGG-Speaker-recognition
- Speaker diarization based on UIS-RNN.
- Most of the code is borrowed from UIS-RNN and VGG-Speaker-recognition. This project links the two by generating speaker embeddings for UIS-RNN, and also provides an intuitive display panel.
- pytorch 1.3.0
- keras
- Tensorflow 1.8-1.15
- pyaudio (for installation on Windows, refer to pyaudio_portaudio)
cd ghostvlad
python predict.py
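The confusion matrix (pairwise distances between utterance embeddings) for 4 speakers, 3 utterances each, is shown below. Values inside a speaker's own block are clearly lower than values across speakers.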
0.00 0.32 0.40 | 0.70 0.62 0.76 | 0.81 0.83 0.76 | 0.92 0.83 0.89 |
0.32 0.00 0.48 | 0.68 0.58 0.76 | 0.87 0.84 0.83 | 0.92 0.82 0.86 |
0.40 0.48 0.00 | 0.71 0.65 0.74 | 0.79 0.81 0.72 | 0.90 0.84 0.85 |
********************************************************************************
0.70 0.68 0.71 | 0.00 0.35 0.30 | 0.78 0.81 0.76 | 0.80 0.81 0.80 |
0.62 0.58 0.65 | 0.35 0.00 0.45 | 0.76 0.71 0.73 | 0.82 0.77 0.77 |
0.76 0.76 0.74 | 0.30 0.45 0.00 | 0.83 0.83 0.80 | 0.83 0.84 0.80 |
********************************************************************************
0.81 0.87 0.79 | 0.78 0.76 0.83 | 0.00 0.40 0.46 | 0.76 0.80 0.86 |
0.83 0.84 0.81 | 0.81 0.71 0.83 | 0.40 0.00 0.45 | 0.80 0.78 0.82 |
0.76 0.83 0.72 | 0.76 0.73 0.80 | 0.46 0.45 0.00 | 0.85 0.85 0.84 |
********************************************************************************
0.92 0.92 0.90 | 0.80 0.82 0.83 | 0.76 0.80 0.85 | 0.00 0.41 0.44 |
0.83 0.82 0.84 | 0.81 0.77 0.84 | 0.80 0.78 0.85 | 0.41 0.00 0.41 |
0.89 0.86 0.85 | 0.80 0.77 0.80 | 0.86 0.82 0.84 | 0.44 0.41 0.00 |
********************************************************************************
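A matrix like the one above can be produced by computing a distance between every pair of utterance embeddings. The sketch below assumes cosine distance and uses random vectors in place of real embeddings. The metric behind the numbers above is not stated here, and `pairwise_cosine_distance` and `format_blocks` are hypothetical helpers, not part of this project.

```python
import numpy as np

def pairwise_cosine_distance(embeddings):
    """Return an (N, N) matrix of cosine distances between the rows."""
    emb = np.asarray(embeddings, dtype=np.float64)
    # L2-normalize each row so the dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.maximum(norms, 1e-12)
    dist = 1.0 - unit @ unit.T
    # Clamp rounding noise and force an exact zero diagonal.
    dist = np.clip(dist, 0.0, 2.0)
    np.fill_diagonal(dist, 0.0)
    return dist

def format_blocks(dist, per_speaker):
    """Render the matrix in blocks of `per_speaker` rows and columns."""
    lines = []
    for i in range(dist.shape[0]):
        cells = []
        for j in range(dist.shape[1]):
            cells.append("%.2f" % dist[i, j])
            if (j + 1) % per_speaker == 0:
                cells.append("|")
        lines.append(" ".join(cells))
        if (i + 1) % per_speaker == 0:
            lines.append("*" * 80)
    return "\n".join(lines)

# Assumption: 4 speakers x 3 utterances, 512-dim vectors (random stand-ins).
rng = np.random.default_rng(0)
fake_embeddings = np.abs(rng.normal(size=(12, 512)))
print(format_blocks(pairwise_cosine_distance(fake_embeddings), per_speaker=3))
```

With real embeddings, low values inside the diagonal blocks and high values elsewhere indicate that utterances of the same speaker lie close together.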
Thanks to the authors of VGG-Speaker-recognition, who kindly provide the code and a pre-trained model.
Their paper is "Utterance-level Aggregation for Speaker Recognition in the Wild".
Its novel idea is to apply NetVLAD/GhostVLAD, which are popular in image recognition, to speaker recognition. The previous state of the art was i-vector based, which depended on a GMM model and PLDA.
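To make the aggregation idea concrete, here is a minimal NumPy sketch of NetVLAD-style pooling with ghost clusters. It is an illustration under assumptions, not the code of the pre-trained model: the function name `ghostvlad_pool`, the random parameters, and the sizes are made up for the example, and in a real network the assignment weights and cluster centers are learned.

```python
import numpy as np

def ghostvlad_pool(features, centers, assign_w, assign_b, num_ghost):
    """Aggregate frame-level features (T, D) into one utterance-level vector.

    centers:  (K + G, D) cluster centers; the last G are ghost clusters.
    assign_w: (D, K + G) and assign_b: (K + G,) produce assignment logits.
    """
    logits = features @ assign_w + assign_b          # (T, K + G)
    logits -= logits.max(axis=1, keepdims=True)      # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)    # soft assignment per frame

    num_real = centers.shape[0] - num_ghost
    # Residual of every frame to every real center: (T, K, D).
    residuals = features[:, None, :] - centers[None, :num_real, :]
    # Weighted sum over frames. Ghost clusters take part in the softmax,
    # so they absorb weight, but their residuals are not kept.
    vlad = (weights[:, :num_real, None] * residuals).sum(axis=0)  # (K, D)

    # Normalize each cluster row, then the flattened vector.
    vlad /= np.maximum(np.linalg.norm(vlad, axis=1, keepdims=True), 1e-12)
    flat = vlad.reshape(-1)
    return flat / np.maximum(np.linalg.norm(flat), 1e-12)

# Hypothetical sizes: 50 frames, 8-dim features, 4 real + 2 ghost clusters.
rng = np.random.default_rng(0)
T, D, K, G = 50, 8, 4, 2
frames = rng.normal(size=(T, D))
vec = ghostvlad_pool(frames,
                     centers=rng.normal(size=(K + G, D)),
                     assign_w=rng.normal(size=(D, K + G)),
                     assign_b=np.zeros(K + G),
                     num_ghost=G)
print(vec.shape)  # (32,)
```

The output length is K times D regardless of the number of frames, which is what lets utterances of different lengths map to vectors of one fixed size.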
I have also re-implemented the VGG speaker model in TensorFlow as ghostvlad-speaker, with a corresponding pre-trained model.
This project only shows how to generate speaker embeddings with the pre-trained model, for use in UIS-RNN training later.
For training the speaker model itself, see VGG-Speaker-Recognition.
- http://www.openslr.org/38 contains 855 speakers with 120 Mandarin Chinese utterances each, 102,600 utterances in total.
- VCTK contains 109 English speakers.
- VoxCeleb1 contains 1251 speakers.
- VoxCeleb2 contains 6112 speakers.
How to generate speaker embeddings for the next training stage:
python generate_embeddings.py
You may need to change the dataset path to your own.
python train.py
The speaker embeddings generated by the VGG model are all non-negative vectors and contain many zero elements. UIS-RNN seems to handle such data abnormally, as shown below:
Iter: 0 Training Loss: nan
Negative Log Likelihood: 7.3020 Sigma2 Prior: nan Regularization: 0.0007
Iter: 10 Training Loss: nan
Negative Log Likelihood: nan Sigma2 Prior: nan Regularization: nan
Iter: 20 Training Loss: nan
Negative Log Likelihood: nan Sigma2 Prior: nan Regularization: nan
When I added an insignificant bias (e.g. 0.00001) to each element of the vectors, the error disappeared:
Iter: 0 Training Loss: -581.8732
Negative Log Likelihood: 7.0125 Sigma2 Prior: -588.8864 Regularization: 0.0007
Iter: 10 Training Loss: -614.1193
Negative Log Likelihood: 1.7536 Sigma2 Prior: -615.8737 Regularization: 0.0007
Iter: 20 Training Loss: -644.9244
Negative Log Likelihood: 1.7123 Sigma2 Prior: -646.6375 Regularization: 0.0007
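The workaround can be sketched as follows. The array is a made-up stand-in for real embeddings and `add_bias` is a hypothetical helper. The only detail taken from the text above is the idea of adding a small constant such as 0.00001 to every element before training.

```python
import numpy as np

def add_bias(embeddings, eps=1e-5):
    """Shift every element by a tiny constant so no element is exactly zero."""
    emb = np.asarray(embeddings, dtype=np.float64)
    return emb + eps

# Assumption: non-negative, sparse vectors like those described above.
train_sequence = np.array([[0.0, 0.3, 0.0],
                           [0.0, 0.0, 0.0],
                           [1.2, 0.0, 0.7]])
biased = add_bias(train_sequence)

print("zero fraction before:", np.mean(train_sequence == 0.0))
print("zero fraction after: ", np.mean(biased == 0.0))
```

The shift is far smaller than the typical non-zero values, so the relative geometry of the embeddings is essentially unchanged.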
python speakerDiarization.py
The result is shown below (3 speakers):
========= 0 =========
0:00.288 ==> 0:04.406
0:07.699 ==> 0:16.461
0:33.921 ==> 0:35.8
========= 1 =========
0:04.406 ==> 0:07.699
0:16.461 ==> 0:19.594
0:30.371 ==> 0:33.921
0:41.19 ==> 0:44.185
========= 2 =========
0:19.594 ==> 0:30.371
0:35.8 ==> 0:41.19
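A listing like the one above can be obtained by merging consecutive windows that share a predicted label into one time span per speaker. The following is a minimal sketch, not the code of speakerDiarization.py: the function names, the per-window `(start_ms, end_ms)` input format, and the sample labels are assumptions for illustration.

```python
from collections import defaultdict

def merge_segments(window_times, labels):
    """Merge consecutive windows with the same label.

    window_times: list of (start_ms, end_ms), one per window, in time order.
    labels: predicted speaker id for each window.
    Returns {speaker_id: [(start_ms, end_ms), ...]}.
    """
    result = defaultdict(list)
    if not labels:
        return dict(result)
    seg_start, seg_end = window_times[0]
    current = labels[0]
    for (start, end), label in zip(window_times[1:], labels[1:]):
        if label == current:
            seg_end = end                      # extend the running segment
        else:
            result[current].append((seg_start, seg_end))
            current, seg_start, seg_end = label, start, end
    result[current].append((seg_start, seg_end))
    return dict(result)

def fmt_time(ms):
    """Format milliseconds as m:ss.mmm with zero-padded fields."""
    minutes, rem = divmod(int(ms), 60000)
    seconds, millis = divmod(rem, 1000)
    return "%d:%02d.%03d" % (minutes, seconds, millis)

# Hypothetical input: four contiguous 1-second windows and their labels.
times = [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]
segments = merge_segments(times, [0, 0, 1, 0])
for speaker in sorted(segments):
    print("========= %d =========" % speaker)
    for start, end in segments[speaker]:
        print(fmt_time(start), "==>", fmt_time(end))
```

Zero-padding the millisecond field in `fmt_time` keeps each timestamp unambiguous, e.g. 35.8 s is printed as `0:35.800`.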
The final result is influenced by the window size and the overlap rate. When the overlap is too large, UIS-RNN may output fewer speakers because the speaker embeddings change smoothly from one window to the next; when it is too small, it may output more speakers. The window also cannot be too short: it must contain enough information to produce discriminative speaker embeddings.
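To illustrate the two parameters, here is a minimal sketch of how window size and overlap rate determine the windows from which embeddings are computed. The function name, the frame counts, and the parameter values are hypothetical and are not taken from this project's code.

```python
def sliding_windows(num_frames, window_size, overlap_rate):
    """Return (start, end) frame indices of overlapping windows.

    overlap_rate is the fraction of a window shared with the next one,
    so the hop between window starts is window_size * (1 - overlap_rate).
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0.0 <= overlap_rate < 1.0:
        raise ValueError("overlap_rate must be in [0, 1)")
    hop = max(1, int(round(window_size * (1.0 - overlap_rate))))
    windows = []
    start = 0
    while start + window_size <= num_frames:
        windows.append((start, start + window_size))
        start += hop
    return windows

# Same audio length and window size, different overlap rates.
for rate in (0.0, 0.5, 0.75):
    wins = sliding_windows(num_frames=1000, window_size=100, overlap_rate=rate)
    print("overlap %.2f -> %d windows, first two: %s" % (rate, len(wins), wins[:2]))
```

A higher overlap rate yields more windows, each sharing most of its frames with its neighbors, which is consistent with the smoothly changing embeddings described above.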