Welcome to Predicting-Speaker-Quality! This repository contains the code for my Bachelor's thesis, "Predicting Speaker Quality Using Embeddings". All of it is research code written by an inexperienced undergraduate student, so please don't expect perfect documentation. If you run into any trouble, or want to improve or extend the code base, don't hesitate to reach out to me. Found a mistake? Let me know as well.
Besides reading this README, a good way to delve into the topic is to read the thesis itself, which is included in this repository as `Predicting Speaker Quality Using Embeddings.pdf`.
To set up the project, follow these steps:
- Clone this repository.
- Install the requirements from `requirements.txt` using `pip install -r requirements.txt` if they are not already satisfied. If you like, you can do this in a virtual environment to keep things tidy.
- Download the Spoken Wikipedia Corpus (German, with audio) from https://nats.gitlab.io/swc/ and replace the directory `german` with it.
- Navigate into the main project directory and execute the `split.sh` script using `bash split.sh -m 10 -d 10 -p`, which will generate up to 10 samples of length 10 seconds from each audio file in the `wavs` directory and its subdirectories. This may take a while. To see all available options, type `bash split.sh -h`.
- Generate the GE2E and TRILL embeddings by running the `update_embeddings.py` script once. If you want to create new embeddings, for example because you have new .wav files in your demo folder, just run it again. It will remember which embeddings have already been created and delete embeddings that are no longer needed.
- Navigate into the `feature-scripts` directory and execute the `update_audio_features.sh` script using `bash update_audio_features.sh`. Just like the previous script, this one does all the bookkeeping for you and tracks new and deleted .wav files.
- To train and evaluate the neural network models (DNNs and LSTMs), simply run the `keras_regressors.py` script. All parameters, such as network architecture and learning rate, can be modified inside the file itself.
- For the kNN and random forest regressors, use the `sklearn_regressors.py` file. As before, all parameters can be set inside the script itself.
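
Two of the setup steps above lend themselves to a small illustration: the sampling done by `split.sh` (up to 10 samples of 10 seconds per audio file) and the bookkeeping done by the update scripts (create what is missing, delete what is stale). The following is only a rough sketch of those two ideas, not the code used in this repository. The function names `plan_segments` and `plan_sync`, the choice of consecutive non-overlapping clips, and the `.npy` extension for stored embeddings are all assumptions made for the example.

```python
from pathlib import PurePosixPath


def plan_segments(duration_s, max_samples=10, sample_len_s=10):
    """Return (start, end) offsets in seconds for up to max_samples clips.

    Assumption: clips are consecutive and non-overlapping, starting at 0.
    split.sh may choose its offsets differently.
    """
    count = min(max_samples, int(duration_s // sample_len_s))
    return [(i * sample_len_s, (i + 1) * sample_len_s) for i in range(count)]


def plan_sync(wav_paths, embedding_paths):
    """Compare .wav files with stored embeddings by path without extension.

    Returns (to_create, to_delete): entries that still need an embedding,
    and entries whose .wav file no longer exists.
    """
    wav_stems = {str(PurePosixPath(p).with_suffix("")) for p in wav_paths}
    emb_stems = {str(PurePosixPath(p).with_suffix("")) for p in embedding_paths}
    return sorted(wav_stems - emb_stems), sorted(emb_stems - wav_stems)


if __name__ == "__main__":
    # A 35-second recording yields three full 10-second clips.
    print(plan_segments(35))

    # Hypothetical file listings, relative to their respective root folders.
    wavs = ["demo/a.wav", "demo/b.wav"]
    embeddings = ["demo/b.npy", "demo/c.npy"]
    to_create, to_delete = plan_sync(wavs, embeddings)
    print("create:", to_create)
    print("delete:", to_delete)
```

Comparing paths without their extension means a file only has to be listed, not opened, to decide whether its embedding is up to date. This would explain why re-running the update scripts after adding a few recordings is cheap, if they work along similar lines.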
If you want to create plots from the resulting predictions (like the ones in the thesis), take a look at the individual plotting scripts inside `plot-scripts`.
To evaluate the audio recordings inside `wavs/demo`, please use the script `demo.py`.
The code in the `encoder` directory, which generates the GE2E embeddings, is forked from Corentin Jemine's Real-Time-Voice-Cloning repository (https://github.com/CorentinJ/Real-Time-Voice-Cloning) and is available in a better documented form under the name Resemblyzer.