Note that the best way of training a multispeaker model is yet to be determined; this document describes several of the multispeaker models that are possible in Ophelia. Festival is used for the phonetic transcription (see installation).
wget https://datashare.is.ed.ac.uk/bitstream/handle/10283/2651/VCTK-Corpus.zip
unzip VCTK-Corpus.zip
cd VCTK-Corpus
ls txt/*/*.txt | while read TXT ; do
    BASE=`basename $TXT .txt`
    LINE=`cat $TXT`
    SPKR=`echo $BASE | awk -F_ '{print $1}'`
    if [ "$SPKR" = "p376" ] ; then
        LINE=${LINE:1:-1}  ## remove initial and final " in the p376 data
    fi
    echo "$BASE||$LINE"
done > metadata.csv
CODEDIR=/path/to/ophelia
DATADIR=/path/to/VCTK-Corpus
cd $CODEDIR
python ./script/festival/csv2scm.py -i $DATADIR/metadata.csv -o $DATADIR/utts.data
You need to be in the same directory as the utts.data file (which should be $DATADIR) for the Festival command below to run:
cd $DATADIR/
FEST=/path/to/your/installation/of/festival
SCRIPT=$CODEDIR/script/festival/make_rich_phones_combirpx_noplex.scm
$FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./transcript_temp1.csv
python $CODEDIR/script/festival/fix_transcript.py ./transcript_temp1.csv > ./transcript_temp2.csv
Note that this transcription uses RP English, which obviously does not match all the VCTK speakers' accents. See Prepare a new database for some notes on this.
Add the speaker code as the last field of each line in the transcript, either by running:
awk -F_ '{print $1}' ./transcript_temp2.csv > speakers.tmp
paste -d\| ./transcript_temp2.csv speakers.tmp > ./transcript.csv
or:
python $CODEDIR/script/festival/multi_transcript.py -i ./transcript_temp2.csv -o ./transcript.csv
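To confirm the speaker codes were added correctly, a quick check along the lines of the Python sketch below can be used. It assumes '|'-separated fields with the basename (e.g. p225_001) first and the speaker code appended as the last field:

## Sketch: print the first and last |-separated field of the first few transcript lines;
## the last field should now be the speaker code (e.g. p225_001 -> p225).
with open('./transcript.csv', encoding='utf-8') as f:
    for line in list(f)[:3]:
        fields = line.rstrip('\n').split('|')
        print(fields[0], '->', fields[-1])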
mkdir $DATADIR/test_set
cd $DATADIR/test_set
ls /afs/inf.ed.ac.uk/group/cstr/projects/corpus_1/Nick48kHz/txt/hvd*.txt | while read TXT ; do
    BASE=`basename $TXT .txt`
    LINE=`cat $TXT`
    echo "$BASE||$LINE"
done > harvard.csv
python $CODEDIR/script/festival/csv2scm.py -i $DATADIR/test_set/harvard.csv -o $DATADIR/test_set/utts.data
FEST=/path/to/your/installation/of/festival
SCRIPT=$CODEDIR/script/festival/make_rich_phones_combirpx_noplex.scm
$FEST -b $SCRIPT | grep ___KEEP___ | sed 's/___KEEP___//' | tee ./harvard_tmp.csv
python $CODEDIR/script/festival/fix_transcript.py ./harvard_tmp.csv > ./harvard_combilex_rpx.csv
The wav files should all be in one directory:
mkdir $DATADIR/wav
mv $DATADIR/wav48/p*/*.wav $DATADIR/wav/
Normalise levels by running:
cd $CODEDIR
python ./script/normalise_level.py -i $DATADIR/wav -o $DATADIR/wav_norm/ -ncores 25
rm -r $DATADIR/wav
The original wav directory is removed to save space. Next, trim the silences from the data:
./util/submit_tf_cpu.sh ./script/split_speech.py -w $DATADIR/wav_norm/ -o $DATADIR/wav_trim_15dB/ -dB 15 -ncores 25 -trimonly
The configuration file (./config/vctk_01.cfg) holds paths to the data, transcripts, etc., which should be changed to match the way your data and files are structured.
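The fragment below is only a hypothetical illustration of the kind of entries involved; the key names are assumptions, and the real names are whatever vctk_01.cfg actually uses.

## Hypothetical key names for illustration only -- check vctk_01.cfg for the real ones.
datadir = '/path/to/VCTK-Corpus'
waveforms = datadir + '/wav_trim_15dB'      ## the trimmed, level-normalised wavs
transcript = datadir + '/transcript.csv'    ## the transcript with speaker codes

Once the paths are set, run these two commands: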
./util/submit_tf_cpu.sh ./prepare_acoustic_features.py -c ./config/vctk_01.cfg -ncores 25
./util/submit_tf.sh ./prepare_attention_guides.py -c ./config/vctk_01.cfg -ncores 25
The following command outputs a list of phones; paste it into the config file as 'vocab', adding '<PADDING>' as an entry to the list. It also outputs a list of the speakers; paste that into the config file as 'speaker_list', again adding '<PADDING>' as an entry.
python ./script/check_transcript.py -i $DATADIR/transcript.csv -cmp work/vctk_01/data/mels/ -phone -speaker
The histograms that the script also outputs should be used to choose a good cutoff point if there is a long tail of long utterances (such utterances make training less efficient). The script can be rerun with the '-maxframes' and '-maxletters' arguments while experimenting with cutoff points. Once you have decided, set the variables 'max_N' and 'max_T' in the config to the maximum number of characters and of coarse frames in a sentence, respectively.
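Put together, the relevant config entries might end up looking roughly like this sketch. The phone, speaker and cutoff values shown are made up; use the lists printed by check_transcript.py and the cutoffs you chose:

vocab = ['<PADDING>', 'aa', 'ae', 'b', 'ch', 'sil']     ## made-up example -- paste the full phone list here
speaker_list = ['<PADDING>', 'p225', 'p226', 'p227']    ## made-up example -- paste the full speaker list here
max_N = 150    ## max number of characters in a sentence
max_T = 200    ## max number of coarse frames in a sentence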
The variable 'multispeaker' in the configuration file can also be changed; it specifies the locations in the network where speaker codes are added. See the config file for the possible locations. It can also be set to ['learn_channel_contributions'] for learned channel contributions (lcc).
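For example, settings corresponding to the options described in this document would look something like the following (only one of these lines should be active; see the config file for the full set of supported locations):

multispeaker = ['audio_decoder_input']
## multispeaker = ['audio_decoder_input', 'text_encoder_towards_end']
## multispeaker = ['learn_channel_contributions']   ## lcc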
Run the following two commands (possibly simultaneously) to train the Text2Mel and SSRN networks:
./util/submit_tf.sh ./train.py -c config/vctk_01.cfg -m t2m
./util/submit_tf.sh ./train.py -c config/vctk_01.cfg -m ssrn
To train a new model while reusing a previously trained SSRN, simply softlink the old SSRN into the new model's 'work' directory:
mkdir -p $CODEDIR/work/<NEW_MODEL>/train-ssrn/
ln -s $CODEDIR/work/vctk_01/train-ssrn/model_epoch_4* $CODEDIR/work/<NEW_MODEL>/train-ssrn/
ln -s $CODEDIR/work/vctk_01/train-ssrn/checkpoint $CODEDIR/work/<NEW_MODEL>/train-ssrn/
When synthesising, the only thing to note is that the '-speaker' argument must be given, naming a speaker that was present during training:
./util/submit_tf.sh ./synthesize.py -c config/vctk_01.cfg -N 10 -speaker <SPEAKER>
The configs used here differ as follows:
- vctk_01: adds speaker codes at the audio_decoder_input only
- vctk_02: same SSRN as vctk_01; adds speaker codes at both the audio_decoder_input and the text_encoder_towards_end
- vctk_03: learned channel contributions from the c.50 speakers
To fine-tune a multispeaker model to a single speaker (i.e. continue training), the most important difference in the new config file is that 'speaker_list' must also include the new speaker. Because the list gains an entry, the offset added when setting 'nspeakers' needs to be one less than in the multispeaker model's config, e.g.:
nspeakers = len(speaker_list) + 99
Further, the config needs a setting like the following, pointing to the trained multispeaker model's checkpoints:
initialise_weights_from_existing = [('Text2Mel', WORK+'/<OLD_MODEL>/train-t2m/model_epoch_<LAST_EPOCH>'), ('SSRN', WORK+'/<OLD_MODEL>/train-ssrn/model_epoch_<LAST_EPOCH>')]
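Putting the fine-tuning pieces together, the new config might contain something like this sketch. The speaker list is a made-up example (it should be the multispeaker model's list with the new speaker appended), and the angle-bracket placeholders need filling in:

speaker_list = ['<PADDING>', 'p225', 'p226', '<NEW_SPEAKER>']   ## multispeaker list plus the new speaker (made-up example)
nspeakers = len(speaker_list) + 99                              ## offset chosen as discussed above
initialise_weights_from_existing = [
    ('Text2Mel', WORK + '/<OLD_MODEL>/train-t2m/model_epoch_<LAST_EPOCH>'),
    ('SSRN', WORK + '/<OLD_MODEL>/train-ssrn/model_epoch_<LAST_EPOCH>'),
]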