An extensible speech synthesis system, built with PyTorch; the original code comes from r9y9's https://github.com/r9y9/nnmnkwii_gallery . It makes it easy to train an acoustic model around popular components such as Tacotron's encoder, Deep Voice's encoder, the Transformer encoder, or any other encoder you create.
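The extensibility comes down to keeping the encoder behind a common tensor interface. The sketch below is illustrative only (class names and feature dimensions are hypothetical, not taken from this repo): any module that maps `(batch, time, in_dim)` to `(batch, time, enc_dim)` can be dropped into the acoustic model.

```python
import torch
import torch.nn as nn

class TacotronStyleEncoder(nn.Module):
    """Toy stand-in for a Tacotron-style encoder: prenet + bidirectional GRU."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):              # x: (batch, time, in_dim)
        out, _ = self.rnn(self.prenet(x))
        return out                     # (batch, time, 2 * hidden_dim)

class AcousticModel(nn.Module):
    """Any encoder with the (batch, time, in_dim) -> (batch, time, enc_dim)
    contract can be swapped in without touching the rest of the model."""
    def __init__(self, encoder, enc_dim, out_dim):
        super().__init__()
        self.encoder = encoder
        self.proj = nn.Linear(enc_dim, out_dim)

    def forward(self, linguistic_features):
        return self.proj(self.encoder(linguistic_features))

# Dimensions are illustrative (425 linguistic inputs, 187 acoustic outputs).
model = AcousticModel(TacotronStyleEncoder(in_dim=425, hidden_dim=256),
                      enc_dim=512, out_dim=187)
y = model(torch.randn(2, 100, 425))    # -> (2, 100, 187)
```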
Note: the repo requires wav files with aligned HTS-style full-context label files.
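For reference, an aligned full-context label file can be inspected with nnmnkwii, the library this repo's preprocessing builds on; the path below assumes the `cmu_slt_arctic` layout shown later:

```python
from nnmnkwii.io import hts

# Load one aligned full-context label file (times are in 100 ns HTK units).
labels = hts.load("datasets/slt_arctic_full_data/label_state_align/arctic_a0001.lab")
print(labels.num_frames())  # frames covered by the alignment (5 ms frame shift)
print(labels.num_phones())  # number of phone segments
```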
- Download a dataset
- Unpack the dataset into `~/ExtensibleTTS-PyTorch/datasets`. After unpacking, your tree should look like this for `cmu_slt_arctic`:

```
ExtensibleTTS-PyTorch
 |- datasets
     |- slt_arctic_full_data
         |- label_phone_align
         |- label_state_align
         |- wav
         |- file_id_list_full.scp
         |- questions-radio_dnn_416.hed
```
- Preprocess the data to extract linguistic/duration/acoustic features:

```
python preprocess.py --label state_align
```

Use `--label phone_align` to preprocess phone-aligned labels instead. A sketch of the linguistic-feature extraction follows below.
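Under the hood, Merlin-style linguistic features are derived from the labels and the question set. A minimal sketch with nnmnkwii (file paths assume the tree above; the exact options used by `preprocess.py` may differ):

```python
from nnmnkwii.io import hts
from nnmnkwii.frontend import merlin as fe

# The question set shipped with the dataset defines binary/numeric context features.
binary_dict, numeric_dict = hts.load_question_set(
    "datasets/slt_arctic_full_data/questions-radio_dnn_416.hed")
labels = hts.load(
    "datasets/slt_arctic_full_data/label_state_align/arctic_a0001.lab")

# Frame-level inputs for the acoustic model (subphone features need state alignment).
X_acoustic = fe.linguistic_features(labels, binary_dict, numeric_dict,
                                    add_frame_features=True,
                                    subphone_features="full")

# Segment-level inputs and duration targets for the duration model.
X_duration = fe.linguistic_features(labels, binary_dict, numeric_dict,
                                    add_frame_features=False)
Y_duration = fe.duration_features(labels)
```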
- Compute min/max/mean/var/scale values of the data for input/output feature normalization:

```
python norm_params.py
```
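In Merlin-style pipelines, inputs are typically min/max-scaled into [0.01, 0.99] and outputs are mean/variance-normalized. A plain NumPy sketch of the statistics involved (array shapes are illustrative):

```python
import numpy as np

def minmax_params(X, lo=0.01, hi=0.99):
    """Min/max statistics for scaling inputs into [lo, hi]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    scale = (hi - lo) / np.maximum(xmax - xmin, 1e-8)  # guard constant dims
    return xmin, scale

def meanvar_params(X):
    """Mean/standard deviation for z-normalizing output features."""
    return X.mean(axis=0), np.sqrt(X.var(axis=0))

# Stand-in for all training frames stacked into one (num_frames, dim) matrix.
X_in = np.random.rand(10000, 425)
xmin, xscale = minmax_params(X_in)
X_norm = (X_in - xmin) * xscale + 0.01
```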
- Train a model:

```
python train_dnn.py --train_model duration
```

Use `--train_model acoustic` to train an acoustic model.
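Both models reduce to supervised regression from normalized linguistic features to duration or acoustic targets. A minimal, self-contained PyTorch training loop on random stand-in data (layer sizes are illustrative, not this repo's defaults):

```python
import torch
import torch.nn as nn

# Hypothetical duration model: 416 linguistic inputs -> 5 state durations.
model = nn.Sequential(nn.Linear(416, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for step in range(100):              # stand-in for real minibatches
    x = torch.randn(32, 416)         # normalized linguistic features
    y = torch.randn(32, 5)           # normalized duration targets
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```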
- Synthesize a speech waveform from labels using a duration/acoustic checkpoint:

```
python synthesis.py --label state_align --duration_checkpint * --acoustic_checkpint *
```
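The final step in nnmnkwii_gallery-style recipes reconstructs audio from WORLD features. Assuming the acoustic model predicts mgc/lf0/vuv/bap frames (as in those recipes; this repo's feature set may differ), a sketch with pysptk and pyworld:

```python
import numpy as np
import pysptk
import pyworld

fs, fftlen, alpha, frame_period = 16000, 1024, 0.41, 5.0

def world_synthesis(mgc, lf0, vuv, bap):
    """Reconstruct a waveform from per-frame WORLD acoustic features."""
    spectrogram = pysptk.mc2sp(mgc.astype(np.float64), alpha=alpha, fftlen=fftlen)
    aperiodicity = pyworld.decode_aperiodicity(
        np.ascontiguousarray(bap, dtype=np.float64), fs, fftlen)
    f0 = np.exp(lf0.astype(np.float64)).flatten()
    f0[vuv.flatten() < 0.5] = 0.0    # zero out F0 in unvoiced frames
    return pyworld.synthesize(f0, spectrogram, aperiodicity, fs, frame_period)
```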
- Restore training from a checkpoint:

```
python train.py --restore_step *
```
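A checkpoint used this way needs to carry both model and optimizer state. A minimal sketch of what a `--restore_step`-style flag might do internally (file names and dict keys are assumptions, not this repo's actual format):

```python
import torch
import torch.nn as nn

model = nn.Linear(416, 5)            # placeholder model
optimizer = torch.optim.Adam(model.parameters())
step = 1000

# Save enough state to resume training later.
torch.save({"step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict()},
           "checkpoint_step1000.pth")

# Restore: reload weights and optimizer state, then continue from step + 1.
ckpt = torch.load("checkpoint_step1000.pth", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_step = ckpt["step"] + 1
```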
TODO:

- Combine with MTTS, the Mandarin frontend
- Batch inference for synthesis speedup
- Scheduled sampling
- Model pruning