FCL-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech Synthesis (ICASSP 2021)
Paper | Demo
Figure: block diagram of FCL-taco2, where the decoder generates mel-spectrograms in AR mode within each phoneme and is shared across all phonemes.
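The per-phoneme AR decoding can be pictured with a short sketch. This is a minimal illustration of the idea, not the released implementation: the module names, dimensions, prenet, and state handling below are all assumptions made for the example.

```python
# Hedged sketch of the key idea: one decoder, shared by all phonemes,
# is unrolled autoregressively for each phoneme's duration. Names,
# shapes, and the state reset per phoneme are illustrative assumptions.
import torch
import torch.nn as nn

class PerPhonemeARDecoder(nn.Module):
    def __init__(self, enc_dim=256, mel_dim=80, hidden=256):
        super().__init__()
        # prenet on the previous frame, Tacotron-2 style (assumption)
        self.prenet = nn.Sequential(nn.Linear(mel_dim, hidden), nn.ReLU())
        # single decoder cell shared across all phonemes
        self.cell = nn.LSTMCell(enc_dim + hidden, hidden)
        self.proj = nn.Linear(hidden, mel_dim)

    def forward(self, phone_enc, durations):
        """phone_enc: (num_phones, enc_dim); durations: frames per phoneme."""
        mel_dim = self.proj.out_features
        outputs = []
        for enc, dur in zip(phone_enc, durations):
            # AR state is re-initialised for every phoneme, so the
            # per-phoneme loops are independent given the durations
            h = torch.zeros(1, self.cell.hidden_size)
            c = torch.zeros(1, self.cell.hidden_size)
            prev = torch.zeros(1, mel_dim)
            for _ in range(int(dur)):
                x = torch.cat([enc.unsqueeze(0), self.prenet(prev)], dim=-1)
                h, c = self.cell(x, (h, c))
                prev = self.proj(h)
                outputs.append(prev)
        return torch.cat(outputs, dim=0)  # (total_frames, mel_dim)

dec = PerPhonemeARDecoder()
mels = dec(torch.randn(3, 256), [5, 4, 6])
print(mels.shape)  # torch.Size([15, 80])
```

Because the AR recurrence is confined to each phoneme, phonemes can be decoded independently once durations are known, which is what makes this faster than frame-level AR decoding over the whole utterance.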
Dependencies:

- python 3.6.10
- torch 1.3.1
- chainer 6.0.0
- espnet 0.8.0
- apex 0.1
- numpy 1.19.1
- kaldiio 2.15.1
- librosa 0.8.0
Step 1. Data preparation & preprocessing

- Download LJSpeech.
- Unpack the downloaded LJSpeech-1.1.tar.bz2 to /xx/LJSpeech-1.1.
- Obtain the forced-alignment information with the Montreal Forced Aligner (MFA) tool, or download our alignment results and unpack them to /xx/TextGrid (a sketch for inspecting the TextGrid files follows this list).
- Preprocess the dataset to extract mel-spectrograms, phoneme durations, pitch, energy and phoneme sequences:

  ```
  python preprocessing.py --data-root /xx/LJSpeech-1.1 --textgrid-root /xx/TextGrid
  ```
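If you want to sanity-check the alignments before preprocessing, the sketch below reads one TextGrid file and converts phone intervals into frame-level durations. It assumes the `textgrid` package; the tier name "phones", the example filename, and the 12.5 ms hop size are assumptions, not values taken from this repo.

```python
# Hedged sketch: inspect MFA alignments and derive per-phoneme durations.
# Assumes `pip install textgrid`; tier name and hop size are assumptions.
import textgrid

HOP_SECONDS = 0.0125  # assumed frame shift; match your feature config

tg = textgrid.TextGrid.fromFile("/xx/TextGrid/LJ001-0001.TextGrid")
phones_tier = tg.getFirst("phones")

phonemes, durations = [], []
for interval in phones_tier:
    phonemes.append(interval.mark or "sil")  # empty marks are silence
    n_frames = round((interval.maxTime - interval.minTime) / HOP_SECONDS)
    durations.append(int(n_frames))

print(list(zip(phonemes, durations)))
```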
Step 2. Model training

- Train the teacher model FCL-taco2-T:

  ```
  ./teacher_model_training.sh
  ```

- Train the student model FCL-taco2-S:

  ```
  ./student_model_training.sh
  ```

- Train the Parallel WaveGAN (PWG) vocoder by following the instructions in the Parallel WaveGAN repository, or download the pre-trained PWG vocoder and put the PWG model under the directory "vocoder" (a checkpoint-loading sketch follows this list).
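As a quick check that a downloaded checkpoint works, the sketch below loads it with the `parallel_wavegan` package and vocodes a dummy mel-spectrogram. The checkpoint filename and mel dimension are assumptions, not files shipped with this repo; adapt them to what you actually place under "vocoder".

```python
# Hedged sketch: verify a pre-trained PWG checkpoint loads and runs.
# Assumes the `parallel_wavegan` package, and that the matching
# config.yml sits next to the checkpoint (as in PWG training output).
import torch
from parallel_wavegan.utils import load_model

vocoder = load_model("vocoder/checkpoint-400000steps.pkl")  # assumed name
vocoder.remove_weight_norm()
vocoder.eval()

with torch.no_grad():
    mel = torch.randn(100, 80)    # (frames, mel_dim=80), dummy input
    wav = vocoder.inference(mel)  # waveform tensor
print(wav.shape)
```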
Step 3. Model evaluation

- FCL-taco2-T evaluation:

  ```
  ./inference_teacher.sh
  ```

- FCL-taco2-S evaluation:

  ```
  ./inference_student.sh
  ```
If you use this code in your research, please star our repo and cite our paper:

```
@inproceedings{wang2021fcl,
  title={Fcl-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech Synthesis},
  author={Wang, Disong and Deng, Liqun and Zhang, Yang and Zheng, Nianzu and Yeung, Yu Ting and Chen, Xiao and Liu, Xunying and Meng, Helen},
  booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5714--5718},
  year={2021},
  organization={IEEE}
}
```