In our paper, we proposed BEdit-TTS: Text-Based Speech Editing System with Bidirectional Transformers. This repository provides the open-source implementation. Audio samples are available at https://liangzheng-zl.github.io/bedit-web
The model code is located at espnet/nets/pytorch_backend/e2e_tts_bedit.py.
The system is built on ESPnet; please install ESPnet before running the model. The model requires Python 3.7+ and PyTorch 1.10+. Other required packages are listed in requirements.yaml.
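A minimal environment sketch, assuming requirements.yaml is a conda environment file (the environment name bedit-tts is hypothetical, and ESPnet itself should be installed following its official instructions):
conda env create -f requirements.yaml -n bedit-tts  # assumes requirements.yaml is a conda environment file
conda activate bedit-tts
python -c "import torch; print(torch.__version__)"  # verify PyTorch >= 1.10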
To obtain duration information, you can train a GMM-HMM model with the Kaldi toolkit and perform forced alignment, roughly as sketched below.
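The sketch below uses standard Kaldi recipe scripts and the ali-to-phones tool; the data directories, number of jobs, and output path are placeholders rather than settings provided by this repository.
# Train a monophone GMM-HMM and align the training data (standard Kaldi recipe scripts)
steps/train_mono.sh --nj 16 data/train data/lang exp/mono
steps/align_si.sh --nj 16 data/train data/lang exp/mono exp/mono_ali
# Convert the alignments to per-phone durations in frames
ali-to-phones --write-lengths=true exp/mono_ali/final.mdl "ark:gunzip -c exp/mono_ali/ali.*.gz |" ark,t:durations.txt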
To prepare the data of BEdit-TTS:
bash run.sh --stage 0 --stop_stage 0
bash pre_bedit_data.sh --stage 1 --stop_stage 2 # for training data
# bash pre_bedit_data.sh --stage 1 --stop_stage 4 # for decoding data
To apply CMVN:
bash run.sh --stage 1 --stop_stage 1
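In standard ESPnet1 recipes this stage computes global mean-variance statistics with the Kaldi compute-cmvn-stats tool, roughly equivalent to the following (paths are illustrative):
compute-cmvn-stats scp:data/train/feats.scp data/train/cmvn.ark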
To prepare the dictionary and json data:
bash run.sh --stage 2 --stop_stage 2
To update json data:
bash run.sh --stage 3 --stop_stage 3
To train the model:
bash run.sh --stage 4 --stop_stage 4
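ESPnet1-style recipes usually expose additional run.sh options such as --ngpu, --train_config, and --tag; the configuration file name and tag below are hypothetical examples, not defaults of this repository.
bash run.sh --stage 4 --stop_stage 4 --ngpu 1 --train_config conf/train_pytorch_bedit.yaml --tag bedit_base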
To generate spectrum:
bash run.sh --stage 5 --stop_stage 5
The waveform can be synthesized from the generated spectrogram by a pre-trained HiFi-GAN vocoder, for example as sketched below.
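One convenient option is the parallel_wavegan package, which ships pre-trained HiFi-GAN checkpoints and a decoding CLI; this is an illustrative choice rather than the exact vocoder setup used in the paper, and the checkpoint and feature paths below are placeholders.
pip install parallel_wavegan
parallel-wavegan-decode --checkpoint /path/to/hifigan/checkpoint.pkl --feats-scp dump/decode/feats.scp --outdir exp/decode_wav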