This repository is dedicated to customizing and training VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) for Vietnamese text-to-speech (TTS), built on the Coqui TTS framework. It contains the code and resources needed to train VITS to generate high-quality speech from Vietnamese text.
- I highly recommend using a conda virtual environment with Python 3.10 (the version used in the command below):

```bash
conda create -n vits python=3.10
```
- In this repo, I use TTS framework version 0.17.5 for stability:

```bash
pip install TTS==0.17.5
```
- Infore: a single-speaker Vietnamese dataset with 14,935 short audio clips from a female speaker.
- After downloading and extracting the dataset zip file, the directory tree should look like the image below: the infore_16k_denoised folder contains all the .wav files, and the metadata.tsv file lists each wav filename together with its transcript.
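In case the image does not render, the expected layout is roughly as follows (the top-level folder name and the wav file names are illustrative placeholders):

```
infore_dataset/
├── infore_16k_denoised/
│   ├── 00001.wav
│   ├── 00002.wav
│   └── ...
└── metadata.tsv
```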
- To load data samples, you have to define your own formatter function. I have defined one for this dataset in `formater/customformater.py`; you can customize your own for other datasets. A minimal sketch is shown below.
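For reference, here is a minimal sketch of what such a formatter can look like, assuming metadata.tsv holds tab-separated `<wav filename>\t<transcript>` lines and using an arbitrary fixed speaker name (see `formater/customformater.py` for the exact parsing used in this repo):

```python
import os

def customformater(root_path, meta_file, **kwargs):
    """Return samples as a list of dicts in the format Coqui TTS expects."""
    items = []
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            # Assumed column order: <wav filename><TAB><transcript>
            wav_name, text = line.strip().split("\t", 1)
            items.append({
                "text": text,
                "audio_file": os.path.join(root_path, "infore_16k_denoised", wav_name),
                "speaker_name": "infore",  # single-speaker dataset, any fixed name works
                "root_path": root_path,
            })
    return items
```

The function can then be passed to `load_tts_samples` from `TTS.tts.datasets` via its `formatter` argument.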
Run the following command:

```bash
python train_vits.py \
    --output_path [output path for the training process] \
    --data_path [path to the dataset directory] \
    --restore_path [path to a pretrained model checkpoint] \
    --epoch [number of epochs] \
    --batch_size [batch size] \
    --eval_batch_size [eval batch size] \
    --continue_path [path to a training folder to continue training] \
    --sample_rate [sample rate of the audio data] \
    --meta_filename [name of the metadata file]
```
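For example, a fresh run on the Infore dataset might look like this (the paths and hyperparameter values below are illustrative placeholders, not tuned settings):

```bash
python train_vits.py \
    --output_path ./output \
    --data_path ./infore_dataset \
    --epoch 1000 \
    --batch_size 32 \
    --eval_batch_size 16 \
    --sample_rate 16000 \
    --meta_filename metadata.tsv
```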
Please check the `inference.py` file for synthesizing speech from a trained checkpoint.
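For a quick synthesis check, here is a minimal sketch using Coqui TTS's `Synthesizer` class (the checkpoint and config paths are placeholders pointing at your own training output):

```python
from TTS.utils.synthesizer import Synthesizer

# Placeholder paths: point these at the checkpoint and config produced by training.
synthesizer = Synthesizer(
    tts_checkpoint="./output/<run_folder>/best_model.pth",
    tts_config_path="./output/<run_folder>/config.json",
    use_cuda=False,
)

# "Hello, this is a synthesized Vietnamese voice."
wav = synthesizer.tts("Xin chào, đây là giọng nói tổng hợp tiếng Việt.")
synthesizer.save_wav(wav, "sample.wav")
```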
My trained model is published on this Hugging Face Space. Because of hardware constraints, the model's voice is not very natural yet; I will try to improve the voice quality in the future :))).