Check out a video training guide by Thorsten Müller
For Windows, see ssamjh's guide using WSL
Training a voice for Piper involves 3 main steps:
1. Preparing the dataset
2. Training the voice model
3. Exporting the voice model
Choices must be made at each step, including:
- The model "quality"
  - low = 16,000 Hz sample rate, smaller voice model
  - medium = 22,050 Hz sample rate, smaller voice model
  - high = 22,050 Hz sample rate, larger voice model
- Single or multiple speakers
- Fine-tuning an existing model or training from scratch
- Exporting to onnx or PyTorch
Start by installing system dependencies:
```sh
sudo apt-get install python3-dev
```
Then create a Python virtual environment:
```sh
cd piper/src/python
python3 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install --upgrade wheel setuptools
pip3 install -e .
```
Run the `build_monotonic_align.sh` script in the `src/python` directory to build the extension.
Ensure you have espeak-ng installed (`sudo apt-get install espeak-ng`).
The Piper training scripts expect two files that can be generated by `python3 -m piper_train.preprocess` (a sanity-check sketch follows this list):

- A `config.json` file with the voice settings
  - `audio` (required)
    - `sample_rate` - audio rate in hertz
  - `espeak` (required)
    - `language` - espeak-ng voice or alphabet
  - `num_symbols` (required)
    - Number of phonemes in the model (typically 256)
  - `num_speakers` (required)
    - Number of speakers in the dataset
  - `phoneme_id_map` (required)
    - Map from a phoneme (UTF-8 codepoint) to a list of ids
    - Id 0 ("_") is padding (pad)
    - Id 1 ("^") is the beginning of an utterance (bos)
    - Id 2 ("$") is the end of an utterance (eos)
    - Id 3 (" ") is a word separator (whitespace)
  - `phoneme_type`
  - `speaker_id_map`
    - Map from a speaker name to id
  - `phoneme_map`
    - Map from a phoneme (UTF-8 codepoint) to a list of phonemes
  - `inference`
    - `noise_scale` - noise added to the generator (default: 0.667)
    - `length_scale` - speaking speed (default: 1.0)
    - `noise_w` - phoneme width variation (default: 0.8)
- A `dataset.jsonl` file with one line per utterance (JSON objects)
  - `phoneme_ids` (required)
    - List of ids for each utterance phoneme (0 <= id < `num_symbols`)
  - `audio_norm_path` (required)
    - Absolute path to normalized audio file (`.pt`)
  - `audio_spec_path` (required)
    - Absolute path to audio spectrogram file (`.pt`)
  - `speaker_id` (required for multi-speaker)
    - Id of the utterance's speaker (0 <= id < `num_speakers`)
  - `audio_path`
    - Absolute path to original audio file
  - `text`
    - Original text of utterance before phonemization
  - `phonemes`
    - Phonemes from utterance text before converting to ids
  - `speaker`
    - Name of utterance speaker (from `speaker_id_map`)
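
Here is that sketch: a minimal Python check that loads `config.json` and scans `dataset.jsonl`, confirming that every `phoneme_ids` value is below `num_symbols` and that the referenced `.pt` files exist. The `/path/to/training_dir` location is a placeholder for your own pre-processing output directory.

```python
import json
from pathlib import Path

# Placeholder: the --output-dir used during pre-processing.
training_dir = Path("/path/to/training_dir")

with open(training_dir / "config.json", encoding="utf-8") as config_file:
    config = json.load(config_file)

num_symbols = config["num_symbols"]
num_speakers = config["num_speakers"]
print("sample rate:", config["audio"]["sample_rate"])

with open(training_dir / "dataset.jsonl", encoding="utf-8") as dataset_file:
    for line_num, line in enumerate(dataset_file, start=1):
        utterance = json.loads(line)

        # Every phoneme id must fit inside the model's symbol table.
        assert all(0 <= pid < num_symbols for pid in utterance["phoneme_ids"]), \
            f"line {line_num}: phoneme id out of range"

        # Normalized audio and spectrogram tensors must exist on disk.
        for key in ("audio_norm_path", "audio_spec_path"):
            assert Path(utterance[key]).is_file(), f"line {line_num}: missing {utterance[key]}"

        # Multi-speaker datasets also carry a speaker_id per utterance.
        if num_speakers > 1:
            assert 0 <= utterance["speaker_id"] < num_speakers, \
                f"line {line_num}: speaker_id out of range"
```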
The pre-processing script expects data to be a directory with:
- `metadata.csv` - CSV file with text, audio filenames, and speaker names
- `wav/` - directory with audio files
The `metadata.csv` file uses `|` as a delimiter, and has 2 or 3 columns depending on whether the dataset has a single speaker or multiple speakers. There is no header row.
For single speaker datasets:
```csv
id|text
```

where `id` is the name of the WAV file in the `wav` directory. For example, an `id` of `1234` means that `wav/1234.wav` should exist.
For multi-speaker datasets:
```csv
id|speaker|text
```

where `speaker` is the name of the utterance's speaker. Speaker ids will automatically be assigned based on the number of utterances per speaker (speaker id 0 has the most utterances).
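
Before pre-processing, it can help to confirm that every row of `metadata.csv` points at an audio file that actually exists. Below is a small sketch assuming the layout described above; `/path/to/dataset_dir` is a placeholder, and the text column is assumed not to contain `|`.

```python
from pathlib import Path

# Placeholder: the dataset directory containing metadata.csv and wav/.
dataset_dir = Path("/path/to/dataset_dir")

with open(dataset_dir / "metadata.csv", encoding="utf-8") as metadata_file:
    for row_num, row in enumerate(metadata_file, start=1):
        # id|text for a single speaker, id|speaker|text for multiple speakers.
        fields = row.rstrip("\n").split("|")
        if len(fields) not in (2, 3):
            print(f"row {row_num}: expected 2 or 3 columns, got {len(fields)}")
            continue

        # The first column is the WAV file name (without extension) in wav/.
        wav_path = dataset_dir / "wav" / f"{fields[0]}.wav"
        if not wav_path.is_file():
            print(f"row {row_num}: missing {wav_path}")
```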
An example of pre-processing a single speaker dataset:
```sh
python3 -m piper_train.preprocess \
  --language en-us \
  --input-dir /path/to/dataset_dir/ \
  --output-dir /path/to/training_dir/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050
```
The `--language` argument refers to an espeak-ng voice by default, such as `de` for German.
To pre-process a multi-speaker dataset, remove the `--single-speaker` flag and ensure that your dataset has the 3 columns: `id|speaker|text`
Verify the number of speakers in the generated `config.json` file before proceeding.
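
One way to check, assuming the placeholder path below, is to read `num_speakers` back out of the generated file:

```python
import json

# Placeholder: the --output-dir used during pre-processing.
with open("/path/to/training_dir/config.json", encoding="utf-8") as config_file:
    config = json.load(config_file)

# Should be 1 for --single-speaker, or the number of distinct
# speaker names in metadata.csv for a multi-speaker dataset.
print("num_speakers:", config["num_speakers"])
```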
Once you have a `config.json`, `dataset.jsonl`, and audio files (`.pt`) from pre-processing, you can begin the training process with `python3 -m piper_train`.
For most cases, you should fine-tune from an existing model. The model must have the same audio quality and sample rate, but does not necessarily need to be in the same language.
It is highly recommended to train with the following `Dockerfile`:
```dockerfile
FROM nvcr.io/nvidia/pytorch:22.03-py3

RUN pip3 install \
    'pytorch-lightning'

ENV NUMBA_CACHE_DIR=.numba_cache
```
As an example, we will fine-tune the medium quality lessac voice. Download the `.ckpt` file and run the following command in your training environment:
```sh
python3 -m piper_train \
  --dataset-dir /path/to/training_dir/ \
  --accelerator 'gpu' \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 10000 \
  --resume_from_checkpoint /path/to/lessac/epoch=2164-step=1355540.ckpt \
  --checkpoint-epochs 1 \
  --precision 32
```
Use `--quality high` to train a larger voice model (sounds better, but is much slower).
You can adjust the validation split (5% = 0.05) and number of test examples for your specific dataset. For fine-tuning, they are often set to 0 because the target dataset is very small.
Batch size can be tricky to get right. It depends on the size of your GPU's vRAM, the model's quality/size, and the length of the longest sentence in your dataset. The `--max-phoneme-ids <N>` argument to `piper_train` will drop sentences that have more than `N` phoneme ids. In practice, using `--batch-size 32` and `--max-phoneme-ids 400` will work for 24 GB of vRAM (RTX 3090/4090).
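
To see what a given limit would do to your data before training, a short sketch like this one counts the utterances that would be dropped; the path and the candidate value of 400 are placeholders.

```python
import json

# Placeholders: dataset.jsonl from pre-processing and a candidate --max-phoneme-ids value.
dataset_path = "/path/to/training_dir/dataset.jsonl"
max_phoneme_ids = 400

lengths = []
with open(dataset_path, encoding="utf-8") as dataset_file:
    for line in dataset_file:
        lengths.append(len(json.loads(line)["phoneme_ids"]))

dropped = sum(1 for n in lengths if n > max_phoneme_ids)
print("utterances:", len(lengths))
print("longest utterance (phoneme ids):", max(lengths))
print(f"dropped with --max-phoneme-ids {max_phoneme_ids}:", dropped)
```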
If you're training a multi-speaker model, use `--resume_from_single_speaker_checkpoint` instead of `--resume_from_checkpoint`. This will be much faster than training your multi-speaker model from scratch.
To test your voice during training, you can use these test sentences or generate your own with piper-phonemize. Run the following command to generate audio files:
```sh
cat test_en-us.jsonl | \
    python3 -m piper_train.infer \
        --sample-rate 22050 \
        --checkpoint /path/to/training_dir/lightning_logs/version_0/checkpoints/*.ckpt \
        --output-dir /path/to/training_dir/output
```
The input format to `piper_train.infer` is the same as `dataset.jsonl`: one line of JSON per utterance with `phoneme_ids` and `speaker_id` (multi-speaker only). Generate your own test file with piper-phonemize:
```sh
lib/piper_phonemize -l en-us --espeak-data lib/espeak-ng-data/ < my_test_sentences.txt > my_test_phonemes.jsonl
```
Check on your model's progress with tensorboard:
```sh
tensorboard --logdir /path/to/training_dir/lightning_logs
```
Click on the scalars tab and look at both `loss_disc_all` and `loss_gen_all`. In general, the model is "done" when `loss_disc_all` levels off. We've found that 2000 epochs is usually good for models trained from scratch, and an additional 1000 epochs when fine-tuning.
When your model is finished training, export it to onnx with:
```sh
python3 -m piper_train.export_onnx \
    /path/to/model.ckpt \
    /path/to/model.onnx

cp /path/to/training_dir/config.json \
   /path/to/model.onnx.json
```
The export script does additional optimization of the model with onnx-simplifier.
If the export is successful, you can now use your voice with Piper:
```sh
echo 'This is a test.' | \
  piper -m /path/to/model.onnx --output_file test.wav
```