An implementation of a TTS model based on the paper P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting. The main differences between this implementation and the paper are listed in the Differences section.
I have two goals for this project. The model seems to work, but it performs poorly on Japanese and on numbers.
- First, I want to test character-based input with SeamlessM4T's Aligner for English, Korean, Japanese, and other languages, focusing mainly on the first three.
- Second, I want to build a zero-shot multilingual TTS model. Because the model is trained on SentencePiece tokenizer input, it does not need a phonemizer, so it should adapt easily to any language the tokenizer supports. Check out the tokenizer's supported languages here.
The samples below were generated by a model trained on LibriTTS-R and on the Korean and Japanese corpora of the AIHub 131 datasets. All samples are decoded with the MultiBand-Diffusion model from AudioCraft. The pretrained checkpoint used here is available on Hugging Face.
See the sample notebook for how to use it.
Currently, the speaker embedding of the multilingual model seems to be highly entangled with language information, which degrades zero-shot capability. I'm planning to train a new model with a language ID to reduce the language bias in the speaker embedding.
Code-switch Text: There's famous japanese sentence, つきがきれいですね, which means 나는 당신을 사랑합니다.
English Prompt Generation
code_switch_libri.mp4
Japanese Prompt Generation
code_switch_jsut.mp4
Korean Prompt Generation
code_switch_kss.mp4
English Text: P-Flow encodec is Text-to-Speech model trained on Encodec latent space, using Flow Matching.
Prompt Audio (from LibriTTS-R)
libritts_r_prompt.mp4
Generated Audio
libritts_r_sample.mp4
Japanese Text: こんにちは、初めまして。あなたの名前はなんですか?これは音声合成モデルから作られた音声です。
Prompt Audio (from JSUT)
jsut_prompt.mp4
Generated Audio
jsut_sample.mp4
Korean Text: 백남준은 미디어 아트의 개척자로서 다양한 테크놀로지를 이용하여 실험적이고 창의적으로 작업했다.
Prompt Audio (from KSS)
kss_prompt.mp4
Generated Audio
kss_sample.mp4
I developed this on WSL (Windows 11) and have not tested other platforms or torch versions. I recommend using a conda environment.
conda create -n pflow-encodec -y python=3.10
conda activate pflow-encodec
conda install -y pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -y -c conda-forge libsndfile==1.0.31
pip install -r requirements.txt
pip install -r infer-requirements.txt
First, prepare a TSV file with three tab-separated columns: audio_path, text, and duration.
audio_path is the path to the audio file, text is the transcript of the audio file, and duration is the duration of the audio file in seconds.
audio_path text duration
/path/to/audio1.wav Hello, World! 1.5
/path/to/audio2.wav 안녕하세요, 세계! 2.0
/path/to/audio3.wav こんにちは、世界! 2.5
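A metadata file in this layout can be produced with a few lines of Python. The following is only a minimal sketch, assuming the audio files are PCM WAV files readable by the standard-library wave module; the helper names wav_duration_seconds and write_meta_tsv are hypothetical and are not part of this repository.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_meta_tsv(rows, out_path):
    """Write (audio_path, transcript) pairs as a tab-separated metadata file.

    The duration column is measured from each WAV file.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("audio_path\ttext\tduration\n")
        for audio_path, text in rows:
            # Collapse tabs and newlines inside the transcript so they
            # cannot break the three-column layout.
            clean_text = " ".join(text.split())
            duration = wav_duration_seconds(audio_path)
            f.write(f"{audio_path}\t{clean_text}\t{duration:.3f}\n")
```

For other audio formats, the duration would have to come from a different reader, for example an audio library you already use.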
The model uses the Encodec latent as its output and the duration per token as the target of the duration predictor.
You can dump the Encodec latents and SentencePiece token durations with the following commands.
python scripts/dump_durations.py --input_tsv <meta_tsv_file>
python scripts/dump_latents.py --input_tsv <meta_tsv_file>
These commands require a GPU, and scripts/dump_durations.py may require more than 8 GB of GPU memory.
scripts/dump_durations.py takes about 6 hours for 1000 hours of audio, and scripts/dump_latents.py takes about 4 hours for 1000 hours of audio. Both times were measured on an RTX 4090.
Together, the scripts create two files per audio file: <audio_path stem>.latent.npy and <audio_path stem>.duration.npy.
NOTE: scripts/dump_latents.py prints the global mean and std of the dataset's latents. Keep these values, since they are used for training the model.
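To illustrate what a global mean and std over a dataset means, here is a minimal sketch that accumulates statistics chunk by chunk and then normalizes values with them. It is an assumption that the statistics are scalars over all latent values, that the population standard deviation is used, and that normalization is (x - mean) / std; the script itself may compute them differently. RunningStats and normalize are hypothetical names.

```python
import math


class RunningStats:
    """Accumulate a global mean and standard deviation over many chunks."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, values):
        """Add one chunk of values, e.g. the flattened latents of one file."""
        for v in values:
            self.count += 1
            self.total += v
            self.total_sq += v * v

    @property
    def mean(self):
        return self.total / self.count

    @property
    def std(self):
        # Population standard deviation: sqrt(E[x^2] - E[x]^2).
        variance = self.total_sq / self.count - self.mean ** 2
        return math.sqrt(max(variance, 0.0))


def normalize(values, mean, std):
    """Shift and scale values with the dataset-level statistics."""
    return [(v - mean) / std for v in values]
```

In practice each chunk would be the flattened content of one .latent.npy file, loaded for example with numpy.load.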
Now, you can start training.
The repository's code is based on lightning-hydra-template.
After preparing the dataset, set up a dataset config and an experiment config before training. Suppose your dataset is named new_dataset. First, create the dataset config in configs/data/new_dataset.yaml.
_target_: pflow_encodec.data.datamodule.TextLatentLightningDataModule
train_tsv_path: <train_tsv_path>
val_tsv_path: <val_tsv_path>
add_trailing_silence: True
batch_durations: 50.0 # mini-batch duration in seconds
min_duration: 3.5 # minimum duration of files, this value MUST be bigger than 3.0
max_duration: 15.0
boundaries: [3.0, 5.0, 7.0, 10.0, 15.0]
num_workers: 8
return_upsampled: False
max_frame: 1500 # 20s
text2latent_rate: 1.5 # 50Hz:75Hz
mean: <mean>
std: <std>
Replace <train_tsv_path>, <val_tsv_path>, <mean>, and <std> with your dataset's metadata paths and mean/std values.
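The batch_durations, min_duration, max_duration, and boundaries options suggest duration-based bucketing. The sketch below shows one plausible reading of those options, not the repository's actual sampler: files outside the duration range are dropped, the rest are grouped into buckets by the boundaries, and each bucket is split into mini-batches whose total duration stays within batch_durations. The function names are hypothetical.

```python
from bisect import bisect_left


def bucket_index(duration, boundaries):
    """Return the index of the first boundary that is >= duration."""
    return bisect_left(boundaries, duration)


def make_duration_batches(durations, boundaries, batch_durations,
                          min_duration, max_duration):
    """Group item indices into batches capped by total audio duration.

    Items outside [min_duration, max_duration] are dropped. Items of similar
    length share a bucket, which keeps padding inside a batch small.
    """
    buckets = {}
    for idx, dur in enumerate(durations):
        if not (min_duration <= dur <= max_duration):
            continue
        buckets.setdefault(bucket_index(dur, boundaries), []).append(idx)

    batches = []
    for b in sorted(buckets):
        current, current_total = [], 0.0
        for idx in buckets[b]:
            # Start a new batch when adding this item would exceed the cap.
            if current and current_total + durations[idx] > batch_durations:
                batches.append(current)
                current, current_total = [], 0.0
            current.append(idx)
            current_total += durations[idx]
        if current:
            batches.append(current)
    return batches
```

The comments in the config are consistent with a 75 Hz latent frame rate: 1500 frames / 75 Hz = 20 s, and 75 Hz / 50 Hz = 1.5 for text2latent_rate.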
Then create an experiment config in configs/experiment/new_dataset.yaml based on configs/experiment/default.yaml.
# @package _global_
defaults:
  - override /data: new_dataset.yaml # your dataset config name here!!!
  - override /model: pflow_base.yaml
  - override /callbacks: default.yaml
  - override /trainer: gpu.yaml
  - override /logger: tensorboard.yaml
task_name: pflow
tags: ["pflow"]
seed: 998244353
test: False
callbacks:
  val_checkpoint:
    filename: "val_latent_loss_{val/latent_loss:.4f}-{step:06d}"
    monitor: val/latent_loss
    mode: "min"
model:
  scheduler:
    total_steps: ${trainer.max_steps}
    pct_start: 0.02
  sample_freq: 5000
  sample_idx: [] # sample indices used for sampling while train. idx will be used to choose samples from validation dataset. so this value should not be greater than len(val_dataset)
  mean: ${data.mean}
  std: ${data.std}
trainer:
  max_steps: 500000
  max_epochs: 10000 # arbitrary large number
  precision: bf16-mixed # you should check if your GPU supports bf16
  accumulate_grad_batches: 4 # effective batch size
  gradient_clip_val: 0.2
  num_nodes: 1
  devices: 1
hydra:
  run:
    dir: <fill experiment result path>
Now you can run training with the following command.
python pflow_encodec/train.py experiment=new_dataset
NOTE: To train the model on multiple GPUs, adjust trainer.num_nodes and trainer.devices in the experiment config, and set trainer.use_distributed_sampler to False. For more detailed information, check out the PyTorch Lightning documentation.
Example for a single node with 4 GPUs:
trainer:
  num_nodes: 1
  devices: 4
  use_distributed_sampler: False
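When changing the number of GPUs, it can help to check how much audio contributes to one optimizer step. The sketch below is simple arithmetic under the assumption that every device draws its own mini-batch of up to batch_durations seconds and that gradients are accumulated over accumulate_grad_batches mini-batches; both helper names are hypothetical.

```python
def audio_seconds_per_optimizer_step(batch_durations, accumulate_grad_batches,
                                     devices=1, num_nodes=1):
    """Upper bound on seconds of audio contributing to one optimizer step."""
    return batch_durations * accumulate_grad_batches * devices * num_nodes


def accumulation_for_target(target_seconds, batch_durations,
                            devices=1, num_nodes=1):
    """Pick accumulate_grad_batches that roughly keeps a target step size."""
    per_step = batch_durations * devices * num_nodes
    return max(1, round(target_seconds / per_step))
```

With the example config (batch_durations 50.0, accumulate_grad_batches 4, one GPU), this gives at most 200 seconds of audio per step; on 4 GPUs, accumulate_grad_batches of 1 would keep roughly the same amount.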
Language | Weights | Model Card |
---|---|---|
MultiLingual(EJK) | 🤗 Hub | Link |
English | 🤗 Hub | |
Japanese | 🤗 Hub | |
Korean | 🤗 Hub | |
- Implement baseline model.
- Train model on libritts-r.
- Simple gradio demo.
- Dataset preparation documentation.
- Train model on another language; I'm planning to train on Korean and Japanese.
- Multilingual model.
- Test Language ID embedding in Text Encoder for Multilingual Model.
- Train a small BERT with SeamlessM4T's tokenizer, then apply it to the Text Encoder.
I did not conduct ablation studies for each change due to a lack of resources.
- Use Encodec instead of MelSpectrogram.
- Use character-based input instead of phonemes, and GT duration as the target of the duration predictor instead of MAS.
- Use AdaLN-Zero from DiT for the speaker-conditioned text encoder instead of concat and self-attention.
- Use a transformer with AdaLN-Single timestep conditioning (from PixArt-α) as the Flow Matching decoder instead of WaveNet blocks.
- Use attention pooling instead of mean pooling to get a fixed-size speaker embedding, as P-Flow did in its ablation study.
- Use conv-feedforward (FFT block from FastSpeech) and GeGLU.
- Use ALiBi + convolutional positional encoding in the transformer, from data2vec 2.0 and Voicebox.
- Use null cond for CFG sampling instead of mean-pooled hidden vectors.
- I borrowed some code from VITS repo, voicebox-pytorch, and fairseq2.
- This research used datasets from 'The Open AI Dataset Project (AI-Hub, S. Korea)'. All data information can be accessed through 'AI-Hub (www.aihub.or.kr)'.
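As an illustration of the CFG sampling with a null condition mentioned in the list of differences, here is a toy sketch of classifier-free guidance combined with Euler integration of a flow-matching ODE. It assumes the common guidance form v_cond + scale * (v_cond - v_null) and a fixed-step Euler solver; the function names, the guidance convention, and the solver are illustrative assumptions, not the repository's actual code.

```python
def cfg_velocity(v_cond, v_null, guidance_scale):
    """Combine conditional and null-condition velocity predictions.

    A guidance_scale of 0 returns the conditional prediction unchanged.
    """
    return [c + guidance_scale * (c - n) for c, n in zip(v_cond, v_null)]


def euler_sample(velocity_fn, x0, cond, null_cond,
                 guidance_scale=1.0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps.

    velocity_fn(x, t, cond) stands in for the trained decoder; it is called
    once with the real condition and once with the null condition per step.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v_cond = velocity_fn(x, t, cond)
        v_null = velocity_fn(x, t, null_cond)
        v = cfg_velocity(v_cond, v_null, guidance_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real model, x0 would be Gaussian noise with the shape of the Encodec latent, and the null condition would replace the speaker prompt and text conditioning.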