Model Card for P-Flow Encodec TTS (English, Korean, Japanese)

Model Details

Model Description

P-Flow Encodec is a text-to-speech model based on the paper P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting, with some modifications. You can check the differences here. The model consists of the Encodec model from Meta and the Multiband Diffusion decoder, also from Meta.

Intended Use

Primary intended use

This model is trained for zero-shot multilingual TTS. It can generate speech from text in English, Korean, and Japanese. Its primary intended use is research, as a baseline for multilingual and code-switching TTS.

Out of scope use cases

This model should not be used to generate or edit anyone's speech without their consent, including but not limited to government leaders, political figures, and celebrities.

Training Details

  • Training dataset: LibriTTS-R for English, and the Korean and Japanese corpora from AIHub dataset 131 (Multi-lingual Read Speech corpus for Translation). Only samples longer than 3.5 seconds and shorter than 15 seconds are used, which leaves 380 hours of English, 637 hours of Japanese, and 705 hours of Korean.
  • Finetuned from: The multilingual model is initialized from a merge of pretrained monolingual models for the languages listed above. A monolingual model is trained for each language for ~250K steps, and the weights of these models are then averaged. The merged model is then finetuned for ~250K steps on all languages.
  • Compute resource: All models are trained on a single RTX 4090 GPU. Training takes about 1 day per 100K steps, using 4 gradient accumulation steps with batch_durations of 100.
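
The data filter, the weight merge, and the training-time arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not the code used to train the model: the function names (`keep_sample`, `average_weights`, `estimated_training_days`) are hypothetical, checkpoints are represented as plain dicts of float lists rather than real tensors, and the merge is assumed to be a uniform parameter-wise mean, which is one plausible reading of "averaged".

```python
from typing import Dict, List, Sequence

# Duration bounds stated in the training details (seconds).
MIN_DURATION_S = 3.5
MAX_DURATION_S = 15.0


def keep_sample(duration_s: float) -> bool:
    """Return True for clips longer than 3.5 s and shorter than 15 s."""
    return MIN_DURATION_S < duration_s < MAX_DURATION_S


def average_weights(
    checkpoints: Sequence[Dict[str, List[float]]],
) -> Dict[str, List[float]]:
    """Uniformly average parameters across checkpoints (assumed merge rule).

    Each checkpoint maps a parameter name to a flat list of floats; a real
    checkpoint would hold tensors, but the arithmetic is the same.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")
    merged: Dict[str, List[float]] = {}
    for name in names:
        # zip(*...) walks the same parameter position across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(values) / len(checkpoints) for values in columns]
    return merged


def estimated_training_days(total_steps: int, steps_per_day: int = 100_000) -> float:
    """Rough wall-clock estimate from the reported ~100K steps per day."""
    return total_steps / steps_per_day


if __name__ == "__main__":
    # Toy monolingual "checkpoints" for English, Korean, and Japanese.
    en = {"w": [1.0, 2.0]}
    ko = {"w": [3.0, 4.0]}
    ja = {"w": [5.0, 6.0]}
    print(average_weights([en, ko, ja]))
    print([d for d in (2.0, 3.5, 7.2, 15.0, 20.0) if keep_sample(d)])
    # Three monolingual runs plus one multilingual finetune, ~250K steps each.
    print(estimated_training_days(4 * 250_000))
```

If the three monolingual runs and the multilingual finetune are run one after another on the single GPU (an assumption; the card does not say), the step counts above add up to roughly 1M steps, or about 10 days at the reported rate.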