Model Card for P-Flow Encodec TTS (English, Korean, Japanese)

Model Details

Model Description

P-Flow Encodec is Text-to-Speech model based on paper P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting, with some modification. You can check differences here. Model consists of Encodec model from Meta, and Multiband Diffusion decoder, which is also from Meta.

Developed by: seastar105
Model type: Text-to-Speech
Language: English, Korean, Japanese
License: MIT for codes, also it's free to use model weights, but you should indicate that model weight is trained with data from AI Hub, (e.g. This research (paper) used datasets from 'The Open AI Dataset Project (AI-Hub, S. Korea)'. All data information can be accessed through 'AI-Hub (www.aihub.or.kr))
Model version: 1.0
Code Repository: https://github.com/seastar105/pflow-encodec
Model Repository: https://huggingface.co/seastar105/pflow-encodec-ejk

Intended Use

Primary intended use

This model is trained for zero-shot Multilingual TTS. It can be used for generating speech from text in English, Korean, Japanese. Primary intended use is for research purpose, as a baseline for multilingual, code-switch TTS.

Out of scope use cases

This model should not be used for generating or editing someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities.

Training Details

Training dataset: LibriTTS-R, Korean and Japanese corpus from AIHub 131 dataset (Multi-lingual Read Speech corpus for Translation). Only samples with duration less than 15 seconds and over 3.5 seconds are used, 380 hours for english, 637 hours for japanese, 705 hours for korean.
Finetuned from: Multilingual model is initialized using merged pretrained model for each languages mentioned above. Monolingual model for each language is trained with ~250K steps, and then merge their weights with average. Then, it is finetuned with ~250K steps with all languages.
Compute Resource: All model is trained with one RTX 4090 GPU. It takes about 1 day for 100K steps using 4 gradient accumulation steps with batch_durations of 100.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MODEL_CARD.md

MODEL_CARD.md

Model Card for P-Flow Encodec TTS (English, Korean, Japanese)

Model Details

Model Description

Intended Use

Primary intended use

Out of scope use cases

Training Details

Files

MODEL_CARD.md

Latest commit

History

MODEL_CARD.md

File metadata and controls

Model Card for P-Flow Encodec TTS (English, Korean, Japanese)

Model Details

Model Description

Intended Use

Primary intended use

Out of scope use cases

Training Details