A curated list of speech synthesis papers. Recommendations of awesome papers are welcome 😀
## Text Front-End
- Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis (Interspeech 2019)
- A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis (ICASSP 2020)
- A hybrid text normalization system using multi-head self-attention for Mandarin (ICASSP 2020)
## Acoustic Models: Autoregressive
- Tacotron V1: Tacotron: Towards End-to-End Speech Synthesis (Interspeech 2017)
- Tacotron V2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (ICASSP 2018)
- Deep Voice V1: Deep Voice: Real-time Neural Text-to-Speech (ICML 2017)
- Deep Voice V2: Deep Voice 2: Multi-Speaker Neural Text-to-Speech (NeurIPS 2017)
- Deep Voice V3: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
- Transformer-TTS: Neural Speech Synthesis with Transformer Network (AAAI 2019)
- Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis (ICASSP 2020)
- DurIAN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
- Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
- Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (under review ICLR 2021)
- GraphSpeech: GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis (2020)
## Acoustic Models: Non-Autoregressive
- ParaNet: Non-Autoregressive Neural Text-to-Speech (ICML 2020)
- FastSpeech: FastSpeech: Fast, Robust and Controllable Text to Speech (NeurIPS 2019)
- JDI-T: JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment (2020)
- EATS: End-to-End Adversarial Text-to-Speech (2020)
- FastSpeech 2: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (2020)
- FastPitch: FastPitch: Parallel Text-to-speech with Pitch Prediction (2020)
- Glow-TTS (flow based, Monotonic Attention): Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (NeurIPS 2020)
- Flow-TTS (flow based): Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow (ICASSP 2020)
- SpeedySpeech: SpeedySpeech: Efficient Neural Speech Synthesis (Interspeech 2020)
- Parallel Tacotron: Parallel Tacotron: Non-Autoregressive and Controllable TTS (2020)
## Alignment and Attention Mechanisms
- Monotonic Attention: Online and Linear-Time Attention by Enforcing Monotonic Alignments (ICML 2017)
- Monotonic Chunkwise Attention: Monotonic Chunkwise Attention (ICLR 2018)
- Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis (ICASSP 2018)
- RNN-T for TTS: Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments (2019)
- Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis (ICASSP 2020)
- Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (under review ICLR 2021)
## Data Efficiency and Multilingual TTS
- Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis (2018)
- Almost Unsupervised Text to Speech and Automatic Speech Recognition (ICML 2019)
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages (Interspeech 2020)
- Multilingual Speech Synthesis: One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech (Interspeech 2020)
## Vocoders
- WaveNet: WaveNet: A Generative Model for Raw Audio (2016)
- WaveRNN: Efficient Neural Audio Synthesis (ICML 2018)
- LPCNet: LPCNet: Improving Neural Speech Synthesis Through Linear Prediction (ICASSP 2019)
- GAN-TTS: High Fidelity Speech Synthesis with Adversarial Networks (2019)
- WaveGAN: Adversarial Audio Synthesis (2018)
- MultiBand-WaveRNN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
- Parallel-WaveNet: Parallel WaveNet: Fast High-Fidelity Speech Synthesis (2017)
- WaveGlow: WaveGlow: A Flow-based Generative Network for Speech Synthesis (2018)
- Parallel-WaveGAN: Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram (2019)
- MelGAN: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (NeurIPS 2019)
- MultiBand-MelGAN: Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech (2020)
- VocGAN: VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network (Interspeech 2020)
- WaveGrad: WaveGrad: Estimating Gradients for Waveform Generation (2020)
- DiffWave: DiffWave: A Versatile Diffusion Model for Audio Synthesis (2020)
- HiFi-GAN: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (2020)
## Expressive and Controllable TTS
- ReferenceEncoder-Tacotron: Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron (ICML 2018)
- GST-Tacotron: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (ICML 2018)
- Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis (2018)
- GMVAE-Tacotron2: Hierarchical Generative Modeling for Controllable Speech Synthesis (ICLR 2019)
- (Multi-style Decouple): Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency (2019)
- (Multi-style Decouple): Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis (Interspeech 2019)
- Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
- Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
- (local style): Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis (ICASSP 2020)
- Controllable Neural Prosody Synthesis (Interspeech 2020)
## Multi-Speaker TTS and Speaker Adaptation
- Sample Efficient Adaptive Text-to-Speech (ICLR 2019)
- SV-Tacotron: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (NeurIPS 2018)
- Deep Voice V3: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
- Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings (ICASSP 2020)
- MultiSpeech: MultiSpeech: Multi-Speaker Text to Speech with Transformer (2020)
- SC-WaveRNN: Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions (Interspeech 2020)
- MultiSpeaker Dataset: AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines (2020)
## Voice Conversion
- (introduces PPGs into voice conversion): Phonetic posteriorgrams for many-to-one voice conversion without parallel data training (2016)
- A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data (2019)
- TTS-Skins: TTS Skins: Speaker Conversion via ASR (2019)
- One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization (Interspeech 2019)
- Cotatron (combine text information with voice conversion system): Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data (Interspeech 2020)
- (TTS & ASR): Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer (Interspeech 2020)
- VAE-VC (VAE based): Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder (2016)
- (Speech representation learning by VQ-VAE): Unsupervised speech representation learning using WaveNet autoencoders (2019)
- Blow (Flow based): Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion (NeurIPS 2019)
- AutoVC: AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss (2019)
- F0-AutoVC: F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder (ICASSP 2020)
- One-Shot Voice Conversion by Vector Quantization (ICASSP 2020)
- SpeechFlow (auto-encoder): Unsupervised Speech Decomposition via Triple Information Bottleneck (ICML 2020)
- CycleGAN-VC V1: Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks (2017)
- StarGAN-VC: StarGAN-VC: Non-parallel Many-to-Many Voice Conversion Using Star Generative Adversarial Networks (2018)
- CycleGAN-VC V2: CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion (2019)
- CycleGAN-VC V3: CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion (2020)
## Singing Voice Synthesis and Conversion
- XiaoIce Band: XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music (KDD 2018)
- Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
- ByteSing: ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders (2020)
- JukeBox: Jukebox: A Generative Model for Music (2020)
- XiaoIce Sing: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System (2020)
- HiFiSinger: HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis (2020)
- Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss (2020)
- A Universal Music Translation Network (2018)
- Unsupervised Singing Voice Conversion (Interspeech 2019)
- PitchNet: PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network (ICASSP 2020)
- DurIAN-SC: DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System (Interspeech 2020)
- Speech-to-Singing Conversion based on Boundary Equilibrium GAN (Interspeech 2020)