Paper, Tags: #audio
Modern text-to-speech (TTS) pipelines have many processing stages, and each is learnt independently. These include text normalisation, aligned linguistic featuriation, mel-spectrogram synthesis, nad raw audio waveform synthesis. They often require supervision at each stage, sometimes expensive ground truth annotations to guide the outputs of each stage, and sequential training of the stages.
In this paper we learn to synthesise speech from normalised text on an end-to-end manner. Our models operate directly on character or phoneme input sequences, and produce raw speech audio outputs. Output samples can be found here. We propose EATS, end-to-end adversarial text-to-speech - generative models for TTS trained adversarially. The intermediate bottlenecks are eliminated.
Our model has two high-level submodules: the aligner processes the raw input sequence and produces low-frequency (200Hz) aligned features in its own learnt, abstract feature space. It predicts the duration of each input token. These features are input to the decoder, which upsamples the features by 1D convolutions to produce 24kHz audio waveforms.
We are inspired by GAN-TTS, we employ its generator as the decoder in our model but instead of up-sampling pre-computed linguistic features, its input comes from the aligner block. We make it speaker-conditional by feeding in a speaker embedding. We also adopt the multiple random window discrimination (RWDs) from GAN-TTS.
Our model does not rely on autoregressive sampling or teacher forcing, avoiding issues like exposure bias and reduced parallelism at inference time. There is a gap between the fidelity of the speech produced by our method and the state-of-the-art systems, we believe that the end-to-end problem setup is a promising avenue for future advancements and research in text-to-speech. Our current approach does not attempt to address the text normalisation and phonemisation problems, we rely on a separate, fixed system for these aspects. a fully end-to-end TTS system should operate on unnormalised raw text.