LibriVoc is a new open-source, large-scale dataset for vocoder artifact detection. It is derived from the LibriTTS speech corpus, which is widely used in text-to-speech research; LibriTTS is in turn derived from the LibriSpeech dataset, whose samples are extracted from LibriVox audiobooks.
The dataset can be viewed or downloaded from: https://drive.google.com/file/d/1JwxyWK52zSu96S1PEqmh59bttu3uHWrW/view?usp=share_link
We use six state-of-the-art neural vocoders to generate the speech samples in the LibriVoc dataset: WaveNet and WaveRNN (autoregressive vocoders), MelGAN and Parallel WaveGAN (GAN-based vocoders), and WaveGrad and DiffWave (diffusion-based vocoders). The training set contains 126.41 hours of real samples and 118.08 hours of synthesized, self-vocoded samples. Table 1 shows the details of the LibriVoc dataset.
Table 1. The number of hours of audio synthesized by each neural vocoder in the LibriVoc dataset.
| Model | train-clean-100 | train-clean-360 | dev-clean | test-clean |
| --- | --- | --- | --- | --- |
| WaveNet | 4.28 | 15.49 | 0.75 | 0.76 |
| WaveRNN | 4.33 | 14.92 | 0.67 | 0.72 |
| MelGAN | 4.36 | 15.26 | 0.71 | 0.76 |
| Parallel WaveGAN | 4.37 | 15.54 | 0.68 | 0.75 |
| WaveGrad | 4.19 | 15.81 | 0.76 | 0.74 |
| DiffWave | 4.16 | 15.37 | 0.62 | 0.66 |
| Total | 25.69 | 92.39 | 4.19 | 4.39 |
Each vocoder synthesizes waveform samples from a given mel spectrogram extracted from an original sample; we refer to this process as “self-vocoding.” By providing each vocoder with the same mel spectrogram, we ensure that any unique artifacts present in the synthesized samples are attributable to the specific vocoder used to reconstruct the audio signal (a minimal sketch of this pipeline is given after the list below). We withhold a set of real samples to use as a validation set in the training process. Specifically, we design the LibriVoc dataset as follows:
- 25% of the speakers contribute only real (original) samples.
- 25% of the speakers contribute only synthesized samples.
- For each speaker in the remaining 50%, half of that speaker's samples are kept real and the other half are synthesized.

This allocation ensures that our classifier does not overfit to speaker identity during training. We further split the whole dataset into three non-overlapping sets for training (33,236 samples), development (5,736 samples), and testing (4,837 samples).
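The self-vocoding step can be sketched roughly as follows. This is only an illustration: the file path, the mel-analysis parameters, and the `vocoder.infer` interface are placeholders rather than the exact LibriVoc configuration.

```python
import librosa
import numpy as np

# Load an original (real) LibriTTS-style utterance; the path is a placeholder.
wav, sr = librosa.load("libritts_sample.wav", sr=24000)

# Extract the mel spectrogram that conditions the vocoder.
# These analysis parameters are illustrative, not the exact LibriVoc setup.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))

# `vocoder` stands in for any of the six pretrained vocoders
# (WaveNet, WaveRNN, MelGAN, Parallel WaveGAN, WaveGrad, DiffWave).
# Its interface here is hypothetical; each vocoder resynthesizes a waveform
# from the same conditioning mel spectrogram, so any artifacts in the output
# are attributable to that vocoder alone.
# resynthesized = vocoder.infer(log_mel)
```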
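The 25% / 25% / 50% speaker allocation can likewise be sketched in a few lines. The function below is hypothetical and only mirrors the rule stated in the list above; it is not the actual LibriVoc preparation script.

```python
import random

def allocate(utterances_by_speaker, seed=0):
    """Split utterances into real vs. to-be-synthesized pools by speaker."""
    rng = random.Random(seed)
    speakers = list(utterances_by_speaker)
    rng.shuffle(speakers)

    n = len(speakers)
    real_only = set(speakers[: n // 4])          # 25%: only real samples
    synth_only = set(speakers[n // 4 : n // 2])  # 25%: only synthesized samples

    real, synth = [], []
    for spk, utts in utterances_by_speaker.items():
        if spk in real_only:
            real.extend(utts)
        elif spk in synth_only:
            synth.extend(utts)
        else:
            # Remaining 50% of speakers: half of each speaker's utterances
            # stay real, the other half are routed through self-vocoding.
            utts = list(utts)
            rng.shuffle(utts)
            half = len(utts) // 2
            real.extend(utts[:half])
            synth.extend(utts[half:])
    return real, synth
```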