LibriVoc is a new open-source, large-scale dataset for vocoder artifact detection. It is derived from the LibriTTS speech corpus, which is widely used in text-to-speech research; LibriTTS is in turn derived from the LibriSpeech dataset, whose samples are extracted from LibriVox audiobooks.
The dataset can be viewed or downloaded from: https://drive.google.com/file/d/1JwxyWK52zSu96S1PEqmh59bttu3uHWrW/view?usp=share_link
We use six state-of-the-art neural vocoders to generate the speech samples in the LibriVoc dataset: WaveNet and WaveRNN (autoregressive vocoders), MelGAN and Parallel WaveGAN (GAN-based vocoders), and WaveGrad and DiffWave (diffusion-based vocoders). The training set contains 126.41 hours of real samples and 118.08 hours of synthesized, self-vocoded samples. Table 1 shows the details of the LibriVoc dataset.
Table 1. The number of hours of audio synthesized by each neural vocoder in the LibriVoc dataset.
| Model | train-clean-100 | train-clean-360 | dev-clean | test-clean |
| --- | --- | --- | --- | --- |
| WaveNet | 4.28 | 15.49 | 0.75 | 0.76 |
| WaveRNN | 4.33 | 14.92 | 0.67 | 0.72 |
| MelGAN | 4.36 | 15.26 | 0.71 | 0.76 |
| Parallel WaveGAN | 4.37 | 15.54 | 0.68 | 0.75 |
| WaveGrad | 4.19 | 15.81 | 0.76 | 0.74 |
| DiffWave | 4.16 | 15.37 | 0.62 | 0.66 |
| Total | 25.69 | 92.39 | 4.19 | 4.39 |
Each vocoder synthesizes waveform samples from a given mel spectrogram extracted from an original sample; we refer to this process as “self-vocoding.” By providing each vocoder with the same mel spectrogram, we ensure that any unique artifacts present in the synthesized samples are attributable to the specific vocoder used to reconstruct the audio signal (a minimal sketch of this pipeline is given after the list below). We withhold a set of real samples to use as a validation set in the training process. Specifically, we design the LibriVoc dataset as follows:
- 25% of the speakers contribute only real (original) samples.
- 25% of the speakers contribute only synthesized samples.
- For each speaker in the remaining 50%, half of that speaker's samples are kept real and the other half are synthesized.

This allocation ensures that our classifier does not overfit to speaker identity during training. We further split the whole dataset into three non-overlapping sets for training (33,236 samples), development (5,736 samples), and testing (4,837 samples).
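The self-vocoding step can be sketched roughly as follows. This is only an illustration: the file path, the mel-analysis parameters, and the `vocoder.infer` interface are placeholders rather than the exact LibriVoc configuration.

```python
import librosa
import numpy as np

# Load an original (real) LibriTTS-style utterance; the path is a placeholder.
wav, sr = librosa.load("libritts_sample.wav", sr=24000)

# Extract the mel spectrogram that conditions the vocoder.
# These analysis parameters are illustrative, not the exact LibriVoc setup.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))

# `vocoder` stands in for any of the six pretrained vocoders
# (WaveNet, WaveRNN, MelGAN, Parallel WaveGAN, WaveGrad, DiffWave).
# Its interface here is hypothetical; each vocoder resynthesizes a waveform
# from the same conditioning mel spectrogram, so any artifacts in the output
# are attributable to that vocoder alone.
# resynthesized = vocoder.infer(log_mel)
```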
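The 25% / 25% / 50% speaker allocation can likewise be sketched in a few lines. The function below is hypothetical and only mirrors the rule stated in the list above; it is not the actual LibriVoc preparation script.

```python
import random

def allocate(utterances_by_speaker, seed=0):
    """Split utterances into real vs. to-be-synthesized pools by speaker."""
    rng = random.Random(seed)
    speakers = list(utterances_by_speaker)
    rng.shuffle(speakers)

    n = len(speakers)
    real_only = set(speakers[: n // 4])          # 25%: only real samples
    synth_only = set(speakers[n // 4 : n // 2])  # 25%: only synthesized samples

    real, synth = [], []
    for spk, utts in utterances_by_speaker.items():
        if spk in real_only:
            real.extend(utts)
        elif spk in synth_only:
            synth.extend(utts)
        else:
            # Remaining 50% of speakers: half of each speaker's utterances
            # stay real, the other half are routed through self-vocoding.
            utts = list(utts)
            rng.shuffle(utts)
            half = len(utts) // 2
            real.extend(utts[:half])
            synth.extend(utts[half:])
    return real, synth
```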