VCTK Training #27

AWAS666 · 2023-08-28T14:06:01Z

AWAS666
Aug 28, 2023

Has anyone started training on the VCTK dataset yet?
I'm not sure how long training would take, but I have some spare GPU power every now and then (2x 3090), so I could train it on and off, though that would take a while.

p0p4k · 2023-08-28T14:16:52Z

p0p4k
Aug 28, 2023
Maintainer

See #13. You might have to train it from scratch.

0 replies

AWAS666 · 2023-08-28T20:45:52Z

AWAS666
Aug 28, 2023
Author

Finally got it training, I think. The current version of VCTK (0.92) seems to be missing ~500 audio files that are referenced in the filelist.
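
For anyone hitting the same problem, here is a minimal sketch of how the filelist could be checked against the files on disk. It assumes each filelist line starts with the audio path followed by `|`-separated fields (for example `path|speaker_id|text`); that layout and both function names are assumptions for illustration, not something confirmed in this thread.

```python
from pathlib import Path


def find_missing_audio(filelist_path, sep="|"):
    """Return the audio paths named in a filelist that do not exist on disk."""
    missing = []
    with open(filelist_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumed layout: the first field is the audio path.
            audio_path = line.split(sep, 1)[0]
            if not Path(audio_path).is_file():
                missing.append(audio_path)
    return missing


def write_filtered_filelist(filelist_path, out_path, sep="|"):
    """Copy the filelist, keeping only entries whose audio file exists.

    Returns the number of entries kept.
    """
    missing = set(find_missing_audio(filelist_path, sep))
    kept = 0
    with open(filelist_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            entry = line.strip()
            if entry and entry.split(sep, 1)[0] not in missing:
                dst.write(entry + "\n")
                kept += 1
    return kept
```

Running `find_missing_audio` first shows how many entries are affected. `write_filtered_filelist` then produces a cleaned copy, so the original filelist stays untouched.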

0 replies

AWAS666 · 2023-08-30T06:34:35Z

AWAS666
Aug 30, 2023
Author

Only checked the tensorboard, but I think it's getting there:
Currently at 32000 steps, batch size 32, dual gpu:
https://drive.google.com/drive/folders/1OGlsgBfKjd6coCTNlF3_z2ZVmOfLZOH-?usp=sharing

@p0p4k is it worth going for a full training run of several hundred thousand steps, or should I wait for more things to be fixed/adjusted first?

3 replies

p0p4k Aug 30, 2023
Maintainer

How is the audio quality in validation dataset? Does it sound better than VITS-1?

AWAS666 Aug 30, 2023
Author

Like I said, I didn't have time today to check out anything more than the one audio sample that TensorBoard had. It sounded like it could use some more training.

AWAS666 Aug 30, 2023
Author

It definitely needs more training, I feel. I'll leave it running for another night.

AWAS666 · 2023-08-31T06:57:35Z

AWAS666
Aug 31, 2023
Author

Example.zip
I compared it to Coqui's VITS VCTK model and, quality-wise, it is kind of hard to say.
I may have to normalize the speech in my dataset, as it is awfully quiet. The output is also still rather fast, but I think there was another error related to that which is fixed now, right?

This is at 76k steps now, by the way.
Examples are in the attached zip file: both Coqui's model and my trained one, on speaker p225.

2 replies

p0p4k Aug 31, 2023
Maintainer

You can volume-normalize the output audio NumPy array before converting it to a WAV file. Do it for both files and see if it makes any difference (open each file as a NumPy array, volume-normalize it, save it, and compare).

import numpy as np


def volume_norm(*, x: np.ndarray, coef: float = 0.95, **kwargs) -> np.ndarray:
    """Normalize the volume of an audio signal.

    Args:
        x (np.ndarray): Raw waveform.
        coef (float): Coefficient to rescale the maximum value. Defaults to 0.95.

    Returns:
        np.ndarray: Volume normalized waveform.
    """
    # Scale so the largest absolute sample equals coef (assumes x is not all zeros).
    return x / np.abs(x).max() * coef
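
The open, normalize, save, and compare steps could look roughly like the sketch below. It assumes 16-bit PCM WAV files and uses only NumPy plus the standard-library `wave` module. `normalize_wav` is a hypothetical helper name, and the block repeats a small `volume_norm` (with a guard for silent input) so that it runs on its own.

```python
import wave

import numpy as np


def volume_norm(x, coef=0.95):
    """Peak-normalize a waveform; empty or silent input is returned unchanged."""
    peak = np.abs(x).max() if x.size else 0
    return x if peak == 0 else x / peak * coef


def normalize_wav(in_path, out_path, coef=0.95):
    """Read a 16-bit PCM WAV, peak-normalize it, and write the result."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = src.readframes(params.nframes)
    # Scale int16 samples to floats in [-1, 1) before normalizing.
    audio = np.frombuffer(frames, dtype="<i2").astype(np.float32) / 32768.0
    audio = volume_norm(audio, coef=coef)
    # Convert back to int16; coef <= 1 keeps samples inside the valid range.
    pcm = np.round(audio * 32767.0).astype("<i2")
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)  # same channels, sample width, and sample rate
        dst.writeframes(pcm.tobytes())
```

Calling `normalize_wav` once per file (for instance on the Coqui sample and on the newly trained sample, whatever their file names are) gives two outputs with the same peak level, so the comparison is not biased by loudness.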

AWAS666 Sep 1, 2023
Author

After normalizing it, I have to say it sounds more monotone than v1, but with better pronunciation.
Examples.zip

AWAS666 · 2023-09-11T16:13:46Z

AWAS666
Sep 11, 2023
Author

Unlike in the LJSpeech discussion, I feel like my trained model is currently worse in terms of emphasis and pauses.
But the speed has also normalized with training. I'll go on for twice as long and see where this takes me (currently at 174k steps, batch size 32, single GPU).

8 replies

AWAS666 Sep 20, 2023
Author

Overall it seems to be getting there. I'm now trying another, shorter run to see whether it works better with the audio trimmed further or with the files used as is.
Otherwise I'll push it to 500k+ steps or so. I'm not really sure where to stop: g/total has been going up for a while, but g/mel is still going down, and the latest checkpoint sounds better to me than the one where g/total was lowest.

At ~375k steps:

p0p4k Sep 21, 2023
Maintainer

With generative models that are judged by subjective opinion, an objective loss function does not show the whole picture. Listening to the samples (MOS) is the only way to go.

p0p4k Sep 21, 2023
Maintainer

Also try a training run with dur_disc on. Let me know how it performs. Thanks.

AWAS666 Sep 28, 2023
Author

Without the VAD cleaning it failed spectacularly, quite wild.
I will post some examples of the model without dur_disc later, together with the model at 800k steps at batch size 32.
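
To illustrate the kind of cleaning meant here, below is a rough sketch of trimming leading and trailing silence with a simple energy gate. This is not a real VAD and not necessarily what was used for these runs. The function name, the 20 ms frame size, the -40 dB threshold, and the 50 ms padding are all assumptions chosen for illustration.

```python
import numpy as np


def trim_silence(x, sr, frame_ms=20, threshold_db=-40.0, pad_ms=50):
    """Trim leading and trailing low-energy audio from a mono waveform.

    A frame counts as silence when its RMS level, measured relative to the
    loudest frame, is at or below threshold_db.
    """
    frame_len = max(1, int(sr * frame_ms / 1000))
    n_frames = len(x) // frame_len
    if n_frames == 0:
        return x  # shorter than one frame; nothing to analyze
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    if rms.max() == 0:
        return x  # entirely silent; nothing to anchor the trim on
    # Frame level in dB relative to the loudest frame (0 dB).
    level_db = 20 * np.log10(np.maximum(rms, 1e-10) / rms.max())
    voiced = np.flatnonzero(level_db > threshold_db)
    # Keep a little padding so word onsets and tails are not clipped.
    pad = int(sr * pad_ms / 1000)
    start = max(0, voiced[0] * frame_len - pad)
    end = min(len(x), (voiced[-1] + 1) * frame_len + pad)
    return x[start:end]
```

Only the edges are trimmed; pauses inside an utterance are left alone, since those carry the timing the model is supposed to learn.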

p0p4k Sep 29, 2023
Maintainer

Interesting. Let's try to solve this issue.

AWAS666 · 2023-09-28T19:19:55Z

AWAS666
Sep 28, 2023
Author

Checkpoints at 600k and 800kish:
https://drive.google.com/drive/folders/1PKYof9QdKr5jRWH4G-HsbGqm4di705Uv?usp=sharing

Examples with 3 speakers:
examples.zip

Tensorboard:

I will train it to 1M steps and then likely stop, as there doesn't seem to be much more improvement.
It sounds pretty good, but given multiple sentences, I think it sometimes takes too short a pause between them, depending on the speaker.

0 replies
