Usage of audio_slice_frames, sample_frames, pad #12
Comments
Hi @wade3han, yeah, I should add some comments explaining those parameters. First, To speed things up I only take the middle I hope that helps.
Well, I was training a new model from scratch using a Korean speech corpus. It contains 300 hours of utterances from various speakers, and I started getting those artifacts after I tried to use Actually, I'm not sure why those artifacts are generated... I will let you know if I figure out why. Please share your opinion if you have any ideas.
@bshall I have one question about your first reply in this thread. Instead of using 40 mel frames, why not feed only the 8 mel frames into the rnn1 layer?
Hi @dipjyoti92, sorry about the delay, I've been away. I found that using only 8 frames as input to the rnn1 layer results in the generated audio being only silence. I think 8 frames is too short for the RNN to learn to use the reset gate appropriately, although I haven't investigated this thoroughly.
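To make the 40-frame versus 8-frame distinction concrete, here is a minimal sketch of how a training window and its middle slice could be indexed. The values 40 and 8 come from this thread; the symmetric `pad` formula, the `hop_length` of 200, and the function name `sample_training_window` are assumptions for illustration and may not match the repo's actual code.

```python
import random


def sample_training_window(num_frames, sample_frames=40, audio_slice_frames=8,
                           hop_length=200, rng=random):
    """Pick a random mel window and locate its middle slice.

    Hypothetical helper: returns (mel_window, slice_frames, audio_range),
    each as a (start, end) pair with an exclusive end index.
    """
    # Assumed: the context frames are split evenly on both sides of the slice.
    pad = (sample_frames - audio_slice_frames) // 2

    # Random start so that the whole window fits inside the utterance.
    start = rng.randint(0, num_frames - sample_frames)
    mel_window = (start, start + sample_frames)

    # Only the middle frames are kept after the first recurrent layer.
    slice_frames = (start + pad, start + pad + audio_slice_frames)

    # Audio samples that line up with the kept frames (assumed hop length).
    audio_range = (slice_frames[0] * hop_length, slice_frames[1] * hop_length)
    return mel_window, slice_frames, audio_range


mel_window, slice_frames, audio_range = sample_training_window(
    num_frames=100, rng=random.Random(0))
```

Under these assumptions the first recurrent layer would see all 40 frames, while only the 8 middle frames (1600 audio samples at the assumed hop length) would be used for the autoregressive part.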
Hello @bshall! Thank you!!
Hi @macarbonneau, no problem, I'm glad you found the repo useful. I haven't tried using the end (or beginning) segments, but there's no real reason it shouldn't work. The thinking behind using the middle segment was to match the training and inference conditions as much as possible. At inference time most of the input to the autoregressive part of the model ( Hope that explains my thinking. If anything is unclear let me know. One of the negative side effects of only using the middle segment is that there are sometimes small artifacts at the beginning or end of the generated audio. For the best quality it might be worth putting in some extra time to train on the entire segment.
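One way to picture the "match training and inference conditions" argument is with a toy recurrence standing in for rnn1. This is not the repo's GRU: `toy_rnn` and its `decay` parameter are made up for illustration, and the symmetric `pad` is an assumption. The sketch only shows that outputs sliced from the middle of a longer window come from a state that has already seen context, whereas outputs from a short window start from a zero state.

```python
def toy_rnn(inputs, decay=0.5):
    """Toy recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t, with h_0 = 0."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = decay * h + (1.0 - decay) * x
        outputs.append(h)
    return outputs


sample_frames, audio_slice_frames = 40, 8          # values mentioned in this thread
pad = (sample_frames - audio_slice_frames) // 2    # assumed symmetric context

frames = [1.0] * sample_frames                     # constant stand-in for mel frames

# Run over the full window, then keep only the middle slice of the outputs.
with_context = toy_rnn(frames)[pad:pad + audio_slice_frames]

# Run over the middle frames alone, starting from a zero state.
without_context = toy_rnn(frames[pad:pad + audio_slice_frames])
```

In this toy setting `with_context` is already close to its steady value on its first kept frame, while `without_context` is still ramping up from zero. The same effect at the edges of an utterance may be related to the small artifacts mentioned above, though that link is a guess.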
Hello,
I saw that you used pad, audio_slice_frames, and sample_frames, but I can't understand the usage of those params. Can you explain what they mean?
Also, the WaveRNN model used padded mel input in the first GRU layer. However, you just slice out the padding after the first layer. Is it important to use padded mel in the first GRU?
Thanks.