Description
Hello, thank you very much for your excellent work and for releasing the code. I have carefully studied both the paper and the implementation, and I would like to ask for clarification regarding several points I found during reproduction.
First, based on my experiments, the source singing voice apparently must be provided at 44.1 kHz for the inference pipeline to work properly. Although the inference process includes a downsampling step, the model fails to produce reasonable results when the input audio is supplied directly at 16 kHz. This suggests that the system fundamentally relies on a 44.1 kHz input, which makes it unclear how the voice super-resolution capability claimed in the paper is realized in practice.
Second, I noticed that in the released inference code, F0 is extracted from the 44.1 kHz waveform, whereas the paper states that F0 is extracted from the downsampled 16 kHz audio. This discrepancy between the paper and the implementation leaves the intended design choice unclear.
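To make the F0 point concrete, here is a minimal sketch of how one might check whether the extraction sample rate changes the estimate. It is not taken from the repository or the paper: the autocorrelation estimator, the linear-interpolation resampler, and all function names (`sine`, `resample_linear`, `estimate_f0`) are hypothetical stand-ins that use only the Python standard library, and the input is a synthetic 220 Hz tone rather than real singing.

```python
import math


def sine(freq_hz, sr, duration_s):
    """Synthetic test tone standing in for a sung note (assumption)."""
    n = int(sr * duration_s)
    return [math.sin(2.0 * math.pi * freq_hz * i / sr) for i in range(n)]


def resample_linear(x, sr_in, sr_out):
    """Naive linear-interpolation resampler with no anti-aliasing filter.

    Adequate only for this low-frequency test tone; real audio would need
    a band-limited resampler.
    """
    n_out = int(len(x) * sr_out / sr_in)
    out = []
    for j in range(n_out):
        pos = j * sr_in / sr_out
        i = int(pos)
        frac = pos - i
        nxt = x[i + 1] if i + 1 < len(x) else x[i]
        out.append((1.0 - frac) * x[i] + frac * nxt)
    return out


def estimate_f0(x, sr, fmin=150.0, fmax=400.0):
    """Autocorrelation-peak F0 estimate with parabolic refinement.

    The search range spans less than one octave below the expected pitch
    so that the lag of two periods cannot be picked by mistake.
    """
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    n = len(x) - lag_max - 1  # same number of terms for every lag

    def acf(lag):
        return sum(x[i] * x[i + lag] for i in range(n))

    # One extra lag on each side so the parabola fit always has neighbours.
    scores = {lag: acf(lag) for lag in range(lag_min - 1, lag_max + 2)}
    best = max(range(lag_min, lag_max + 1), key=lambda lag: scores[lag])

    a, b, c = scores[best - 1], scores[best], scores[best + 1]
    denom = a - 2.0 * b + c
    offset = 0.5 * (a - c) / denom if denom != 0.0 else 0.0
    return sr / (best + offset)


if __name__ == "__main__":
    tone_44k = sine(220.0, 44100, 0.1)
    tone_16k = resample_linear(tone_44k, 44100, 16000)
    print("F0 at 44.1 kHz:", estimate_f0(tone_44k, 44100))
    print("F0 at 16 kHz:  ", estimate_f0(tone_16k, 16000))
```

On a clean tone the two estimates should agree closely, so any large difference on real recordings would more likely come from the specific F0 extractor or from frame and hop alignment than from the sample rate itself. Running the same comparison with the repository's actual F0 extractor would be the meaningful test.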
Could you please clarify:
1. Whether 44.1 kHz input audio is a strict requirement during inference, and if so, how this aligns with the claimed audio super-resolution setting?
2. Why F0 is extracted at 44.1 kHz in the inference code instead of 16 kHz as stated in the paper?
Thank you very much for your time and for your valuable work. I believe this clarification would be very helpful for researchers attempting to faithfully reproduce the method.