voice super-resolution capability #3

@Kun899213

Hello, thank you very much for your excellent work and for releasing the code. I have carefully studied both the paper and the implementation, and I would like to ask for clarification on two points I ran into while trying to reproduce the results.

First, based on my experiments, the source singing voice apparently must be provided at 44.1 kHz for the inference pipeline to work properly. Although the inference process includes a downsampling step, the model fails to produce reasonable results when the input audio is supplied directly at 16 kHz. This suggests that the system fundamentally relies on 44.1 kHz input, which leaves it unclear how the voice super-resolution capability claimed in the paper is realized in practice.
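To make the 16 kHz case concrete, here is a minimal sketch of upsampling a 16 kHz signal to 44.1 kHz before passing it to inference. `resample_linear` is a hypothetical helper written for illustration, not a function from the released code, and a real pipeline would normally use a band-limited resampler instead of linear interpolation. Resampling only changes the sample rate: a 16 kHz recording carries no content above 8 kHz, so the upsampled signal is still band-limited.

```python
import math


def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    if not samples:
        return []
    n_out = int(round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        # Position of output sample i on the source time axis.
        pos = i * src_rate / dst_rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# One second of a 440 Hz test tone at 16 kHz (synthetic stand-in for a recording).
tone_16k = [math.sin(2.0 * math.pi * 440.0 * n / 16000) for n in range(16000)]
tone_44k = resample_linear(tone_16k, 16000, 44100)
print(len(tone_16k), len(tone_44k))  # 16000 44100
```

Whether the model behaves correctly on such an upsampled input is exactly the open question; the sketch only shows how the input rate could be matched.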

Second, in the released inference code, F0 is extracted from the 44.1 kHz waveform, whereas the paper describes F0 extraction as being performed on the downsampled 16 kHz audio. This discrepancy between the paper and the implementation makes the intended design choice unclear.
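One way to check whether the sample rate used for F0 extraction matters is to run the same estimator on the same tone at both rates. The sketch below uses a plain autocorrelation estimator as a stand-in; this is an assumption for illustration and not the extractor used in the paper or the repository. `sine` and `estimate_f0_autocorr` are hypothetical names.

```python
import math


def sine(freq_hz, sample_rate, duration_s):
    """Generate a synthetic sine tone as a list of floats."""
    n = int(sample_rate * duration_s)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def estimate_f0_autocorr(frame, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate F0 in Hz as the lag with the highest raw autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# The same 220 Hz tone, analysed at the two sample rates in question.
f0_44k = estimate_f0_autocorr(sine(220.0, 44100, 0.05), 44100)
f0_16k = estimate_f0_autocorr(sine(220.0, 16000, 0.05), 16000)
print(round(f0_44k, 1), round(f0_16k, 1))
```

Both estimates should land close to 220 Hz, differing only by the lag quantization at each rate. If the same holds for the real extractor, any remaining mismatch would more plausibly come from frame alignment (for example a hop size fixed in samples rather than in seconds) than from the pitch values themselves, although that is a guess and not something the source confirms.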

Could you please clarify:

1. Is 44.1 kHz input audio a strict requirement during inference, and if so, how does this align with the claimed audio super-resolution setting?
2. Why is F0 extracted at 44.1 kHz in the inference code instead of at 16 kHz as stated in the paper?

Thank you very much for your time and for your valuable work. I believe this clarification would be very helpful for researchers attempting to faithfully reproduce the method.
