10 seconds of John Belfram's 10 Days of Blue
A finetuning test using the DreamBooth method. The dataset is five 512x512 spectrograms of 10-second audio chunks from the same song as above. Trained using diffusers acceleration for 400 iterations (only a few minutes on a 4090, if I remember correctly).
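As a rough sketch of the dataset preparation described above, the snippet below cuts a mono signal into five 10-second chunks and works out how many audio samples each of the 512 image columns would have to cover. The sample rate (22050 Hz), the helper names, and the silent placeholder signal are assumptions for illustration; the actual preprocessing settings are not stated here.

```python
import numpy as np

IMAGE_WIDTH = 512  # spectrogram width in pixels, taken from the text above


def split_into_chunks(samples, sample_rate, chunk_seconds=10.0, max_chunks=5):
    """Cut a mono signal into equal, non-overlapping chunks (any remainder is dropped)."""
    chunk_len = int(round(chunk_seconds * sample_rate))
    n_chunks = min(len(samples) // chunk_len, max_chunks)
    return [samples[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


def samples_per_column(chunk_len, width=IMAGE_WIDTH):
    """Average number of audio samples that one image column must represent."""
    return chunk_len / width


sample_rate = 22050  # assumed; not stated in the original notes
# Placeholder for 55 s of decoded mono audio (silence here, a real song in practice).
song = np.zeros(sample_rate * 55, dtype=np.float32)

chunks = split_into_chunks(song, sample_rate)
hop = samples_per_column(len(chunks[0]))
print(len(chunks), len(chunks[0]), hop)
```

Under these assumptions each column covers about 431 samples, or roughly 19.5 ms of audio, so the STFT hop length would need to be close to that value for a 10-second chunk to fill a 512-pixel-wide image.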
Pretty good considering such quick finetuning.
Up to 10 seconds, the audio is indistinguishable from the original.
chunk_10_seconds.mp4
Above 10 seconds, the audio quality gets progressively worse, starting with the low end.
chunk_12_seconds.mp4
Using librosa: converting to a Mel spectrogram, then reconstructing with Griffin-Lim.
reconstructed_audio.mp4
Using Riffusion's conversion pipeline, which wraps torchaudio: again converting to a Mel spectrogram, then reconstructing with Griffin-Lim. It is very hard to find any similarity with the original audio.
output_audio.1.mp4
Spectrogram output from the finetuned model, converted to audio using the librosa code.
reconstructed_audio_from_finetuned.mp4
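Before a generated spectrogram image can go through the librosa inversion, its pixels have to be mapped back to Mel power values. The sketch below shows one possible mapping. The scaling is entirely an assumption (pixel 255 is 0 dB, pixel 0 is -80 dB, the top image row is the highest frequency band), and `image_to_mel_power` is a hypothetical helper, not part of librosa or of the code used above.

```python
import numpy as np


def image_to_mel_power(image, db_range=80.0):
    """Map an 8-bit grayscale spectrogram image to a Mel power spectrogram.

    Assumed convention (hypothetical): pixel 255 = 0 dB, pixel 0 = -db_range dB,
    and image row 0 is the highest Mel band.
    """
    pixels = np.asarray(image, dtype=np.float32)
    db = (pixels / 255.0 - 1.0) * db_range   # range [-db_range, 0] dB
    power = 10.0 ** (db / 10.0)              # dB -> power
    return power[::-1, :].copy()             # flip so row 0 is the lowest band


# Random pixels stand in for a 512x512 model output.
fake_image = np.random.default_rng(0).integers(0, 256, size=(512, 512), dtype=np.uint8)
mel_power = image_to_mel_power(fake_image)
print(mel_power.shape, float(mel_power.min()), float(mel_power.max()))
```

The resulting array could then be passed to `librosa.feature.inverse.mel_to_audio` with `n_mels` equal to the image height, provided the mapping and the STFT settings match whatever was used to create the training images.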
It's a start I guess! :D