- Team members : 楊佳誠、邱以中、蔡桔析
- This is the final project for the NYCU_DLP course.
- After reading the paper "Palette: A Simple, General Framework for Image-to-Image Translation," we found it interesting to investigate whether its ablation results also hold in the audio domain.
- We compare L1 and L2 losses and evaluate the significance of self-attention and normalization in the audio diffusion architecture.
- We base our implementation on this GitHub repository: https://github.com/teticio/audio-diffusion
- The audio diffusion backbone utilizes a U-Net architecture.
- The class_embedding of the UNet2DModel is used to inject the class condition: the embedded class label is combined with the time embedding and treated as part of the model's conditional input.
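A minimal sketch of this conditioning setup, using the diffusers UNet2DModel API (the resolution, channel counts, and batch sizes below are illustrative, not our exact configuration):

```python
import torch
from diffusers import UNet2DModel

# Single-channel Mel-spectrogram UNet with one learned embedding per ESC-50 class.
# The embedded class label is added to the time embedding inside the model.
model = UNet2DModel(
    sample_size=256,      # Mel-spectrogram resolution (assumed)
    in_channels=1,
    out_channels=1,
    num_class_embeds=50,  # 50 ESC-50 classes
)

noisy_mels = torch.randn(8, 1, 256, 256)
timesteps = torch.randint(0, 1000, (8,))
labels = torch.randint(0, 50, (8,))

# The class label is passed alongside the timestep at every step.
noise_pred = model(noisy_mels, timesteps, class_labels=labels).sample
```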
- ESC-50 consists of 5-second-long recordings organized into 50 semantic classes, with 40 examples per class.
- The dataset spans five main categories: animals, natural soundscapes, human non-speech sounds, interior sounds, and exterior noises.
- The .wav data is preprocessed into Mel spectrograms using audio_diffusion/scripts/audio_to_images.py.
- The Mel spectrograms are normalized according to the experiment setting.
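For intuition, here is a minimal sketch of the .wav-to-Mel-spectrogram conversion that audio_to_images.py performs; the librosa parameters and the min-max normalization are illustrative, not the script's exact settings:

```python
import numpy as np
import librosa

# Load one ESC-50 clip and convert it to a log-scaled Mel spectrogram.
y, sr = librosa.load("path/to/clip.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=256)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Scale to [0, 255] so the spectrogram can be stored as a grayscale image;
# the normalization scheme is one of the settings we vary in our experiments.
image = (255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())).astype(np.uint8)
```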
- We write our code in the audio_diffusion/scripts folder.
- Use train_unet.py to train the model.
- Load the preprocessed data folder from the path where audio_to_images.py writes its output (a loading sketch follows).
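A minimal loading sketch, assuming audio_to_images.py saves a Hugging Face dataset to disk (the path is illustrative):

```python
from datasets import load_from_disk

# Folder produced by audio_to_images.py (illustrative path).
dataset = load_from_disk("data/esc50-mel")

# Each record is expected to hold a Mel-spectrogram image and its ESC-50 label.
print(dataset)
```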
- Use audio_diffusion/scripts/test_cond_model.py to generate samples. This program generates 40 .wav files for each of the 50 classes in ESC-50.
- Several things need to be modified before you run this code:
- Modify the parser arguments.
- Replace the path in line 171 with your pretrained UNet weights (e.g., /unet/diffusion_pytorch_model.bin); see the loading sketch after this list.
- Modify the model_index.json file in your saved model path:
"mel": [
"audio_diffusion", # change null to "audio_diffusion"
"Mel"
],
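A minimal sketch of what the weight replacement in line 171 amounts to, with illustrative paths rather than the project's exact layout:

```python
import torch
from diffusers import UNet2DModel

# Rebuild the UNet from the saved model directory...
unet = UNet2DModel.from_pretrained("path/to/saved_model/unet")

# ...or load the raw pretrained weights directly into an existing model.
state_dict = torch.load("path/to/unet/diffusion_pytorch_model.bin", map_location="cpu")
unet.load_state_dict(state_dict)
```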
- We use our model to generate audio for all 50 classes, producing 40 samples per class as evaluation data.
- The Fréchet Audio Distance (FAD) measures the similarity between the evaluation data and the original data. A lower FAD score indicates a closer match between the distributions of the generated and real audio.
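Concretely, FAD fits a multivariate Gaussian to embeddings (typically VGGish features) of each audio set and computes the Fréchet distance between the real (r) and generated (g) distributions:

$$\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $\mu$ and $\Sigma$ denote the mean and covariance of each embedding set.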
- The CA score, computed with pretrained Contrastive Language-Audio Pretraining (CLAP), assesses whether our model can generate sounds of the correct class.
- To compute FAD and CA, the path should contain 50 folders named 0 to 49 according to their labels. Each folder should contain 40 .wav files generated from the same class.
- Take a look at audio_evaluate/Predict/L2 as an example.
- Use audio_evaluate/evaluate.py to compute the FAD score.
dir_1 = path of the ground-truth .wav files.
dir_2 = path of the generated .wav files.
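For reference, a minimal sketch of such a computation using the frechet_audio_distance package; this backend is an assumption (evaluate.py may use a different implementation), and the two directories mirror dir_1 and dir_2 above:

```python
from frechet_audio_distance import FrechetAudioDistance

# VGGish-based FAD between the ground-truth and generated directories.
fad = FrechetAudioDistance(model_name="vggish", use_pca=False, use_activation=False, verbose=False)
score = fad.score("path/to/ground_truth_wavs", "path/to/generated_wavs")
print(f"FAD: {score:.3f}")
```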
- Clone this GitHub repository: https://github.com/LAION-AI/CLAP
- Use audio_evaluate/result/CA/zero-shot-classification/CLASS.PY to compute CA.
- Download the pretrained weights from https://huggingface.co/lukewys/laion_clap/blob/main/630k-audioset-best.pt.
- Set esc50_test_dir to the path of your generated files (a classification sketch follows).
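A minimal sketch of the zero-shot classification idea behind the CA score, using the laion_clap package; the class prompts, paths, and accuracy bookkeeping are illustrative and may differ from CLASS.PY:

```python
import glob
import numpy as np
import laion_clap

# Load the pretrained CLAP checkpoint downloaded above.
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt("630k-audioset-best.pt")

# Text prompts for the ESC-50 classes (illustrative wording; 50 prompts in total).
class_names = ["dog", "rain", "crying baby", "door knock", "helicopter"]
text_embed = model.get_text_embedding([f"This is a sound of {c}." for c in class_names])

# Embed the generated clips of one class folder and classify by similarity.
files = sorted(glob.glob("audio_evaluate/Predict/L2/0/*.wav"))
audio_embed = model.get_audio_embedding_from_filelist(x=files)
pred = np.argmax(audio_embed @ text_embed.T, axis=1)
accuracy = np.mean(pred == 0)  # folder "0" holds class-0 samples
print(f"CA for class 0: {accuracy:.2%}")
```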