Inquiry Regarding Audio Spectrogram Transformer #128
Replies from the maintainer (Yuan), addressing in order the three questions in the original post below.

This could be due to many reasons, but I do not have time to debug (and do not have the information). One possible reason: ESC-50 is balanced, so accuracy is a good metric there, but your dataset might be imbalanced. In that case the model is biased toward the majority class, and you would need to turn on class balancing. Accuracy is not a good measure when the dataset is not balanced.
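For reference, one common way to turn on class balancing in PyTorch is a `WeightedRandomSampler` that draws minority-class clips as often as majority-class ones. A minimal, self-contained sketch with synthetic placeholder data (not the AST recipe's own balancing code):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Synthetic stand-in for an imbalanced training set: 90% class 0, 10% class 1.
labels = np.array([0] * 900 + [1] * 100)
features = torch.randn(len(labels), 16)
train_set = TensorDataset(features, torch.as_tensor(labels))

# Inverse-frequency weights: every class is drawn with roughly equal probability.
class_counts = np.bincount(labels)
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(torch.as_tensor(sample_weights, dtype=torch.double),
                                num_samples=len(labels), replacement=True)

# The sampler replaces shuffle=True in the DataLoader.
loader = DataLoader(train_set, batch_size=50, sampler=sampler)
batch_labels = next(iter(loader))[1]
print(batch_labels.float().mean())  # roughly 0.5 instead of roughly 0.1
```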
If you wish to get the last-layer embedding, you should let the model return the tensor computed at line 184 of ast_models.py (commit 31088be), i.e., the representation just before it is passed to the final mlp_head.
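An alternative that avoids editing the file is to attach a forward hook to the classification head and capture its input. A minimal sketch, assuming the repo's `ASTModel` import and that the head is named `mlp_head` (verify both against your checkout); the 50-class head, 400-frame input, and random spectrogram are placeholders for the setup described in this thread:

```python
import torch
from models import ASTModel  # import path depends on where you run; see the repo README

# Placeholder configuration: 50 classes, 400 time frames (4-second clips).
model = ASTModel(label_dim=50, input_tdim=400, imagenet_pretrain=True)
model.eval()

captured = {}

def grab_embedding(module, inputs, output):
    # inputs[0] is the tensor fed into mlp_head: the averaged
    # [CLS]/distillation-token embedding, shape (batch, 768) for AST-base.
    captured['embedding'] = inputs[0].detach()

hook = model.mlp_head.register_forward_hook(grab_embedding)

spec = torch.randn(1, 400, 128)  # (batch, time frames, mel bins); use real fbanks in practice
with torch.no_grad():
    _ = model(spec)
hook.remove()

print(captured['embedding'].shape)  # expected: torch.Size([1, 768])
```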
That is totally possible: you can simply prepare the two datasets and replace the original dataset in our recipe. But you might lose some performance due to the training/test mismatch. This will be true for all models, not just AST. -Yuan
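In practice the swap comes down to pointing the recipe's train and eval JSON datafiles at the two corpora. A hedged sketch of generating such files; the directory names and label rule are hypothetical, so mirror the field layout and label ids of the datafiles your recipe already uses:

```python
import json
import os

def make_datafile(wav_dir, label_of, out_path):
    """Write an AST-style datafile: {"data": [{"wav": ..., "labels": ...}, ...]}."""
    entries = [{'wav': os.path.join(wav_dir, name), 'labels': label_of(name)}
               for name in sorted(os.listdir(wav_dir)) if name.endswith('.wav')]
    with open(out_path, 'w') as f:
        json.dump({'data': entries}, f, indent=1)

# Hypothetical label rule: class name encoded in the file name, e.g. 'clip_003_cough.wav'.
label_of = lambda name: name.rsplit('_', 1)[-1][:-4]

# Fine-tune on the English corpus, evaluate on the Chinese one.
make_datafile('english_wavs', label_of, 'train_data.json')
make_datafile('chinese_wavs', label_of, 'eval_data.json')
```

The resulting paths then go wherever the ESC-50 run script points its training and evaluation datafiles.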
Original post:

I am a graduate student from China, and our team recently had the privilege of studying your paper on the Audio Spectrogram Transformer. We were truly impressed by the content and scope of your work, and it has sparked a great deal of interest within our team. We replicated your results on the ESC-50 dataset, but as we proceeded to fine-tune the model on our own dataset, we encountered several challenges and would greatly appreciate your guidance.
1. Our dataset consists of 2400 samples; each audio clip is 4 seconds long. We set the audio_length parameter to 400 and timem to 80, replaced the labels, and kept the rest of the recipe consistent with ESC-50. We downloaded an AudioSet-pretrained model and followed the same procedure as for replicating ESC-50. We are pleased with the final result: accuracy reaches 0.9. What surprises us, however, is that the average precision is only between 0.3 and 0.5. Why could this be? (See the diagnostic sketch after this list.)
2. We understand that your model projects spectrograms to embeddings (if our understanding is incorrect, please forgive us). After fine-tuning the model, we want to process new speech data and obtain its embeddings. Could you please guide us on how to do this?
3. For example, if we fine-tune a pre-trained model on an English dataset and then validate the fine-tuned model on a Chinese dataset, can we simply set the training set to the English data and the validation set to the Chinese data during fine-tuning?
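Regarding question 1: a quick way to test the imbalance explanation from the reply above is to compare overall accuracy against per-class average precision on the validation fold. A minimal scikit-learn sketch; the synthetic labels and scores are placeholders for your real validation outputs:

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(0)

# Synthetic stand-in: replace with your real labels (N,) and
# per-class scores (N, C) collected after fine-tuning.
num_classes = 6
y_true = rng.choice(num_classes, size=1000, p=[.5, .3, .1, .05, .03, .02])
y_score = rng.random((1000, num_classes))
y_score[np.arange(1000), y_true] += 1.0  # make scores mildly informative

y_pred = y_score.argmax(axis=1)
print('accuracy:', accuracy_score(y_true, y_pred))

# One-hot labels so average precision can be computed per class.
y_onehot = np.eye(num_classes)[y_true]
ap = [average_precision_score(y_onehot[:, c], y_score[:, c])
      for c in range(num_classes)]
print('per-class AP:', np.round(ap, 3))
print('mAP:', np.mean(ap))
print('support:', np.bincount(y_true, minlength=num_classes))
# High accuracy alongside low AP on the rare classes indicates the
# majority-class bias described in the reply above.
```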