Thank you so much for your fantastic work!
I found a few videos in your dataset that are shorter than 60 s. When I use your frame extraction script to extract frames at 1 fps from one of these videos, I get fewer than 60 frames; however, the corresponding audio feature in the vggish folder still has shape [60, 128].
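To make the mismatch concrete, here is a minimal sketch of the check I am doing. The function names and the 52.4 s duration are hypothetical examples of mine, not taken from your scripts, and I am assuming the extractor keeps roughly one frame per full second (it may round differently).

```python
import math


def expected_frame_count(duration_s: float, fps: float = 1.0) -> int:
    """Frames I would expect when sampling a clip of duration_s seconds at fps.

    Assumption: one frame per full sampling interval; the real script may differ by one.
    """
    return math.floor(duration_s * fps)


def alignment_gap(num_frames: int, feature_shape: tuple) -> int:
    """Number of audio feature rows that have no matching extracted frame."""
    return feature_shape[0] - num_frames


# Hypothetical clip shorter than 60 s, with the [60, 128] feature shape I observed.
num_frames = expected_frame_count(52.4)
gap = alignment_gap(num_frames, (60, 128))
print(f"frames: {num_frames}, audio rows: 60, unmatched audio rows: {gap}")
```

A positive gap is what I see for the short videos, and I am not sure which audio rows correspond to which frames in that case.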
I would be very grateful if you could let me know how to align the audio features with the frames from the same video.