This work was done as a sound classification project at Alibaba Israel. Link to the paper: https://arxiv.org/abs/2204.11479
@article{gazneli2022end, title={End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network}, author={Gazneli, Avi and Zimerman, Gadi and Ridnik, Tal and Sharir, Gilad and Noy, Asaf}, journal={arXiv preprint arXiv:2204.11479}, year={2022} }
utils/resample.py is mostly taken from https://github.com/danpovey/filtering/blob/master/lilfilter/resampler.py
EAT-S: emb_dim 128, nf 16, dim_feedforward 512, n_layers 4, n_head 8\
EAT-M: emb_dim 256, nf 32, dim_feedforward 2048, n_layers 6, n_head 16
ESC-50, Audioset, Urban8K, Speechcommands
The augmentations contain two types of transforms:
- label preserving (audio_augs) and label mixing: applied on the CPU while a sample is fetched in the dataset
- label mixing (batch_augs): applied on the GPU
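
As a rough illustration of what label mixing means, the sketch below blends two waveforms and their one-hot labels with a single coefficient, in the style of mixup. It shows the general idea only and is not the code in batch_augs: the function name `mix_pair`, the coefficient `lam`, the uniform sampling of `lam`, and the use of plain Python lists instead of GPU tensors are all assumptions made for this example.

```python
import random


def mix_pair(x1, y1, x2, y2, lam):
    """Blend two equal-length waveforms and their label vectors.

    Hypothetical helper for illustration; not taken from batch_augs.
    lam is the weight given to the first sample, in [0, 1].
    """
    if len(x1) != len(x2) or len(y1) != len(y2):
        raise ValueError("waveforms and labels must have matching lengths")
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# Two toy "waveforms" with one-hot labels for a 3-class problem.
dog = ([0.5, -0.5, 0.25, 0.0], [1.0, 0.0, 0.0])
rain = ([0.1, 0.1, -0.1, 0.2], [0.0, 0.0, 1.0])

# Assumed sampling scheme: one uniform coefficient per pair.
lam = random.Random(0).uniform(0.0, 1.0)
mixed_wave, mixed_label = mix_pair(*dog, *rain, lam)
```

The mixed label is soft: its entries still sum to 1, but more than one entry can be non-zero.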
The samples are downsampled to 22.05KHz and saved in wav format. To use the original samples, just modify esc_dataset to read the corresponding file type.
The samples are resampled to 22.05KHz and saved in wav format. During training, a sample is zero-padded if it is shorter than 4 seconds.
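
The zero padding described above can be sketched as follows. This is a minimal example under stated assumptions, not the dataset code from this repository: the helper name `pad_to_length` is hypothetical, the zeros are appended at the end (the actual code may pad differently), and the target of exactly 4 x 22050 samples is chosen for illustration.

```python
def pad_to_length(samples, target_len):
    """Right-pad a waveform with zeros up to target_len samples.

    Hypothetical helper; inputs that are already long enough are
    returned unchanged (as a copy).
    """
    missing = target_len - len(samples)
    if missing <= 0:
        return list(samples)
    return list(samples) + [0.0] * missing


FS = 22050            # sample rate after resampling
TARGET_LEN = 4 * FS   # 4 seconds, assumed target for this example

short_clip = [0.1] * (2 * FS)             # a 2-second clip
padded = pad_to_length(short_clip, TARGET_LEN)
```

In a real pipeline the same operation would normally be done on an array or tensor rather than a list.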
Fs=16KHz seq_len=16384 ~1sec
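
The approximate durations quoted next to each seq_len follow from dividing the number of samples by the sample rate. The short check below uses the Fs and seq_len values given in this README; the dictionary keys are only labels for this example, and the helper name `duration_seconds` is hypothetical.

```python
# (sample rate in Hz, seq_len in samples) as listed in this README.
SETTINGS = {
    "esc50": (22050, 114688),
    "audioset": (22050, 221184),
    "urban8k": (22050, 90112),
    "speechcommands": (16000, 16384),
}


def duration_seconds(fs_hz, seq_len):
    """Length of a clip of seq_len samples recorded at fs_hz."""
    return seq_len / fs_hz


for name, (fs_hz, seq_len) in SETTINGS.items():
    print(f"{name}: {duration_seconds(fs_hz, seq_len):.2f} s")
```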
Fs=22.05KHz
seq_len=221184 ~10sec
Requires preprocessing\
- convert labels to list of integers\
- resample to 22.05KHz and compress by saving in flac or ogg format\
- verify missing files, missing labels etc\
- save the data in pkl file\
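
The label conversion, file check, and pkl steps above could look roughly like the sketch below. It is only an outline under stated assumptions: the record layout, the helper names `build_metadata` and `save_metadata`, the file names, and the label names are all hypothetical, and the layout the dataset class in this repository expects may differ. Resampling and flac/ogg encoding are left out because they need an audio library.

```python
import os
import pickle
import tempfile


def build_metadata(rows, label_to_index, audio_dir):
    """Turn (file_name, [label strings]) rows into records with integer labels.

    Hypothetical helper. Rows with a missing audio file, no labels, or an
    unknown label are collected in `skipped` instead of being kept.
    """
    records, skipped = [], []
    for file_name, labels in rows:
        path = os.path.join(audio_dir, file_name)
        unknown = [name for name in labels if name not in label_to_index]
        if not os.path.isfile(path):
            skipped.append((file_name, "missing file"))
        elif not labels or unknown:
            skipped.append((file_name, "missing or unknown labels"))
        else:
            indices = [label_to_index[name] for name in labels]
            records.append({"path": path, "labels": indices})
    return records, skipped


def save_metadata(records, pkl_path):
    """Write the records to a pkl file."""
    with open(pkl_path, "wb") as f:
        pickle.dump(records, f)


# Toy run in a temporary directory with made-up file and label names.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.flac"), "wb").close()  # empty stand-in file
    rows = [
        ("a.flac", ["Speech", "Dog"]),  # kept
        ("b.flac", ["Dog"]),            # file does not exist
        ("a.flac", []),                 # no labels
    ]
    records, skipped = build_metadata(rows, {"Speech": 0, "Dog": 1}, tmp)
    pkl_path = os.path.join(tmp, "audioset_meta.pkl")
    save_metadata(records, pkl_path)
    with open(pkl_path, "rb") as f:
        reloaded = pickle.load(f)
```

Keeping the skipped rows with a reason makes it easy to see how many files or labels are missing before training starts.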
Fs=22.05KHz
seq_len = 114688 ~5sec
python trainer.py --max_lr 3e-4 --run_name r1 --emb_dim 128 --dataset esc50 --seq_len 114688 --mix_ratio 1 --epoch_mix 12 --mix_loss bce --batch_size 128 --n_epochs 3500 --ds_factors 4 4 4 4 --amp --save_path outputs\
Fs=22.05KHz
seq_len = 221184 ~10sec
EAT-M (for EAT-S, modify the network parameters)
python trainer.py --max_lr 3e-4 --run_name r1 --dataset audioset --seq_len 221184 --mix_ratio 1 --epoch_mix 2 --mix_loss bce --batch_size 208 --n_epochs 250 --scheduler onecycle --ds_factors 4 4 4 4 --save_path outputs --num_workers 32 --use_balanced_sampler --multilabel --amp --data_subtype full --use_dp --loss_type bce --augs_noise none --emb_dim 256 --nf 32 --dim_feedforward 2048 --n_layers 6 --n_head 16\
Fs=22.05KHz
seq_len = 90112 ~4sec
python trainer.py --max_lr 3e-4 --run_name r1 --emb_dim 128 --dataset urban8k --seq_len 90112 --mix_ratio 1 --epoch_mix 12 --mix_loss bce --batch_size 128 --n_epochs 3500 --ds_factors 4 4 4 4 --amp --save_path outputs\
Fs=16KHz
seq_len = 16384 ~1sec
Use use_bg to add the background noise provided with the speechcommands dataset.
python trainer.py --max_lr 3e-4 --run_name r1 --emb_dim 128 --dataset esc50 --seq_len 16384 --mix_ratio 1 --epoch_mix 12 --mix_loss bce --batch_size 128 --n_epochs 1500 --ds_factors 4 4 4 --amp --save_path outputs
python inference.py --f_res outputs/r1