This project provides several approaches for training/fine-tuning an audio gender recognition model. The code can easily be reused for any other audio classification task by changing the number of classes and the input dataset.
The dataset should be a CSV file with two columns: `audio_path` and `label`, e.g.:
```
   audio_path                                           label
0  /home/ai/projects/speech/dataset/asr/new-raw-0.wav   female
1  /home/ai/projects/speech/dataset/asr/samples_1.wav   male
2  /home/ai/projects/speech/dataset/asr/new-raw-2.wav   female
3  /home/ai/projects/speech/dataset/asr/new-raw-3.wav   male
4  /home/ai/projects/speech/dataset/asr/new-raw-4.wav   female
```
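A minimal sketch of loading this CSV and mapping the string labels to integer class ids; the file name `dataset.csv` is an assumption, not something this repo specifies:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # assumed file name; columns: audio_path, label
assert {"audio_path", "label"} <= set(df.columns)

# Map string labels to integer class ids for training
label2id = {lbl: i for i, lbl in enumerate(sorted(df["label"].unique()))}
df["label_id"] = df["label"].map(label2id)
print(df.head())
```

Reusing the code for a different classification task only changes the contents of `label2id` and the number of classes.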
- LSTM_Model: uses MFCCs to train an LSTM model for audio classification. Trained with PyTorch Lightning. (See the MFCC extraction sketch after this list.)
  - The structure is based on the LearnedVector repository, which contains a wake-word model.
- transformer_scratch: uses a transformer block to train an audio classification model that takes MFCCs as inputs. Trained with PyTorch Lightning.
  - The main implementation is taken from AnubhavGupta3377's Text-Classification-Models-Pytorch repository.
  - It has been modified to train on audio samples.
- wav2vec2: fine-tunes wav2vec2-base as an audio classification model using the Hugging Face Trainer. Trained and evaluated on a custom dataset; you can simply download the Common Voice dataset and use its samples. (A hedged fine-tuning sketch follows this list.)
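Both the LSTM and the transformer models consume MFCC features. Below is a minimal extraction sketch with torchaudio; `n_mfcc=40` and the 16 kHz target rate are illustrative assumptions, not values taken from this repo:

```python
import torch
import torchaudio

TARGET_SR = 16_000  # assumed target sample rate

# MFCC transform; n_mfcc=40 is an illustrative choice, not the repo's value
mfcc_transform = torchaudio.transforms.MFCC(sample_rate=TARGET_SR, n_mfcc=40)

def extract_mfcc(audio_path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(audio_path)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.transforms.Resample(sr, TARGET_SR)(waveform)
    mfcc = mfcc_transform(waveform)         # (1, n_mfcc, time)
    return mfcc.squeeze(0).transpose(0, 1)  # (time, n_mfcc), ready for an LSTM
```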
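For the wav2vec2 path, here is a hedged sketch of fine-tuning with the Hugging Face Trainer; the checkpoint name, hyperparameters, 5-second fixed-length padding, and the `dataset.csv` file name are assumptions, not this repo's exact settings:

```python
import pandas as pd
from datasets import Audio, Dataset
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Trainer,
    TrainingArguments,
)

model_name = "facebook/wav2vec2-base"  # assumed checkpoint
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

# Build a datasets.Dataset from the CSV described above
df = pd.read_csv("dataset.csv")
label2id = {"female": 0, "male": 1}
ds = Dataset.from_dict({
    "audio": df["audio_path"].tolist(),
    "label": [label2id[lbl] for lbl in df["label"]],
}).cast_column("audio", Audio(sampling_rate=16_000))

def preprocess(batch):
    # Pad/truncate every clip to 5 s so the default data collator can
    # stack them; a dynamic-padding collator would be more memory-efficient.
    audio = batch["audio"]
    batch["input_values"] = feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        max_length=16_000 * 5,
        truncation=True,
        padding="max_length",
    ).input_values[0]
    return batch

ds = ds.map(preprocess, remove_columns=["audio"])
splits = ds.train_test_split(test_size=0.1)

model = AutoModelForAudioClassification.from_pretrained(
    model_name,
    num_labels=2,
    label2id=label2id,
    id2label={v: k for k, v in label2id.items()},
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="wav2vec2-gender",  # assumed output directory
        per_device_train_batch_size=8,
        learning_rate=3e-5,
        num_train_epochs=5,
    ),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
)
trainer.train()
trainer.evaluate()
```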
| Model       | Train Acc (%) | Val Acc (%) | Train F1-score (%) | Val F1-score (%) |
|-------------|---------------|-------------|--------------------|------------------|
| LSTM        | 89            | 90          | 90.83              | 91               |
| Wav2vec2    | -             | 96.4        | -                  | 96.4             |
| Transformer | 85.1          | 81.7        | 87.1               | 84.6             |