Music has always been a huge part of our daily routine, and the kind of music we listen to has a strong hold on our emotions. The signals carried by audio snippets from a piece of music can be captured and visually represented as a spectrogram, a plot of the different frequency signatures against time. On the heatmaps generated, we can distinguish the power of different patches of the music based on the intensity of those regions.
In this repository, we propose an end-to-end pipeline for identifying the type of emotion induced by different genres of music by studying the mel-spectrograms of audio snippets of Hindi music from the publicly available MER500 dataset on Kaggle. The audio snippets are pre-processed to extract the corresponding feature space, which is then passed as input to different pre-trained CNN architectures.
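Below is a minimal sketch of how such a mel-spectrogram heatmap can be produced. The library choice (librosa), the file path, and the parameters are illustrative assumptions, not necessarily what this repository uses:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Hypothetical clip path; MER500 clips are roughly 10 seconds long.
path = "MER500/Happy/clip_001.wav"

# Load the audio at 16 kHz (the sampling rate described later in this README).
y, sr = librosa.load(path, sr=16000)

# Compute a mel-scaled spectrogram and convert power to decibels,
# so intensity differences between patches of the clip are easier to see.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

# Render the heatmap: mel-frequency bins vs. time, colored by power.
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel-spectrogram of one audio snippet")
plt.tight_layout()
plt.show()
```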
The MER500 Dataset
consists of songs in 5 popular emotional categories of the Hindi film industry:-
- Romantic
- Happy
- Sad
- Devotional
- Party
It has approximately 100 audio files (song clips of about 10 seconds each) per class label.
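As an illustrative sketch, the clips can be enumerated into (path, label) pairs. The folder names and layout below are assumptions about how the Kaggle download is organized, not something this repository guarantees:

```python
from pathlib import Path

# Assumed layout: one sub-folder per emotion class, each holding ~100 clips.
DATA_ROOT = Path("MER500")
CLASSES = ["Romantic", "Happy", "Sad", "Devotional", "Party"]

samples = []  # list of (file path, integer label) pairs
for label, emotion in enumerate(CLASSES):
    for wav in sorted((DATA_ROOT / emotion).glob("*.wav")):
        samples.append((wav, label))

print(f"Found {len(samples)} clips across {len(CLASSES)} classes")
```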
This repository focuses on the internal workings of Pre-Trained Convolutional Neural Networks (CNNs), with the following architectures (a sketch of how they are loaded and adapted appears after the list):-
AlexNet
:- It has 8 layers with learnable parameters: 5 convolutional layers (some followed by Max Pooling) and 3 fully connected layers. ReLU activation is used in each of these layers except the output layer.
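To see this layer structure directly, here is a quick inspection sketch; torchvision's AlexNet implementation is assumed, which follows the 5-conv + 3-FC layout described above:

```python
import torchvision.models as models

alexnet = models.alexnet(weights="IMAGENET1K_V1")

# 'features' holds the 5 convolutional layers interleaved with ReLU and MaxPool,
# 'classifier' holds the 3 fully connected layers (the last one is the output layer).
print(alexnet.features)
print(alexnet.classifier)
```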
VGG-16
:- It was proposed by Karen Simonyan and Andrew Zisserman in 2014 in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition".
MobileNetV3-Small
:- It is a lightweight, computationally efficient convolutional neural network designed for mobile vision applications.
ResNet-18
:- It is a convolutional neural network that is 18 layers deep. ResNet includes several residual blocks that consist of convolutional layers, batch normalization layers and ReLU activation functions. We used the pretrained ResNet-18 model to extract features from the mel-spectrogram images.
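A minimal sketch of ResNet-18 as a feature extractor (torchvision is assumed; the final fully connected layer is swapped for an identity so the 512-dimensional pooled features are returned):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load ImageNet-pretrained ResNet-18 and drop its 1000-class head,
# so the forward pass returns the 512-dim pooled feature vector instead.
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()
backbone.eval()

# A dummy batch of spectrogram images: 3 channels, 224x224, for illustration.
spectrograms = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    features = backbone(spectrograms)
print(features.shape)  # torch.Size([4, 512])
```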
DenseNet-121
:- It has 120 convolutional layers and 4 Average Pooling layers; its dense connectivity allows feature reuse and requires fewer parameters. This results in more compact models that achieve SOTA performance across competitive datasets compared to their standard CNN or ResNet counterparts.
EfficientNet-B0
:- The EfficientNet family is computationally efficient and achieves SOTA results on the ImageNet dataset (84.4% Top-1 Accuracy for the largest variant). The base network was developed using a Multi-Objective Neural Architecture Search that optimizes both accuracy and floating-point operations.
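Below is a hedged sketch of how these pre-trained backbones could be loaded and adapted to the 5 emotion classes. torchvision is assumed, and the per-architecture classifier-head names reflect torchvision's implementations rather than this repository's exact code:

```python
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 5  # Romantic, Happy, Sad, Devotional, Party

def build_model(name: str) -> nn.Module:
    """Load an ImageNet-pretrained backbone and replace its head with a 5-class layer."""
    if name == "alexnet":
        m = models.alexnet(weights="IMAGENET1K_V1")
        m.classifier[6] = nn.Linear(4096, NUM_CLASSES)
    elif name == "vgg16":
        m = models.vgg16(weights="IMAGENET1K_V1")
        m.classifier[6] = nn.Linear(4096, NUM_CLASSES)
    elif name == "mobilenet_v3_small":
        m = models.mobilenet_v3_small(weights="IMAGENET1K_V1")
        m.classifier[3] = nn.Linear(m.classifier[3].in_features, NUM_CLASSES)
    elif name == "resnet18":
        m = models.resnet18(weights="IMAGENET1K_V1")
        m.fc = nn.Linear(m.fc.in_features, NUM_CLASSES)
    elif name == "densenet121":
        m = models.densenet121(weights="IMAGENET1K_V1")
        m.classifier = nn.Linear(m.classifier.in_features, NUM_CLASSES)
    elif name == "efficientnet_b0":
        m = models.efficientnet_b0(weights="IMAGENET1K_V1")
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, NUM_CLASSES)
    else:
        raise ValueError(f"unknown architecture: {name}")
    return m

model = build_model("resnet18")
```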
As far as the fine-tuning configuration is concerned, the following steps were performed (a training sketch follows the list):-
- The input audio data was sampled at a frequency of **16 kHz** and padded via shorter sliding windows, followed by **normalization** toward a single-peak, zero-centered Gaussian distribution.
- The audio signals were represented as image-based data via **Mel-Spectrograms**.
- The **Train-Test Split** was set to **80-20**, the **Batch Size** was set to **64** for the best set of results, and the **Number of Epochs** was set to **5** for quick results.
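A hedged end-to-end sketch of this configuration, reusing `samples` and `model` from the earlier sketches; the dataset wrapper, transforms, and optimizer choice below are illustrative assumptions rather than this repository's exact code:

```python
import torch
import torchaudio
from torch.utils.data import DataLoader, Dataset, random_split

SAMPLE_RATE = 16_000
CLIP_SAMPLES = SAMPLE_RATE * 10  # clips are roughly 10 s long
mel = torchaudio.transforms.MelSpectrogram(sample_rate=SAMPLE_RATE, n_mels=128)

class MERDataset(Dataset):
    """Turns (path, label) pairs into normalized mel-spectrogram tensors."""
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        wave, sr = torchaudio.load(str(path))
        wave = torchaudio.functional.resample(wave, sr, SAMPLE_RATE).mean(0)  # mono, 16 kHz
        # Pad shorter clips (and trim longer ones) to a fixed length.
        pad = max(0, CLIP_SAMPLES - wave.numel())
        wave = torch.nn.functional.pad(wave, (0, pad))[:CLIP_SAMPLES]
        spec = mel(wave).log1p()
        spec = (spec - spec.mean()) / (spec.std() + 1e-8)  # normalize toward zero mean, unit variance
        return spec.unsqueeze(0).repeat(3, 1, 1), label    # 3 channels for ImageNet-style CNNs

# 80-20 train-test split, batch size 64, 5 epochs, as listed above.
dataset = MERDataset(samples)
n_train = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()
for epoch in range(5):
    for specs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(specs), labels)
        loss.backward()
        optimizer.step()
```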