Music-Genre-Classification-using-Deep-Learning

Data

For a project, I compared multiple deep learning models to classify, and eventually predict, music genres from 30-second audio segments. The data used for this project is the GTZAN dataset. The original data was reduced to 8 genres (800 songs) and transformed so that each song yields 15 log-transformed Mel spectrograms. Each Mel spectrogram is an image represented by a tensor of shape (80, 80, 1), which describes the time, frequency and intensity of a song segment. The training data represent 80% of the total number of data points. I implemented a parallel CNN, a CNN-RNN and a stylised model, which are briefly discussed below:
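
The figures above imply the following dataset sizes. This is only a small arithmetic sketch: it assumes the 80% split is taken over individual spectrograms (the text does not say whether the split is per song or per spectrogram), and the variable names and the stacked array layout are illustrative.

```python
# Figures taken from the description above.
n_songs = 800
spectrograms_per_song = 15
spectrogram_shape = (80, 80, 1)
train_percent = 80

# Total number of data points (one per spectrogram).
n_samples = n_songs * spectrograms_per_song

# Assumption: the split is applied to spectrograms, not to whole songs.
n_train = n_samples * train_percent // 100
n_held_out = n_samples - n_train

# Shape of the input array if all spectrograms are stacked (assumed layout).
full_shape = (n_samples, *spectrogram_shape)

print(n_samples, n_train, n_held_out, full_shape)
# prints: 12000 9600 2400 (12000, 80, 80, 1)
```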

Parallel CNN

The first model is a parallel CNN, implemented as follows:

First parallel branch:
1. one convolutional layer processing the input data with 3 square filters of size 8, padding, and a leaky ReLU activation function with slope 0.3.
2. one pooling layer which implements Max Pooling over the output of the convolutional layer, with pooling size 4.
3. a layer flattening the output of the pooling.
Second parallel branch:
1. one convolutional layer processing the input data with 4 square filters of size 4, padding, and a leaky ReLU activation function with slope 0.3.
2. one pooling layer which implements Max Pooling over the output of the convolutional layer, with pooling size 2.
3. a layer flattening the output of the pooling.
Merging branch:
1. a layer concatenating the outputs of the two parallel branches.
2. a dense layer which performs the classification of the music genres using the appropriate activation function.
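
To get a feel for how large the concatenated feature vector is, the branch sizes can be worked out by hand. The sketch below assumes "same" padding with stride 1 (so each convolution keeps the 80 x 80 spatial size) and non-overlapping pooling windows. Neither is stated explicitly above, and the helper name is hypothetical.

```python
def flattened_size(side, filters, pool):
    """Flattened length after a size-preserving convolution and max pooling.

    Assumes 'same' padding with stride 1 and a pooling stride equal to
    the pooling size.
    """
    pooled_side = side // pool
    return pooled_side * pooled_side * filters


branch_1 = flattened_size(side=80, filters=3, pool=4)  # 20 * 20 * 3
branch_2 = flattened_size(side=80, filters=4, pool=2)  # 40 * 40 * 4
merged = branch_1 + branch_2

print(branch_1, branch_2, merged)
# prints: 1200 6400 7600
```

Under these assumptions the second branch contributes most of the features that reach the dense layer.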

I use tf.keras.losses.CategoricalCrossentropy() as the loss function and mini-batch stochastic gradient descent as the optimiser. The model is trained for 50 epochs.
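
The parallel CNN and its training setup could be sketched in Keras roughly as follows. This is not the code from main.ipynb: the builder name, the "same" padding mode, the softmax output, the default SGD settings and the batch size in the commented-out fit call are all assumptions made for illustration.

```python
import tensorflow as tf

layers = tf.keras.layers


def build_parallel_cnn(input_shape=(80, 80, 1), num_genres=8):
    """Hypothetical builder for the two-branch CNN described above."""
    inputs = layers.Input(shape=input_shape)

    # First branch: 3 square filters of size 8, then max pooling of size 4.
    a = layers.Conv2D(3, kernel_size=8, padding="same")(inputs)  # "same" is assumed
    a = layers.LeakyReLU()(a)  # Keras' default slope is 0.3
    a = layers.MaxPooling2D(pool_size=4)(a)
    a = layers.Flatten()(a)

    # Second branch: 4 square filters of size 4, then max pooling of size 2.
    b = layers.Conv2D(4, kernel_size=4, padding="same")(inputs)  # "same" is assumed
    b = layers.LeakyReLU()(b)
    b = layers.MaxPooling2D(pool_size=2)(b)
    b = layers.Flatten()(b)

    # Merging branch: concatenate both branches and classify.
    merged = layers.Concatenate()([a, b])
    outputs = layers.Dense(num_genres, activation="softmax")(merged)  # softmax assumed
    return tf.keras.Model(inputs, outputs)


model = build_parallel_cnn()
model.compile(
    optimizer=tf.keras.optimizers.SGD(),  # default learning rate, an assumption
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=["accuracy"],
)

# Training would then look like this, given one-hot labels
# (x_train, y_train and the batch size are placeholders):
# model.fit(x_train, y_train, batch_size=32, epochs=50)
```

Both branches read the same input tensor, so the two filter sizes look at the spectrogram at different scales before the features are concatenated.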

CNN-RNN

To implement a CNN-RNN model, I used the reduce_dimension function to reduce the shape of each spectrogram from (80, 80, 1) to (80, 80). The model's architecture is structured as follows:

1. a convolutional layer with 8 square filters of size 4.
2. a max pooling layer that halves the dimensionality of the output.
3. a convolutional layer with 6 square filters of size 3.
4. a max pooling layer that halves the dimensionality of the output.
5. an LSTM layer with 128 units, returning the full sequence of hidden states as output.
6. an LSTM layer with 32 units, returning only the last hidden state as output.
7. a dense layer with 200 neurons and ReLU activation function.
8. a layer dropping out 20% of the neurons during training.
9. a dense layer which outputs the probabilities associated with each genre.

The training procedure is identical to that of the parallel CNN.
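
One possible reading of this architecture in Keras is sketched below. Since the input has shape (80, 80), the sketch assumes one-dimensional convolutions and pooling along the first axis, with the second axis treated as channels. The description above does not confirm this choice, and it names no activation for the convolutions, so none is set. The builder name and the softmax output are likewise assumptions.

```python
import tensorflow as tf

layers = tf.keras.layers


def build_cnn_rnn(input_shape=(80, 80), num_genres=8):
    """Hypothetical builder for the CNN-RNN described above."""
    return tf.keras.Sequential([
        layers.Input(shape=input_shape),
        # Conv1D is an assumption: it slides along the first axis only.
        layers.Conv1D(8, kernel_size=4),
        layers.MaxPooling1D(pool_size=2),  # halves the sequence length
        layers.Conv1D(6, kernel_size=3),
        layers.MaxPooling1D(pool_size=2),
        # First LSTM returns every hidden state so the next LSTM gets a sequence.
        layers.LSTM(128, return_sequences=True),
        # Second LSTM returns only its last hidden state.
        layers.LSTM(32),
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.2),  # active during training only
        layers.Dense(num_genres, activation="softmax"),  # softmax assumed
    ])


cnn_rnn = build_cnn_rnn()
cnn_rnn.compile(
    optimizer=tf.keras.optimizers.SGD(),  # default settings, an assumption
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=["accuracy"],
)
```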

Newly proposed models

In the final part, I propose two models, final_model_eff() and final_model_acc(), both of which significantly increase the classification accuracy. final_model_eff() is the more efficient of the two, while final_model_acc() attains the highest accuracy overall. The intuition behind both models and their implementation is explained further in the main.ipynb file.
