BuggerBugs/SC1015-Mini-Project

SC1015 Mini Project - Song Popularity Predictor 🎼🎼

About

This is our Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence), which focuses on songs from the Song Popularity Dataset on Kaggle. Here is an overview of the source code:

  1. Data Preparation
  2. Data Visualization
  3. Regression Models
  4. Classification Models
  5. Results Comparison

Contributors

  • @s-27-b (Bhargavi) - Data Visualisation, Regression Models
  • @BuggerBugs (Qi Yang) - Data Preparation, Classification Models, Results Comparison

Problem Definition

  • Musical artists are often limited in the number of songs they can release on an album. Being able to predict which songs will do better can help them decide which ones to include to make the album a hit!
  • Our aim is to find the best model for predicting how popular a song could be (its popularity category) based on its audio features: song duration, acousticness, danceability, energy, instrumentalness, key, liveness, loudness, audio mode, speechiness, tempo, time signature and audio valence.

Models Used

The original song popularity scores range from 0 to 100. We grouped them into 4 categories (0-3) and want to predict each song's popularity category. Here are the popularity scores and their corresponding popularity categories:

Categorising table
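
The mapping from a score to a category can be sketched as a simple binning step. The table's actual cut-offs are not reproduced in this text, so the edges below (25, 50, 75) and the function name `popularity_category` are hypothetical placeholders, not the project's values.

```python
from bisect import bisect_right

# Hypothetical cut-offs: scores below 25 fall in category 0, 25-49 in 1,
# 50-74 in 2, and 75-100 in 3. Replace with the real table values.
CATEGORY_EDGES = [25, 50, 75]


def popularity_category(score: float) -> int:
    """Map a 0-100 popularity score to a popularity category 0-3."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    # bisect_right counts how many edges are <= score, which is the category.
    return bisect_right(CATEGORY_EDGES, score)


scores = [0, 24, 25, 60, 75, 100]
categories = [popularity_category(s) for s in scores]
print(categories)
```

With these placeholder edges, a score that sits exactly on an edge is assigned to the higher category.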

We used 2 approaches to find the best model.

First, we used regression to predict the numerical popularity score, converted the predicted score into a popularity category (0-3), and checked it against the song's actual popularity category.

The models we used for regression are:

  1. Multi-Variate Linear Regression
  2. Stepwise Linear Regression
  3. Support Vector Regression
  4. K-Nearest Neighbours Regression
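
The evaluation step of the regression approach (convert predicted scores to categories, then compare with the actual categories) might look like the sketch below. It leaves out model fitting and takes predicted scores as plain numbers. The cut-offs, the clipping of out-of-range predictions, and the helper names `to_category` and `category_accuracy` are all assumptions made for illustration.

```python
from bisect import bisect_right

CATEGORY_EDGES = [25, 50, 75]  # hypothetical cut-offs, not the project's


def to_category(score: float) -> int:
    """Clip a (possibly out-of-range) score to 0-100, then bin it into 0-3."""
    clipped = min(max(score, 0.0), 100.0)
    return bisect_right(CATEGORY_EDGES, clipped)


def category_accuracy(predicted_scores, actual_scores) -> float:
    """Fraction of songs whose predicted category matches the actual one."""
    if len(predicted_scores) != len(actual_scores) or not actual_scores:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        to_category(p) == to_category(a)
        for p, a in zip(predicted_scores, actual_scores)
    )
    return hits / len(actual_scores)


# Made-up regression outputs and true scores, for illustration only.
predicted = [48.2, 55.0, 61.3, 104.0]
actual = [52, 58, 80, 90]
accuracy = category_accuracy(predicted, actual)
print(accuracy)
```

In this toy data, the first prediction is off by only a few points but lands on the other side of a cut-off, so it counts as a miss. This is one way a regressor with a small numerical error can still score poorly on category accuracy.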

Our second approach was to first convert all the song popularity scores into categories, then train classification models on these categories to predict the popularity category directly.

The models we used for classification are:

  1. Decision Trees
  2. Random Forest
  3. Artificial Neural Networks

Conclusions

  • The best model for predicting song popularity on this dataset is Random Forest (50.6% accuracy).
  • Regression models and non-oversampled classification models are not suitable for predicting songs in the lowest and highest popularity categories in this dataset.
  • It is difficult to build a high-accuracy model that predicts which songs people will like or dislike.
  • SMOTE oversampling can reduce model bias towards popularity category 2 (the majority class), but it reduces overall accuracy for models trained on this dataset.
  • A larger and less imbalanced dataset may be needed to reduce bias and the noise introduced by oversampling, and ultimately to train better-performing models that artists can use to decide which songs to include in their albums and which to leave out.
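
To illustrate why oversampling can add noise, here is a simplified, standard-library sketch of the interpolation idea behind SMOTE: each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours. This is not the implementation used in the project, and the function name `smote_like_oversample` and the toy feature values are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # All other minority samples, sorted by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy two-feature minority class (e.g. danceability, acousticness).
minority = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.9, 0.1)]
new_points = smote_like_oversample(minority, n_new=6)
print(len(new_points))
```

Because the synthetic points are interpolated rather than observed, they may fall in regions where real songs of that category do not exist, which is one plausible source of the accuracy drop noted above.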

What did we learn from this project?

  • One-hot encoding and the dummy variable trap
  • Different models for regression: Stepwise, KNN, SVM
  • SMOTE oversampling technique
  • Different models for classification: Random Forest, Neural Networks
  • Hyperparameter optimization with Keras Tuner
  • Creating repositories and collaborating on GitHub
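
As a small illustration of the first point, the sketch below one-hot encodes a categorical feature with and without dropping the first level. The helper `one_hot` and the sample time signature values are made up for this example and are not taken from the project code.

```python
def one_hot(values, drop_first=False):
    """Return (levels, rows), where each row is a list of 0/1 indicators."""
    levels = sorted(set(values))
    if drop_first:
        # The dropped level becomes the all-zeros baseline row.
        levels = levels[1:]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


time_signatures = [3, 4, 4, 5]  # made-up sample values

full_levels, full_rows = one_hot(time_signatures)
levels, rows = one_hot(time_signatures, drop_first=True)

print(full_levels, full_rows)
print(levels, rows)
```

In the full encoding every row sums to 1, so the indicator columns are perfectly collinear with a model's intercept term. That is the dummy variable trap. Dropping one level removes the redundancy without losing information.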
