Exploratory Data Analysis and Multinomial Logistic Regression from scratch

ThePush/dslr

The Sorting Hat 🎩
EDA and Multinomial Logistic Regression 📈

Objectives:

This is a four-part project. The first three parts cover exploratory data analysis. The last part implements a class that performs classification via multinomial logistic regression (one-vs-all). The batch size is an option that selects the optimization algorithm: stochastic, mini-batch, or batch gradient descent.

The goal of the project is to recreate the Sorting Hat from the Harry Potter series. We are provided with datasets that contain features such as the grades of students from different houses. Using these features, we make inferences on a new dataset to determine which student belongs to which house.

Exploratory Data Analysis

describe.py:

Custom implementation of the pandas library's describe() function.
I added the mean absolute deviation (MAD) and the coefficient of variation (CV) to the output.
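As a rough illustration of the two extra statistics, here is a minimal sketch in plain Python. The function names and the sample grades are made up for this example and are not taken from describe.py. It assumes missing values have already been dropped and uses the sample standard deviation (n - 1 in the denominator), which is what pandas' describe() reports.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    # Sample standard deviation (n - 1 denominator), as in pandas' describe().
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))


def mean_absolute_deviation(values):
    # Average absolute distance of each value from the mean.
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


def coefficient_of_variation(values):
    # Standard deviation relative to the mean; undefined when the mean is 0.
    return std(values) / mean(values)


# Made-up grades, for illustration only.
grades = [2.0, 4.0, 4.0, 6.0]
print("MAD:", mean_absolute_deviation(grades))
print("CV:", coefficient_of_variation(grades))
```

The CV is unitless, which makes it handy for comparing the spread of courses graded on very different scales.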

histogram.py:

This script tries to answer the question "Which Hogwarts course has a homogeneous score distribution between all four houses?".
We look for the course with the lowest standard deviation between houses, after normalizing all the grades.

The script also displays those distributions, one histogram per course.
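One way to read "lowest standard deviation between houses" is sketched below. This is an interpretation, not necessarily what histogram.py does: grades are min-max normalized per course, the mean grade of each house is computed, and the course whose house means are the least spread out wins. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def min_max_normalize(values):
    # Rescale to [0, 1]; assumes the course has at least two distinct grades.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def spread_between_houses(houses, grades):
    # Standard deviation of the per-house mean of the normalized grades.
    by_house = defaultdict(list)
    for house, grade in zip(houses, min_max_normalize(grades)):
        by_house[house].append(grade)
    house_means = [sum(g) / len(g) for g in by_house.values()]
    center = sum(house_means) / len(house_means)
    return math.sqrt(
        sum((m - center) ** 2 for m in house_means) / len(house_means)
    )


def most_homogeneous_course(houses, courses):
    # The course whose house means differ the least.
    return min(courses, key=lambda name: spread_between_houses(houses, courses[name]))


# Made-up toy data: two houses, two courses.
houses = ["Gryffindor", "Gryffindor", "Slytherin", "Slytherin"]
courses = {
    "Course A": [10.0, 12.0, 50.0, 52.0],  # houses clearly separated
    "Course B": [10.0, 50.0, 12.0, 52.0],  # houses overlap
}
print(most_homogeneous_course(houses, courses))
```

Comparing only the house means ignores the shape of each distribution, so the histograms remain the better visual check.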

scatter_plot.py:

This script tries to answer the question "What are the two features that are similar?".
We compare the features pair by pair and see that the two similar features are Astronomy and Defense Against the Dark Arts, because of the pattern their scatter plot follows:

python3 scatter_plot.py <course1> <course2>

You can check that by calculating Pearson's correlation coefficient for each pair of features:

python3 pearson_correlation.py
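For reference, Pearson's correlation coefficient can be computed from scratch as in the sketch below. The function name and the sample values are illustrative and are not taken from pearson_correlation.py. A coefficient close to +1 or -1 indicates a strong linear relationship between the two features.

```python
import math


def pearson(xs, ys):
    # Covariance of the two series divided by the product of their spreads.
    # Assumes equal-length series with no missing values and nonzero variance.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: ys is an exact negative multiple of xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [-2.0, -4.0, -6.0, -8.0]
print(pearson(xs, ys))
```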

pair_plot.py:

This script displays a pair plot: the histogram of every feature plus the scatter plots that compare features pair by pair.


Multinomial Logistic Regression (One vs All)

Usage:

1/ To generate the model, plot the results and save the model in the theta.csv file, use the logreg_train.py script:

python3 logreg_train.py <datasets/dataset_train.csv>

Loss function evolution with mini-batch gradient descent (batch size of 64):

2/ Then use the model to predict classes. The script loads the model generated by the logreg_train.py script and saved in the theta.csv file.
It runs inference on datasets/dataset_test.csv and saves the results in a houses.csv file.

python3 logreg_predict.py

3/ To evaluate the model, use the accuracy_score.py script:

python3 accuracy_score.py <houses.csv> <datasets/dataset_truth.csv>

Accuracy Score: 0.99
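The three steps above (train, predict, evaluate) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the class implemented in this repository: every function name, the hyperparameters and the toy data are hypothetical, features are assumed to be already scaled, and nothing is read from or written to CSV files.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probability(theta, x):
    # theta[0] is the bias term; theta[1:] are the feature weights.
    return sigmoid(theta[0] + sum(w * v for w, v in zip(theta[1:], x)))


def train_binary(X, y, lr, epochs, batch_size, seed=0):
    # batch_size=1 gives stochastic gradient descent, batch_size=len(X)
    # gives batch gradient descent, anything in between is mini-batch.
    rng = random.Random(seed)
    theta = [0.0] * (len(X[0]) + 1)
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad = [0.0] * len(theta)
            for i in batch:
                error = probability(theta, X[i]) - y[i]
                grad[0] += error
                for j, v in enumerate(X[i]):
                    grad[j + 1] += error * v
            theta = [t - lr * g / len(batch) for t, g in zip(theta, grad)]
    return theta


def train_one_vs_all(X, labels, lr, epochs, batch_size):
    # One binary classifier per class: "this class" versus "all the others".
    return {
        c: train_binary(X, [1.0 if l == c else 0.0 for l in labels],
                        lr, epochs, batch_size)
        for c in sorted(set(labels))
    }


def predict(models, x):
    # Pick the class whose classifier is the most confident.
    return max(models, key=lambda c: probability(models[c], x))


def accuracy(predicted, truth):
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Made-up toy data: three well-separated classes in two dimensions.
X = [[0.0, 0.0], [0.1, 0.1], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["A", "A", "B", "B", "C", "C"]
models = train_one_vs_all(X, labels, lr=1.0, epochs=1000, batch_size=2)
predictions = [predict(models, x) for x in X]
print("Accuracy Score:", accuracy(predictions, labels))
```

Here the accuracy is measured on the training points themselves, which only shows that the sketch learns. A real evaluation compares predictions against held-out labels, as accuracy_score.py does with the truth file.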
