I went through all the fundamental steps needed to preprocess a large dataset and then fit a Linear Regression model to it.
Deep Neural Networks were then used to beat the Mean Absolute Error of the baseline model.
The dataset can be downloaded here.
Use the Pandas Profiling notebook only if you want to learn Pandas Profiling; otherwise, use the "01_Linear_Regression.ipynb" file.
This notebook is divided into 5 portions:
I used the built-in Pandas Profiling to generate a profiling report in a Colab notebook.
Feature selection was done based on missing values, feature correlation and Backward Elimination. All these methods are described briefly.
Missing values were filled with the column mean, and categorical columns were encoded as integers using cat.codes.
A simple visualization of the value distribution of each column.
A Linear Regression model was fit to the preprocessed data and then evaluated using Mean Absolute Error and Mean Squared Error.
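
The preprocessing and evaluation steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the synthetic data, the column names (area, rooms, city, price), the 80/20 split, and the preprocess helper are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (all column names are hypothetical).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "area": rng.normal(100.0, 20.0, n),
    "rooms": rng.integers(1, 6, n).astype(float),
    "city": rng.choice(["A", "B", "C"], n),
})
df["price"] = 3.0 * df["area"] + 10.0 * df["rooms"] + rng.normal(0.0, 5.0, n)
df.loc[::10, "area"] = np.nan  # inject missing values into one feature


def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column mean and integer-encode the rest."""
    out = frame.copy()
    # Fill missing numeric values with the mean of each column.
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    # Encode every non-numeric column as integer codes via the .cat accessor.
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = out[col].astype("category").cat.codes
    return out


clean = preprocess(df)
X = clean.drop(columns="price")
y = clean["price"]

# Hold out 20% of the rows for evaluation (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
print(f"MAE: {mae:.2f}, MSE: {mse:.2f}")
```

Note that cat.codes assigns arbitrary integers to categories, which a linear model will treat as ordered; one-hot encoding is a common alternative when that ordering is not meaningful.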