The goal of this project is to predict customer churn* from the given features.
*Churn definition: the number of customers who stopped using a company's product or service during a certain time frame.
train.csv - the training set
test.csv - the test set
Both the train and test datasets are made of 20 features, including the target feature ("churn").
The script is written entirely on the Apache Spark framework.
After a basic exploratory analysis (checking for NA values, class balance, etc.), we obtain the following correlation matrix:
We can see from this matrix that some features are very highly correlated with each other, such as total_day_minutes x total_day_charge (probably because the customer is charged per minute of use).
The first step is to remove these features, resulting in the following correlation matrix:
Next, two specific features are analysed in more depth, and new level features are derived from them based on their clear relationship with the target:
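As an illustration of what a level feature can look like, the sketch below maps a numeric column to a small set of ordered levels. It is plain Python rather than Spark code, and everything in it is assumed for the example: the cut points, the level names and the sample values are hypothetical and are not taken from the project.

```python
from bisect import bisect_right

# Hypothetical cut points and level names (assumptions for illustration only).
# In practice they would be chosen by looking at how the churn rate changes
# across the range of the feature.
CUT_POINTS = [2, 4]
LEVELS = ["low", "medium", "high"]


def to_level(value, cut_points=CUT_POINTS, levels=LEVELS):
    """Return the level whose interval contains `value`.

    With cut points [2, 4] the intervals are:
    value < 2 -> "low", 2 <= value < 4 -> "medium", value >= 4 -> "high".
    """
    return levels[bisect_right(cut_points, value)]


# Made-up sample values for a numeric feature.
sample_values = [0, 1, 2, 3, 4, 7]
sample_levels = [to_level(v) for v in sample_values]
```

In a Spark script the same mapping could be expressed with conditional column expressions or a bucketing transformer; the plain-Python version only shows the idea.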
Finally, the features least related to the target are removed (correlation less than 0.1).
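The two correlation-based filters described above can be sketched in plain Python as follows. This is a minimal illustration, not the project's Spark code. Several details are assumptions: the 0.95 cut-off for "very high" correlation, keeping the first feature of each redundant pair, comparing absolute correlation values, the `unrelated_feature` column, and all the toy numbers. Only the 0.1 target threshold and the two `total_day_*` column names come from the text.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_features(columns, threshold=0.95):
    """For each feature pair with |r| above `threshold`, mark the second one.

    The 0.95 threshold and the keep-the-first rule are assumptions.
    """
    drop = set()
    for a, b in combinations(columns, 2):
        if a in drop or b in drop:
            continue
        if abs(pearson(columns[a], columns[b])) > threshold:
            drop.add(b)
    return drop


def weak_features(columns, target, threshold=0.1):
    """Features whose |r| with the target is below `threshold`."""
    return {name for name, values in columns.items()
            if abs(pearson(values, target)) < threshold}


# Toy data, made up for the example.
features = {
    "total_day_minutes": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
    "total_day_charge": [17.0, 34.0, 51.0, 68.0, 85.0, 102.0],
    "unrelated_feature": [5.0, 1.0, 4.0, 4.0, 1.0, 5.0],  # hypothetical column
}
churn = [0, 0, 0, 1, 1, 1]

# Step 1: drop one feature from each highly correlated pair.
redundant = redundant_features(features)
remaining = {k: v for k, v in features.items() if k not in redundant}

# Step 2: drop features that are barely correlated with the target.
weak = weak_features(remaining, churn)
selected = [k for k in remaining if k not in weak]
```

On this toy data the charge column is flagged as redundant because it is an exact multiple of the minutes column, and the hypothetical unrelated column is dropped because its correlation with the target is zero.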
The data produced by the previous script is loaded and the predictions are made.
A Gradient Boosting Classifier was used, giving the following result on the test dataset:
Gradient Boosting Classifier accuracy: 0.91
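For reference, accuracy is the fraction of test rows whose predicted label equals the true label. The sketch below shows that computation in plain Python; the labels and predictions are made up for the example and are not the project's results. In Spark ML the same quantity can be obtained from `MulticlassClassificationEvaluator` with `metricName="accuracy"`, although the text does not say which evaluator the project uses.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels and predictions (1 = churned, 0 = stayed).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# 9 of the 10 predictions match the labels.
score = accuracy(labels, predictions)
```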