Despite the steady transformation of the industry over the decades, many banks today with a sizeable customer base are hoping to gain a competitive edge.
While everyone acknowledges that retaining existing customers, and thereby increasing their lifetime value, is important, there is little a bank can do about customer churn when it doesn't see it coming in the first place.
This is where predicting churn at the right time becomes important, especially when clear customer feedback is absent. Early and accurate churn prediction empowers CRM and customer experience teams to be creative and proactive in their engagement with the customer.
In this project, our goal is to use machine learning techniques to predict the probability that a customer will churn.
Predicting Churn for Bank Customers
Libraries:
- scikit-learn (sklearn)
- Matplotlib
- pandas
- seaborn
- NumPy
- SciPy
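For reference, the conventional import lines for this stack (a minimal sketch; the scikit-learn imports appear step by step below):

# Conventional aliases for the libraries listed above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import randint  # used later to sample hyperparameters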
From the chart above, we can see that our target variable is imbalanced.
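As a quick check, the class balance can be inspected directly; a minimal sketch, assuming the data lives in a pandas DataFrame df with a binary target column named "Exited" (hypothetical names):

# Counts and proportions of each class in the (assumed) target column.
print(df["Exited"].value_counts())
print(df["Exited"].value_counts(normalize=True))  # class proportions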
We need to know which features are the most important. To find that out, we trained a model using the Random Forest classifier.
The graph above shows the features ordered from the highest importance value to the lowest.
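A ranking like this can be produced from scikit-learn's impurity-based feature importances; a minimal sketch, assuming X_train is a pandas DataFrame and y_train holds the matching labels (the exact settings behind the figure above may differ):

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Fit a Random Forest and rank features by impurity-based importance.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))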
Since we are modeling a critical business problem, we need a model with the highest performance possible. Here, we will try a few different machine learning algorithms to get an idea of which one performs better, and we will perform an accuracy comparison among them (a sketch of this comparison follows the list). As our problem is a classification problem, the algorithms we are going to choose are as follows:
- K-Nearest Neighbor (KNN)
- Logistic Regression (LR)
- AdaBoost
- Gradient Boosting (GB)
- Random Forest (RF)
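Such a first-pass comparison can be run with cross-validation; a minimal sketch, assuming X_train and y_train already exist and using default model settings for illustration:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Candidate classifiers, all with (near-)default settings for a first pass.
models = {
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(),
    "GB": GradientBoostingClassifier(),
    "RF": RandomForestClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.4f} (+/- {scores.std():.4f})")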
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Search space for AdaBoost: boosting variant and number of estimators.
parameters_list = {"algorithm": ["SAMME", "SAMME.R"],
                   "n_estimators": [10, 50, 100, 200, 400]}
# Randomized search over 10 candidate settings, scored by cross-validated ROC AUC.
GSA = RandomizedSearchCV(AdaBoostClassifier(), param_distributions=parameters_list,
                         n_iter=10, scoring="roc_auc")
GSA.fit(X_train, y_train)
GSA.best_params_, GSA.best_score_

({'n_estimators': 200, 'algorithm': 'SAMME'}, 0.8432902741161931)
from scipy.stats import randint
from sklearn.ensemble import GradientBoostingClassifier

# Search space for Gradient Boosting (in scikit-learn >= 1.1, the
# "deviance" loss was renamed "log_loss").
gb_parameters_list = {"loss": ["deviance", "exponential"],
                      "n_estimators": randint(10, 500),
                      "max_depth": randint(1, 10)}
GBM = RandomizedSearchCV(GradientBoostingClassifier(), param_distributions=gb_parameters_list,
                         n_iter=10, scoring="roc_auc")
GBM.fit(X_train, y_train)
GBM.best_params_, GBM.best_score_

({'loss': 'exponential', 'max_depth': 3, 'n_estimators': 241}, 0.8576619853133595)
'AdaBoostClassifier': 0.8442783055508478
'GradientBoostingClassifier': 0.873749653401012
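The metric behind these two numbers is not stated here; assuming they are ROC AUC scores for the tuned models on the held-out test set (an assumption on our part), they could be reproduced along these lines:

from sklearn.metrics import roc_auc_score

# Hypothetical reconstruction: score each tuned search object on the test set.
for name, search in [("AdaBoostClassifier", GSA), ("GradientBoostingClassifier", GBM)]:
    proba = search.predict_proba(X_test)[:, 1]  # predicted churn probability
    print(name, roc_auc_score(y_test, proba))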
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score
import numpy as np

# Soft-voting ensemble of the two tuned models, weighting Gradient Boosting
# twice as heavily as AdaBoost.
voting_model = VotingClassifier(estimators=[("gb", GBM_fit_transformed),
                                            ("ADA", GSA_fit_transformed)],
                                voting="soft", weights=[2, 1])
votingModel = voting_model.fit(X_train_transform, y_train)
# Predicted churn probability (positive class) for each test sample.
test_labels_voting = votingModel.predict_proba(np.array(X_test_transform))[:, 1]
votingModel.score(X_test_transform, y_test)  # mean accuracy on the test set

0.8732

roc_auc_score(y_test, test_labels_voting)

0.8744660402064695
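For intuition, voting="soft" with weights=[2, 1] takes a weighted average of the two models' predicted probabilities, counting Gradient Boosting twice; a toy example with made-up numbers:

# Toy per-sample churn probabilities, for illustration only.
p_gb, p_ada = 0.80, 0.50
p_voting = (2 * p_gb + 1 * p_ada) / (2 + 1)   # weighted average of probabilities
print(p_voting)  # 0.7 -> the ensemble's churn probability for this sample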
Project steps:
- Data Imputation
- Handling Outliers
- Feature Engineering
- Classification Models
- Voting
If you have any feedback, please reach out at pradnyapatil671@gmail.com.
I am an AI enthusiast and a data science & ML practitioner.