MODULE 17
RESOURCES:
SOFTWARE: Jupyter Notebook, Pandas, Anaconda, Python, Visual Studio Code
DATA: credit_risk_resampling_starter_code.ipynb, credit_risk_ensemble_starter_code.ipynb, LoanStats-2019Q1
OVERVIEW: The purpose of this challenge was to assist Jill in assessing credit card risk with various machine learning algorithms, in order to determine which one is the most appropriate. This was accomplished by preprocessing and preparing the data, then performing statistical measures and ML on a dataset obtained from LendingClub, and entailed the following steps (a minimal code sketch of the load/split steps follows the list):
- load the data into a Jupyter Notebook
- train and test the data with imbalanced-learn and scikit-learn
- make predictions against the testing set
- evaluate the model with a confusion matrix
- calculate the balanced accuracy score
- generate a classification report for precision, recall, and F1
- perform oversampling with the Naive Random and SMOTE algorithms
- perform undersampling with ClusterCentroids
- combine over- and under-sampling with SMOTEENN
- compare two ensemble ML models, BalancedRandomForestClassifier and EasyEnsembleClassifier
- lastly, evaluate the performance of each and determine whether they should be used to predict credit risk
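The first two steps can be sketched minimally as below. The CSV file name and the "loan_status" target column are illustrative assumptions, not values taken from the starter notebooks.

```python
# Minimal load/split sketch; the CSV name and "loan_status" target column
# are assumptions for illustration, not taken from the starter notebooks.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("LoanStats_2019Q1.csv")

# One-hot encode string features, then separate features from the target.
X = pd.get_dummies(df.drop(columns="loan_status"))
y = df["loan_status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```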
RESULTS:
DELIVERABLE 1: USE RESAMPLING MODELS TO PREDICT CREDIT RISK
For this Deliverable, both the imbalanced-learn and scikit-learn libraries were utilized to evaluate three ML models via resampling (RandomOverSampler, SMOTE, ClusterCentroids) to see which is the most appropriate for predicting credit card risk. After resampling, the counts of the target classes were scrutinized, a logistic regression classifier was trained, a balanced accuracy score was calculated, and a confusion matrix and an imbalanced classification report were generated for analysis; this pattern is sketched in code after the figure list below. The statistical measures calculated are listed and defined as follows (with a short computation sketch after the list):
- accuracy: how close a result is to the correct value ((TP+TN)/Total)
- precision: how reliable a positive classification is (TP/(TP+FP))
- recall: the capacity of a classifier to find all positive values (TP/(TP+FN))
- F1: the harmonic mean of precision and sensitivity (2*(precision*sensitivity)/(precision+sensitivity))
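As a quick illustration, this toy sketch derives all four measures from confusion-matrix counts with scikit-learn; the two label arrays are made-up examples.

```python
# Toy example: compute the four measures above from confusion-matrix counts.
# For binary labels, scikit-learn's confusion_matrix returns [[TN, FP], [FN, TP]].
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # made-up ground truth
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # made-up predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # also called sensitivity
f1 = 2 * (precision * recall) / (precision + recall)
print(accuracy, precision, recall, f1)
```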
FIGURE 1: Load Data
FIGURE 2: Split Into Train/Test
FIGURE 3: Naive Random OverSampling
FIGURE 4: SMOTE OverSampling
FIGURE 5: UnderSampling (ClusterCentroids)
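A condensed sketch of the Deliverable 1 pattern: resample, inspect the new class counts, fit a logistic regression, and evaluate. It assumes the X_train/X_test split from the earlier sketch, not the exact starter-code parameters.

```python
# Deliverable 1 pattern, condensed: resample -> inspect counts -> fit -> score.
# Assumes X_train, X_test, y_train, y_test from the earlier split sketch.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import ClusterCentroids
from imblearn.metrics import classification_report_imbalanced
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

for sampler in (RandomOverSampler(random_state=1),
                SMOTE(random_state=1),
                ClusterCentroids(random_state=1)):
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(type(sampler).__name__, Counter(y_res))  # resampled class counts
    model = LogisticRegression(random_state=1).fit(X_res, y_res)
    y_pred = model.predict(X_test)
    print(balanced_accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report_imbalanced(y_test, y_pred))
```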
DELIVERABLE 2: USE THE SMOTEENN ALGORITHM TO PREDICT CREDIT RISK
With the aforementioned libraries, this Deliverable addresses the combination of over- and under-sampling with SMOTEENN and compares the results with those obtained in Deliverable 1, in order to determine whether this algorithm is better at predicting credit card risk (a short sketch follows the figure below).
FIGURE 6: Combo (Over/Under) Sampling
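Only the sampler changes from the Deliverable 1 pattern; a minimal sketch under the same assumed train/test split:

```python
# SMOTEENN combines SMOTE oversampling with Edited Nearest Neighbours cleaning.
from imblearn.combine import SMOTEENN
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

X_res, y_res = SMOTEENN(random_state=1).fit_resample(X_train, y_train)
model = LogisticRegression(random_state=1).fit(X_res, y_res)
print(balanced_accuracy_score(y_test, model.predict(X_test)))
```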
DELIVERABLE 3: USE ENSEMBLE CLASSIFIERS TO PREDICT CREDIT RISK
The last Deliverable implements the imblearn.ensemble library to train two ensemble models, BalancedRandomForestClassifier and EasyEnsembleClassifier, and compares their results with those of the prior Deliverables (sketched after the figures below).
FIGURE 7: Load Data
FIGURE 8: Split into Train/Test
FIGURE 9: Balanced Random Forest Classifier
FIGURE 10: Easy Ensemble AdaBoost Classifier
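Both ensemble classifiers balance classes internally, so they fit directly on the imbalanced training data. A minimal sketch, where n_estimators=100 is an assumption rather than a value from the starter code:

```python
# imblearn ensemble classifiers; each handles class balancing internally,
# so they are fit on the original (imbalanced) training data.
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier
from sklearn.metrics import balanced_accuracy_score

for clf in (BalancedRandomForestClassifier(n_estimators=100, random_state=1),
            EasyEnsembleClassifier(n_estimators=100, random_state=1)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__,
          balanced_accuracy_score(y_test, clf.predict(X_test)))
```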
BULLETED RESULTS
NAIVE RANDOM OVERSAMPLING:
-accuracy: 0.6456=64.56%
-precision: 0.99=99%
-recall: 0.68=68%
-F1: 0.81=81%
SMOTE OVERSAMPLING:
-accuracy: 0.6234=62.34%
-precision: 0.99=99%
-recall: 0.64=64%
-F1: 0.77=77%
UNDERSAMPLING:
-accuracy: 0.6234=62.34%
-precision: 0.99=99%
-recall: 0.45=45%
-F1: 0.62=62%
SMOTEENN:
-accuracy: 0.5177=51.77%
-precision: 0.99=99%
-recall: 0.58=58%
-F1: 0.73=73%
BALANCED RANDOM FOREST CLASSIFIER:
-accuracy: 0.7877=78.77%
-precision: 0.99=99%
-recall: 0.91=91%
-F1: 0.95=95%
EASY ENSEMBLE ADABOOST CLASSIFIER:
-accuracy: 0.9254=92.54%
-precision: 0.99=99%
-recall: 0.94=94%
-F1: 0.97=97%
SUMMARY:
RESULTS IN TABULAR FORMAT:
MODEL                            | ACCURACY | PRECISION | RECALL | F1
---------------------------------|----------|-----------|--------|----
Naive Random Oversampling        | 64.56%   | 99%       | 68%    | 81%
SMOTE Oversampling               | 62.34%   | 99%       | 64%    | 77%
Undersampling (ClusterCentroids) | 62.34%   | 99%       | 45%    | 62%
SMOTEENN                         | 51.77%   | 99%       | 58%    | 73%
Balanced Random Forest           | 78.77%   | 99%       | 91%    | 95%
Easy Ensemble AdaBoost           | 92.54%   | 99%       | 94%    | 97%
Therefore, when considering the prior figures, lists, and table, it can be ascertained that the Easy Ensemble AdaBoost Classifier is the most appropriate machine learning model for predicting credit card risk, as it outperformed or matched the others in ALL statistical categories. Its accuracy score was 92.54% (the result is close to the true value); all six ML models demonstrated 99% precision (high reliability); its recall was 94% (high sensitivity); and a high F1 score of 97% was observed (a strong balance between precision and sensitivity).
The conclusive ML model recommendation is the Easy Ensemble AdaBoost Classifier.