The purpose of this project is to employ different techniques to train and evaluate models with unbalanced classes. Using the credit dataset from LendingClub, several algorithms are used to predict credit risk. The performance of these models is compared, and recommendations are made based on the results.
- Python 3.7
- SciPy 1.6.2
- Scikit-learn 0.24.1
- imbalanced-learn 0.8.0
Below are the resampling and ensemble techniques used in this project, along with their individual results.
In random oversampling, instances of the minority class are randomly selected and added to the training set until the majority and minority classes are balanced.
- Balanced Accuracy Score: 0.674
- High-Risk Precision: 0.01
- High-Risk Recall: 0.74
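
To illustrate the mechanism (this is not the code that produced the scores above), a minimal sketch of random oversampling in plain Python might look like the following. The function name `random_oversample`, the toy rows, and the `low_risk`/`high_risk` label strings are assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Copy randomly chosen rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Sampling is done with replacement, so one row may be copied several times.
        for _ in range(target - count):
            X_res.append(rng.choice(rows))
            y_res.append(label)
    return X_res, y_res


# Toy data (hypothetical): 8 low-risk rows and 2 high-risk rows.
X = [[i] for i in range(10)]
y = ["low_risk"] * 8 + ["high_risk"] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Resampling is normally applied to the training set only, so that the test set keeps the original class distribution.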
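Synthetic Minority Oversampling Technique (SMOTE) is an oversampling approach for unbalanced datasets. Instead of duplicating existing minority instances, it creates new synthetic instances by interpolating between a minority instance and its nearest minority-class neighbors.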
- Balanced Accuracy Score: 0.662
- High-Risk Precision: 0.01
- High-Risk Recall: 0.63
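
The core idea of SMOTE, placing a new point on the line segment between a minority sample and one of its nearest minority neighbors, can be sketched in a few lines. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_point`, the choice of `k=2`, and the toy coordinates are assumptions.

```python
import random


def smote_point(sample, minority, k=2, rng=None):
    """Create one synthetic point between `sample` and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    others = [m for m in minority if m is not sample]
    # Squared Euclidean distance gives the same ranking as the true distance.
    neighbors = sorted(
        others, key=lambda m: sum((a - b) ** 2 for a, b in zip(sample, m))
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


# Toy minority class (hypothetical values) with two features per row.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [8.0, 8.0]]
synthetic = smote_point(minority[0], minority, k=2)
print(synthetic)
```

Because the new point always lies between two existing minority samples, the distant row `[8.0, 8.0]` is never used as a neighbor of the first sample here.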
In random undersampling, instances are randomly selected from the majority class and removed until the size of the majority class is reduced (typically to the size of the minority class).
- Balanced Accuracy Score: 0.662
- High-Risk Precision: 0.01
- High-Risk Recall: 0.63
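
A minimal sketch of random undersampling in plain Python follows, assuming the goal is to shrink every class to the size of the smallest one. The function name `random_undersample` and the toy data are hypothetical and are not taken from the project code.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    target = min(Counter(y).values())
    keep = []
    for label in sorted(set(y)):  # sorted for reproducible draws
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sampling is done without replacement, so no row is kept twice.
        keep.extend(rng.sample(idx, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): 8 low-risk rows and 2 high-risk rows.
X = [[i] for i in range(10)]
y = ["low_risk"] * 8 + ["high_risk"] * 2

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

The trade-off is visible even in this toy case: six of the eight majority rows are discarded, so information from the majority class is lost.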
SMOTEENN is a combination of the SMOTE and Edited Nearest Neighbors (ENN) algorithms. This involves a two-step process:
- Oversample the minority class with SMOTE.
- Clean the resulting data with an undersampling strategy. If the two nearest neighbors of a data point belong to two different classes, that data point is dropped.
- Balanced Accuracy Score: 0.644
- High-Risk Precision: 0.01
- High-Risk Recall: 0.72
A Balanced Random Forest is an ensemble method in which each tree is trained on a bootstrap sample that is balanced by randomly undersampling the majority class.
- Balanced Accuracy Score: 0.788
- High-Risk Precision: 0.03
- High-Risk Recall: 0.70
The Easy Ensemble AdaBoost Classifier, also known as EasyEnsemble, is a bag of balanced boosted learners. Each learner is an AdaBoost classifier trained on a sample that is balanced by random undersampling.
- Balanced Accuracy Score: 0.931
- High-Risk Precision: 0.09
- High-Risk Recall: 0.92
The Easy Ensemble AdaBoost Classifier has the highest balanced accuracy score of all the techniques employed in this project. It also has the highest precision, recall, and F1 scores, as shown by the imbalanced classification reports printed for each model.
If one of the models in this project must be used to predict credit risk, the Easy Ensemble AdaBoost Classifier is the recommended choice. However, given the option, none of these models is recommended. Although the EasyEnsemble model performed best in this group, its precision for high-risk data points is still only 0.09. This means that only about 9% of the loans it flags as high-risk are actually high-risk; the rest are false positives. Further research and development are required to create a more robust model for predicting credit risk.
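
To see how a model can combine high recall with very low precision when the positive class is rare, the metrics can be computed by hand from a confusion matrix. The counts below are hypothetical, chosen only to give a precision of 0.09; they are not the project's actual confusion matrix, and `binary_metrics` is a helper name made up for this example.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return precision, recall, specificity, and balanced accuracy for the positive class."""
    precision = tp / (tp + fp)      # share of flagged loans that are truly high-risk
    recall = tp / (tp + fn)         # share of high-risk loans that were flagged
    specificity = tn / (tn + fp)    # share of low-risk loans left unflagged
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, specificity, balanced_accuracy


# Hypothetical counts: 10,000 loans, of which 100 are high-risk.
precision, recall, specificity, balanced_accuracy = binary_metrics(
    tp=90, fp=910, fn=10, tn=8990
)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"specificity={specificity:.3f} balanced_accuracy={balanced_accuracy:.3f}")
```

In this made-up case the model catches 90 of 100 high-risk loans, yet 910 of the 1,000 loans it flags are actually low-risk, which is why balanced accuracy alone can look strong while precision stays poor.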
Author: Michael Mishkanian
For all questions and inquiries, please contact me on LinkedIn.