Credit Risk Analysis

Project Overview

The purpose of this project is to train and evaluate models on unbalanced classes using several techniques. Using the credit dataset from LendingClub, several algorithms are applied to predict credit risk. The performance of these models is compared, and recommendations are made based on the results.

Software

Python 3.7
SciPy 1.6.2
Scikit-learn 0.24.1
imbalanced-learn 0.8.0

Results

Below are the resampling techniques and ensemble models used in this project, along with their individual results.

Oversampling: Naive Random Oversampling

In random oversampling, instances of the minority class are randomly selected and added to the training set until the majority and minority classes are balanced.
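The idea can be sketched in plain Python. This is a minimal illustration on hypothetical toy data, not the code used in this project; imbalanced-learn provides the same technique as RandomOverSampler, and the function name random_oversample below is made up for this example.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement until this class reaches the target size.
        for _ in range(target - count):
            i = rng.choice(idx)
            X_res.append(X[i])
            y_res.append(label)
    return X_res, y_res

# Hypothetical toy data: six low-risk rows and two high-risk rows.
X = [[1.0], [1.1], [0.9], [1.2], [1.3], [0.8], [5.0], [5.2]]
y = ["low_risk"] * 6 + ["high_risk"] * 2
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Because the added rows are exact copies, the resampled training set contains no new information about the minority class, only more weight on the rows that already exist.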

Balanced Accuracy Score: 0.674
High-Risk Precision: 0.01
High-Risk Recall: 0.74

Oversampling: SMOTE

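Synthetic Minority Oversampling Technique (SMOTE) is an oversampling approach for unbalanced datasets. Instead of duplicating existing minority instances, it creates new synthetic ones by interpolating between a minority instance and one of its nearest minority-class neighbors.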
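The core of SMOTE is an interpolation between a minority point and one of its nearest minority-class neighbors. The sketch below shows that step on hypothetical 2-D points; the function name smote_sample and the parameters k and n_new are assumptions made for illustration, and the imbalanced-learn SMOTE class handles many details this sketch ignores.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Hypothetical minority-class points.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
for point in smote_sample(minority):
    print([round(v, 2) for v in point])
```

Every synthetic point lies on a line segment between two real minority points, so the new rows stay inside the region the minority class already occupies.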

Balanced Accuracy Score: 0.662
High-Risk Precision: 0.01
High-Risk Recall: 0.63

Undersampling: Random Undersampling

In random undersampling, instances are randomly selected from the majority class and removed until the size of the majority class is reduced (typically to the size of the minority class).
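A minimal sketch of this procedure on hypothetical toy data follows. It is not the project's code; imbalanced-learn offers the technique as RandomUnderSampler, and the function name random_undersample is made up for this example.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement; the smallest class keeps all of its rows.
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: six low-risk rows and two high-risk rows.
X = [[1.0], [1.1], [0.9], [1.2], [1.3], [0.8], [5.0], [5.2]]
y = ["low_risk"] * 6 + ["high_risk"] * 2
X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

The trade-off is that the discarded majority rows are lost to the model, which matters when the minority class is very small.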

Balanced Accuracy Score: 0.662
High-Risk Precision: 0.01
High-Risk Recall: 0.63

Combination Sampling: SMOTEENN

SMOTEENN is a combination of the SMOTE and Edited Nearest Neighbors (ENN) algorithms. It is a two-step process:

1. Oversample the minority class with SMOTE.
2. Clean the resulting data with an undersampling strategy. If the two nearest neighbors of a data point belong to two different classes, that data point is dropped.

Balanced Accuracy Score: 0.644
High-Risk Precision: 0.01
High-Risk Recall: 0.72

Balanced Random Forest Classifier

A Balanced Random Forest is an ensemble method that randomly under-samples each bootstrap sample, so that every tree in the forest is trained on balanced classes.
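The balancing step can be illustrated by drawing the row indices for each tree. This is a sketch under assumptions: the function name balanced_bootstrap is hypothetical, the labels are toy data, and the exact sampling order inside imbalanced-learn's BalancedRandomForestClassifier may differ.

```python
import random
from collections import Counter

def balanced_bootstrap(y, minority_label, seed=0):
    """Row indices for one tree: a minority bootstrap plus an equal-sized majority draw."""
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(y) if lab == minority_label]
    majority = [i for i, lab in enumerate(y) if lab != minority_label]
    n = len(minority)
    # Sample both classes with replacement, n rows each.
    sample = [rng.choice(minority) for _ in range(n)]
    sample += [rng.choice(majority) for _ in range(n)]
    return sample

# Hypothetical labels: 20 low-risk rows and 3 high-risk rows.
y = ["low_risk"] * 20 + ["high_risk"] * 3
for tree in range(3):
    idx = balanced_bootstrap(y, "high_risk", seed=tree)
    print(tree, Counter(y[i] for i in idx))
```

Each tree sees a different slice of the majority class, so across many trees most majority rows are still used even though each individual tree is trained on a small balanced sample.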

Balanced Accuracy Score: 0.788
High-Risk Precision: 0.03
High-Risk Recall: 0.70

Easy Ensemble AdaBoost Classifier

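The Easy Ensemble classifier, also known as EasyEnsemble, is a bag of balanced boosted learners: an ensemble of AdaBoost learners, each trained on a different balanced sample of the data. The balancing is achieved by random under-sampling.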
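The structure of such an ensemble can be sketched in a few lines. To keep the example short, a simple threshold rule stands in for the AdaBoost learner, so this shows only the bagging-over-balanced-subsets idea, not boosting. All names here (fit_stump, fit_easy_ensemble, predict) and the 1-D data are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Stand-in learner: threshold halfway between the class means (1 = high risk)."""
    pos = [x for x, lab in zip(xs, ys) if lab == 1]
    neg = [x for x, lab in zip(xs, ys) if lab == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def fit_easy_ensemble(xs, ys, n_learners=5, seed=0):
    """Train one learner per balanced subset: all minority rows plus a random majority draw."""
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(ys) if lab == 1]
    majority = [i for i, lab in enumerate(ys) if lab == 0]
    thresholds = []
    for _ in range(n_learners):
        # Under-sample the majority class to the size of the minority class.
        subset = minority + rng.sample(majority, len(minority))
        thresholds.append(fit_stump([xs[i] for i in subset],
                                    [ys[i] for i in subset]))
    return thresholds

def predict(thresholds, x):
    """Majority vote across the learners."""
    votes = sum(1 for t in thresholds if x > t)
    return 1 if votes * 2 > len(thresholds) else 0

# Hypothetical 1-D data: 20 low-risk rows (label 0) and 3 high-risk rows (label 1).
xs = [0.1 * i for i in range(20)] + [5.0, 5.5, 6.0]
ys = [0] * 20 + [1] * 3
model = fit_easy_ensemble(xs, ys)
print(predict(model, 0.5), predict(model, 5.8))  # prints: 0 1
```

Each learner is trained on only six rows, but because every learner draws a different majority subset, the ensemble as a whole still covers much of the majority class.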

Balanced Accuracy Score: 0.931
High-Risk Precision: 0.09
High-Risk Recall: 0.92

Summary

The Easy Ensemble AdaBoost Classifier has the highest Balanced Accuracy Score of all the techniques employed in this project. EasyEnsemble also has the highest precision, recall, and F1 scores, as shown by the imbalanced classification reports printed for each model.

If one of the models in this project must be used to predict credit risk, the Easy Ensemble AdaBoost Classifier is the recommended choice. However, given the option, none of these models is recommended. Although the EasyEnsemble model performed best in this group, its precision for high-risk data points is still only 0.09. This means that very few of its high-risk predictions are correct; the large majority are false positives. Further research and development are required to create a more robust model that can predict credit risk.
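To make the precision concern concrete, the sketch below computes the three reported metrics from confusion-matrix counts. The counts are hypothetical, chosen only to show how high recall and very low precision can coexist when the high-risk class is rare; they are not taken from this project's confusion matrices. The function name high_risk_metrics is also made up for this example.

```python
def high_risk_metrics(tp, fp, fn, tn):
    """Precision, recall, and balanced accuracy with high risk as the positive class."""
    precision = tp / (tp + fp)    # share of high-risk predictions that are correct
    recall = tp / (tp + fn)       # share of actual high-risk loans that are caught
    specificity = tn / (tn + fp)  # share of actual low-risk loans that are cleared
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, balanced_accuracy

# Hypothetical counts: 100 high-risk loans among 17,100 test rows.
precision, recall, bal_acc = high_risk_metrics(tp=92, fp=930, fn=8, tn=16070)
print(round(precision, 2), round(recall, 2), round(bal_acc, 2))  # prints: 0.09 0.92 0.93
```

In this illustration the model catches 92 of 100 high-risk loans, yet it also flags 930 low-risk loans, so only about 9 in 100 high-risk flags are correct even though the balanced accuracy looks strong.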

Author: Michael Mishkanian
For all questions and inquiries, please contact me on LinkedIn.
