
# Credit Risk Analysis

## Project Overview

The purpose of this project is to employ different techniques to train and evaluate models with unbalanced classes. Using the credit dataset from LendingClub, several different algorithms are used to predict credit risk. The performance of these models is compared, and recommendations are made based on the results.

## Software

- Python 3.7
- SciPy 1.6.2
- Scikit-learn 0.24.1
- imbalanced-learn 0.8.0

## Results

Below are the resampling and ensemble techniques used in this project, with their individual results.

### Oversampling: Naive Random Oversampling

In random oversampling, instances of the minority class are randomly selected and added to the training set until the majority and minority classes are balanced.

- Balanced Accuracy Score: 0.674
- High-Risk Precision: 0.01
- High-Risk Recall: 0.74

*(Figure: imbalanced classification report for naive random oversampling)*
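
A minimal sketch of this step with imbalanced-learn's `RandomOverSampler`. The synthetic dataset and the `LogisticRegression` model are illustrative stand-ins, since the README does not reproduce the project's code:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Synthetic stand-in for the LendingClub data: ~1% minority (high-risk) class.
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Randomly duplicate minority-class rows until both classes are the same size.
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
print(Counter(y_resampled))  # both classes now equally represented

# Fit a classifier on the balanced training set, score on the untouched test set.
model = LogisticRegression(max_iter=200, random_state=1)
model.fit(X_resampled, y_resampled)
print(balanced_accuracy_score(y_test, model.predict(X_test)))
```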

### Oversampling: SMOTE

Synthetic Minority Oversampling Technique (SMOTE) is an oversampling approach for unbalanced datasets: instead of duplicating existing rows, it creates new synthetic minority-class samples by interpolating between a minority instance and its nearest minority-class neighbors.

- Balanced Accuracy Score: 0.662
- High-Risk Precision: 0.01
- High-Risk Recall: 0.63

*(Figure: imbalanced classification report for SMOTE)*
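
The same pattern with `SMOTE` swapped in as the oversampler, again on synthetic stand-in data:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Same synthetic stand-in for the LendingClub data as above.
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=1)

# SMOTE synthesizes new minority rows by interpolating between a minority
# sample and its nearest minority-class neighbors.
X_resampled, y_resampled = SMOTE(random_state=1).fit_resample(X, y)
print(Counter(y_resampled))
```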

### Undersampling: Random Undersampling

In random undersampling, instances are randomly selected from the majority class and removed until the size of the majority class is reduced (typically to the size of the minority class).

- Balanced Accuracy Score: 0.662
- High-Risk Precision: 0.01
- High-Risk Recall: 0.63

*(Figure: imbalanced classification report for random undersampling)*
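
A sketch with `RandomUnderSampler`, assuming the same synthetic stand-in data:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=1)

# Randomly drop majority-class rows until the majority class matches
# the size of the minority class.
rus = RandomUnderSampler(random_state=1)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(Counter(y_resampled))  # both classes reduced to the minority count
```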

### Combination Sampling: SMOTEENN

SMOTEENN is a combination of the SMOTE and Edited Nearest Neighbors (ENN) algorithms. It involves a two-step process (see the sketch after the results below):

1. Oversample the minority class with SMOTE.
2. Clean the resulting data with an undersampling strategy: if a data point's class disagrees with the class of its nearest neighbors, that data point is dropped.

- Balanced Accuracy Score: 0.644
- High-Risk Precision: 0.01
- High-Risk Recall: 0.72

*(Figure: imbalanced classification report for SMOTEENN)*
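
A sketch of the combined approach with imbalanced-learn's `SMOTEENN`, on the same synthetic stand-in data (the project's actual preprocessing of the LendingClub set is not shown here):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=1)

# Step 1: oversample the minority class with SMOTE.
# Step 2: clean the result with Edited Nearest Neighbors, dropping points
# whose neighbors disagree with their class.
X_resampled, y_resampled = SMOTEENN(random_state=1).fit_resample(X, y)
print(Counter(y_resampled))  # roughly balanced; ENN removes some rows
```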

### Balanced Random Forest Classifier

A Balanced Random Forest is an ensemble method in which each tree is trained on a bootstrap sample that is randomly under-sampled to balance the classes.

- Balanced Accuracy Score: 0.788
- High-Risk Precision: 0.03
- High-Risk Recall: 0.70

*(Figure: imbalanced classification report for the balanced random forest)*
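
A sketch with `BalancedRandomForestClassifier`; the resampling happens inside the estimator, so no separate resampling step is needed. The data is again a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.ensemble import BalancedRandomForestClassifier

X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each tree is grown on a bootstrap sample that is randomly under-sampled
# so both classes are equally represented.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
brf.fit(X_train, y_train)
print(balanced_accuracy_score(y_test, brf.predict(X_test)))
```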

### Easy Ensemble AdaBoost Classifier

EasyEnsemble is a bag of balanced boosted learners: an ensemble of AdaBoost classifiers, each trained on a sample that is balanced by randomly under-sampling the majority class.

- Balanced Accuracy Score: 0.931
- High-Risk Precision: 0.09
- High-Risk Recall: 0.92

*(Figure: imbalanced classification report for the Easy Ensemble AdaBoost classifier)*
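
A sketch with `EasyEnsembleClassifier`, including the imbalanced classification report mentioned in the Summary. The synthetic data and `n_estimators=100` are assumptions, not taken from the project:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.metrics import classification_report_imbalanced

X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# An ensemble of AdaBoost learners, each fit on a balanced bootstrap sample
# produced by randomly under-sampling the majority class.
eec = EasyEnsembleClassifier(n_estimators=100, random_state=1)
eec.fit(X_train, y_train)

# The per-class precision/recall/F1 report used throughout this project.
print(classification_report_imbalanced(y_test, eec.predict(X_test)))
```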

## Summary

The Easy Ensemble AdaBoost Classifier has the highest balanced accuracy score of all the techniques employed in this project. EasyEnsemble also has the highest precision, recall, and F1 scores, as shown by the imbalanced classification reports printed for each model.

If one of the models in this project must be used to predict credit risk, the Easy Ensemble AdaBoost Classifier is recommended. Given the option, however, none of these models should be used. Although the EasyEnsemble model performed best in this group, its precision for high-risk data points is still only 0.09, meaning that very few of its high-risk predictions are actually correct and the model produces many false positives. Further research and development are required to create a more robust model that can predict credit risk.

Author: Michael Mishkanian
For all questions and inquiries, please contact me on LinkedIn.