MODULE 17
RESOURCES:
SOFTWARE: Jupyter Notebook, Pandas, Anaconda, Python, Visual Studio Code
DATA: credit_risk_resampling_starter_code.ipynb, credit_risk_ensemble_starter_code.ipynb, LoanStats-2019Q1
OVERVIEW: The purpose of this challenge was to assist Jill in assessing credit card risk with various machine learning algorithms, in order to determine which one is the most appropriate. This was accomplished by preprocessing and preparing the data, then performing statistical measures and ML on a dataset obtained from LendingClub, and entailed the following steps (a minimal code sketch of the load/split steps follows the list):
- load the data into a Jupyter Notebook
- train and test the data with imbalanced-learn and scikit-learn
- make predictions against the testing set
- evaluate the model with a confusion matrix
- calculate the balanced accuracy score
- generate a classification report for precision, recall, and F1
- perform oversampling with the Naive Random and SMOTE algorithms
- perform undersampling with ClusterCentroids
- combine over- and under-sampling with SMOTEENN
- compare two ensemble ML models, BalancedRandomForestClassifier and EasyEnsembleClassifier
- lastly, evaluate the performance of each and determine whether they should be used to predict credit risk
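The first two steps can be sketched minimally as below. The CSV file name and the "loan_status" target column are illustrative assumptions, not values taken from the starter notebooks.

```python
# Minimal load/split sketch; the CSV name and "loan_status" target column
# are assumptions for illustration, not taken from the starter notebooks.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("LoanStats_2019Q1.csv")

# One-hot encode string features, then separate features from the target.
X = pd.get_dummies(df.drop(columns="loan_status"))
y = df["loan_status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```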
RESULTS:
DELIVERABLE 1: USE RESAMPLING MODELS TO PREDICT CREDIT RISK
For this Deliverable, both the imbalanced-learn and scikit-learn libraries were utilized to evaluate three ML models via resampling (RandomOverSampler, SMOTE, ClusterCentroids) to see which is the most appropriate for predicting credit card risk. After resampling, the counts of the target classes were scrutinized, a logistic regression classifier was trained, a balanced accuracy score was calculated, and a confusion matrix and an imbalanced classification report were generated for analysis; this pattern is sketched in code after the figure list below. The statistical measures calculated are listed and defined as follows (with a short computation sketch after the list):
- accuracy: how close a result is to the correct value ((TP+TN)/Total)
- precision: how reliable a positive classification is (TP/(TP+FP))
- recall: the capacity of a classifier to find all positive values (TP/(TP+FN))
- F1: the harmonic mean of precision and sensitivity (2*(precision*sensitivity)/(precision+sensitivity))
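As a quick illustration, this toy sketch derives all four measures from confusion-matrix counts with scikit-learn; the two label arrays are made-up examples.

```python
# Toy example: compute the four measures above from confusion-matrix counts.
# For binary labels, scikit-learn's confusion_matrix returns [[TN, FP], [FN, TP]].
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # made-up ground truth
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # made-up predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # also called sensitivity
f1 = 2 * (precision * recall) / (precision + recall)
print(accuracy, precision, recall, f1)
```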
FIGURE 1: Load Data
FIGURE 2: Split Into Train/Test
FIGURE 3: Naive Random OverSampling
FIGURE 4: SMOTE OverSampling
FIGURE 5: UnderSampling (ClusterCentroids)
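A condensed sketch of the Deliverable 1 pattern: resample, inspect the new class counts, fit a logistic regression, and evaluate. It assumes the X_train/X_test split from the earlier sketch, not the exact starter-code parameters.

```python
# Deliverable 1 pattern, condensed: resample -> inspect counts -> fit -> score.
# Assumes X_train, X_test, y_train, y_test from the earlier split sketch.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import ClusterCentroids
from imblearn.metrics import classification_report_imbalanced
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

for sampler in (RandomOverSampler(random_state=1),
                SMOTE(random_state=1),
                ClusterCentroids(random_state=1)):
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(type(sampler).__name__, Counter(y_res))  # resampled class counts
    model = LogisticRegression(random_state=1).fit(X_res, y_res)
    y_pred = model.predict(X_test)
    print(balanced_accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report_imbalanced(y_test, y_pred))
```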
DELIVERABLE 2: USE THE SMOTEENN ALGORITHM TO PREDICT CREDIT RISK
With the aforementioned libraries, this Deliverable addresses the combination of over- and under-sampling with SMOTEENN and compares the results with those obtained in Deliverable 1, in order to determine whether this algorithm is better at predicting credit card risk (a short sketch follows the figure below).
FIGURE 6: Combo (Over/Under) Sampling
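Only the sampler changes from the Deliverable 1 pattern; a minimal sketch under the same assumed train/test split:

```python
# SMOTEENN combines SMOTE oversampling with Edited Nearest Neighbours cleaning.
from imblearn.combine import SMOTEENN
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

X_res, y_res = SMOTEENN(random_state=1).fit_resample(X_train, y_train)
model = LogisticRegression(random_state=1).fit(X_res, y_res)
print(balanced_accuracy_score(y_test, model.predict(X_test)))
```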
DELIVERABLE 3: USE ENSEMBLE CLASSIFIERS TO PREDICT CREDIT RISK
The last Deliverable implements the imblearn.ensemble library to train two ensemble models, BalancedRandomForestClassifier and EasyEnsembleClassifier, and compares their results with those of the prior Deliverables (sketched after the figures below).
FIGURE 7: Load Data
FIGURE 8: Split into Train/Test
FIGURE 9: Balanced Random Forest Classifier
FIGURE 10: Easy Ensemble AdaBoost Classifier
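Both ensemble classifiers balance classes internally, so they fit directly on the imbalanced training data. A minimal sketch, where n_estimators=100 is an assumption rather than a value from the starter code:

```python
# imblearn ensemble classifiers; each handles class balancing internally,
# so they are fit on the original (imbalanced) training data.
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier
from sklearn.metrics import balanced_accuracy_score

for clf in (BalancedRandomForestClassifier(n_estimators=100, random_state=1),
            EasyEnsembleClassifier(n_estimators=100, random_state=1)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__,
          balanced_accuracy_score(y_test, clf.predict(X_test)))
```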
BULLETED RESULTS
NAIVE RANDOM OVERSAMPLING:
-accuracy: 0.6456=64.56%
-precision: 0.99=99%
-recall: 0.68=68%
-F1: 0.81=81%
SMOTE OVERSAMPLING:
-accuracy: 0.6234=62.34%
-precision: 0.99=99%
-recall: 0.64=64%
-F1: 0.77=77%
UNDERSAMPLING:
-accuracy: 0.6234=62.34%
-precision: 0.99=99%
-recall: 0.45=45%
-F1: 0.62=62%
SMOTEENN:
-accuracy: 0.5177=51.77%
-precision: 0.99=99%
-recall: 0.58=58%
-F1: 0.73=73%
BALANCED RANDOM FOREST CLASSIFIER:
-accuracy: 0.7877=78.77%
-precision: 0.99=99%
-recall: 0.91=91%
-F1: 0.95=95%
EASY ENSEMBLE ADABOOST CLASSIFIER:
-accuracy: 0.9254=92.54%
-precision: 0.99=99%
-recall: 0.94=94%
-F1: 0.97=97%
SUMMARY:
RESULTS IN TABULAR FORMAT:
MODEL                            | ACCURACY | PRECISION | RECALL | F1
---------------------------------|----------|-----------|--------|----
Naive Random Oversampling        | 64.56%   | 99%       | 68%    | 81%
SMOTE Oversampling               | 62.34%   | 99%       | 64%    | 77%
Undersampling (ClusterCentroids) | 62.34%   | 99%       | 45%    | 62%
SMOTEENN                         | 51.77%   | 99%       | 58%    | 73%
Balanced Random Forest           | 78.77%   | 99%       | 91%    | 95%
Easy Ensemble AdaBoost           | 92.54%   | 99%       | 94%    | 97%
Therefore, when considering the prior figures, lists, and table, it can be ascertained that the Easy Ensemble AdaBoost Classifier is the most appropriate machine learning model for predicting credit card risk, as it outperformed or matched the others in ALL statistical categories. Its accuracy score was 92.54% (the result is close to the true value); all six ML models demonstrated 99% precision (high reliability); its recall was 94% (high sensitivity); and a high F1 score of 97% was observed (a strong balance between precision and sensitivity).
The conclusive ML model recommendation is the Easy Ensemble AdaBoost Classifier.