Welcome to the Credit Risk Classification repository! In this project, you'll embark on an exciting journey to develop a model that can assess and predict loan risk. Your mission is to leverage various data science techniques to analyze historical lending activity from a peer-to-peer lending services company. Ultimately, you'll build a powerful model capable of identifying the creditworthiness of borrowers.
Lending institutions face the ongoing challenge of assessing loan applicants' creditworthiness to minimize financial risk. In this challenge, you will work with a rich dataset containing historical lending data. Your primary objectives include:
- Data Exploration: Dive into the dataset to gain a deep understanding of its structure, features, and any potential data quality issues.
- Feature Engineering: Create relevant features that can help the model make accurate predictions about loan risk.
- Model Training: Implement and train machine learning models using a variety of techniques, such as classification algorithms, to predict loan risk.
- Model Evaluation: Assess the performance of your models using appropriate evaluation metrics and techniques, such as cross-validation and hyperparameter tuning.
- Interpretability: Aim to create models that provide insights into why a particular loan application is deemed risky or not.
The instructions are divided into the following subsections:
- Split the Data into Training and Testing Sets
- Create a Logistic Regression Model with the Original Data
- Write a Credit Risk Analysis Report
Open the starter code notebook and use it to complete the following steps:
- Read the `lending_data.csv` data from the Resources folder into a Pandas DataFrame.
- Create the labels set (`y`) from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns. A value of 0 in the “loan_status” column means that the loan is healthy. A value of 1 means that the loan has a high risk of defaulting.
- Split the data into training and testing datasets by using `train_test_split`.
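The three steps above can be sketched as a single helper. This is a minimal sketch, not the starter notebook's code: the function name `load_and_split`, the `random_state` value, and the relative path in the usage note are assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def load_and_split(csv_path, random_state=1):
    """Read the lending data and return X_train, X_test, y_train, y_test.

    `load_and_split` is a hypothetical helper, not part of the starter code.
    """
    # Read the CSV file into a Pandas DataFrame.
    df = pd.read_csv(Path(csv_path))

    # Labels: 0 means a healthy loan, 1 means a high risk of defaulting.
    y = df["loan_status"]

    # Features: every column except the label column.
    X = df.drop(columns="loan_status")

    # With default settings, train_test_split holds out 25% of the rows
    # for testing. random_state (an assumed value) makes the split repeatable.
    return train_test_split(X, y, random_state=random_state)
```

Assuming the notebook sits next to the Resources folder, a call would look like `X_train, X_test, y_train, y_test = load_and_split("Resources/lending_data.csv")`.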
Use your knowledge of logistic regression to complete the following steps:
- Fit a logistic regression model by using the training data (`X_train` and `y_train`).
- Save the predictions for the testing data labels by using the testing feature data (`X_test`) and the fitted model.
- Evaluate the model’s performance by doing the following:
  - Generate a confusion matrix.
  - Print the classification report.
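The modeling steps above could look roughly like the following. This is a sketch under stated assumptions: the helper name `fit_and_evaluate` is hypothetical, and the `random_state` and `max_iter` settings are illustrative choices rather than values taken from the starter notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix


def fit_and_evaluate(X_train, X_test, y_train, y_test, random_state=1):
    """Fit a logistic regression model and report its test performance.

    `fit_and_evaluate` is a hypothetical helper, not part of the starter code.
    """
    # max_iter is raised above the default as a precaution against
    # convergence warnings on unscaled features (an assumption).
    model = LogisticRegression(random_state=random_state, max_iter=1000)

    # Fit the model by using the training data.
    model.fit(X_train, y_train)

    # Save the predictions for the testing data labels.
    predictions = model.predict(X_test)

    # Rows of the confusion matrix are actual labels, columns are predictions.
    matrix = confusion_matrix(y_test, predictions)

    # Per-class precision, recall, and F1, plus overall accuracy, as text.
    report = classification_report(y_test, predictions)

    print(matrix)
    print(report)
    return model, predictions, matrix, report
```

Returning the matrix and report as well as printing them keeps the values available for the written analysis.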
- Analysis Overview: The purpose of this analysis was to identify whether borrowers should be classified as high-risk or safe. This was done by building a model using a dataset of lending activity. In the classification report, 0 represents a healthy loan (the borrower is expected to pay back the loan) while 1 represents a high-risk loan.
- The results: Using a bulleted list, describe the accuracy score, the precision score, and recall score of the machine learning model.
  - Accuracy score: The accuracy score was 0.99, or 99%, meaning the model's predictions matched the actual loan status for 99% of the loans it was evaluated on.
  - Precision score: The ratio of correct positive predictions to total positive predictions. Of all the borrowers the model predicted to be high-risk, 85% were actually high-risk. Of all the borrowers the model predicted to be safe, 100% were actually safe.
  - Recall score: Of all the borrowers that were actually high-risk, the model correctly identified 91%. Of all the borrowers that were actually safe, the model correctly identified 99%.
- A summary: Summarize the results from the machine learning model. I would recommend that the company use the model because it predicted the outcome of repayment with 99% accuracy.
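The accuracy, precision, and recall scores in the report all come from the four cells of the confusion matrix. The sketch below shows that arithmetic for the high-risk (1) class. The helper name `summarize_confusion` and the counts passed to it are hypothetical, chosen only to make the calculation easy to follow; they are not this project's results.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Return (accuracy, precision, recall) for the positive (1) class.

    For binary labels 0 and 1, scikit-learn's
    confusion_matrix(y_true, y_pred).ravel() yields the counts in the
    order tn, fp, fn, tp.
    """
    total = tn + fp + fn + tp
    # Share of all loans whose predicted status matched the actual status.
    accuracy = (tp + tn) / total
    # Of the loans predicted high-risk, the share that really were high-risk.
    precision = tp / (tp + fp)
    # Of the loans that really were high-risk, the share the model caught.
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Hypothetical counts for illustration only.
accuracy, precision, recall = summarize_confusion(tn=890, fp=10, fn=10, tp=90)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.98 precision=0.90 recall=0.90
```

Swapping the roles of the two classes (treating 0 as the positive class) gives the precision and recall for healthy loans in the same way.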
Data for this dataset was generated by edX Boot Camps LLC, and is intended for educational purposes only.