Performed data analysis and data visualisation on a subset of the Lending Club dataset, then created a logistic regression model to assess whether a new customer is likely to meet their debt obligations (pay back the loan).
- Introduction
- Dataset
- Exploratory Data Analysis
- Data PreProcessing
- Categorical Variables and Dummy Variables
- Scaling and Train Test Split
- Creating a Model
- Training the Model
- Evaluation on Test Data
- Predicting on a New Customer
Credit analysis is a type of analysis an investor or bond portfolio manager performs on companies or other debt-issuing entities to measure the entity's ability to meet its debt obligations. It seeks to identify the appropriate level of default risk associated with investing in that particular entity. One of the objectives of this notebook is to show, step by step, how to visualize the dataset and assess whether a new customer is likely to pay back the loan.
LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission, and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.
Given historical data on loans, including whether or not the borrower defaulted (charge-off), we can build a model that predicts whether a borrower will pay back their loan. This way, when we get a new potential customer in the future, we can assess whether they are likely to pay back the loan.
The following questions will be answered throughout the Kernel:
- Which features are available in the dataset?
- What is the distribution of numerical feature values across the samples?
- What is the length of the dataframe?
- What is the total count of missing values per column?
- How many unique employment job titles are there?
- Do you wonder how lending companies choose whether to give you money or not?
- How does a lending company decide how much money to give you?
- Would you offer this person a loan?
- Did this person actually end up paying back their loan?
We will be using a subset of the LendingClub DataSet obtained from Kaggle: https://www.kaggle.com/wordsforthewise/lending-club
There are many LendingClub data sets on Kaggle. Here is the information on this particular data set:
Get an understanding of which variables are important, view summary statistics, and visualize the data.
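A first look at the data can be sketched with a few pandas calls. The snippet below is only an illustration: the small frame is made up to stand in for the LendingClub subset, and the `emp_title` column name is an assumption about how the job title field is labelled.

```python
import pandas as pd

# Made-up rows standing in for the LendingClub subset (assumption: the real
# data would be read from a local copy of the Kaggle file with pd.read_csv).
df = pd.DataFrame({
    "loan_amnt": [10000.0, 8000.0, 15600.0, 7200.0],
    "installment": [329.48, 265.68, 506.97, None],
    "emp_title": ["Marketing", "Credit analyst", None, "Marketing"],  # assumed column name
})

n_rows = len(df)                          # length of the dataframe
missing_per_column = df.isnull().sum()    # count of missing values per column
n_job_titles = df["emp_title"].nunique()  # unique job titles, missing values excluded
summary = df.describe()                   # summary statistics for the numeric columns

print(n_rows)
print(missing_per_column)
print(n_job_titles)
print(summary)
```

On the real data the same calls answer the questions about dataframe length, missing values per column, and the number of unique job titles.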
We use the Pearson correlation coefficient to examine the strength and direction of the linear relationship between two continuous variables.
The correlation coefficient can range in value from −1 to +1. The larger the absolute value of the coefficient, the stronger the relationship between the variables. For the Pearson correlation, an absolute value of 1 indicates a perfect linear relationship. A correlation close to 0 indicates no linear relationship between the variables.
The sign of the coefficient indicates the direction of the relationship. If both variables tend to increase or decrease together, the coefficient is positive, and the line that represents the correlation slopes upward. If one variable tends to increase as the other decreases, the coefficient is negative, and the line that represents the correlation slopes downward.
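To make the range and sign of the coefficient concrete, here is a small sketch on made-up columns (none of these names come from the dataset). It relies on the fact that `DataFrame.corr()` computes the Pearson coefficient by default.

```python
import pandas as pd

# Made-up data: three columns with a known relationship to x.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df["rises_with_x"] = 2 * df["x"] + 1    # perfect upward line
df["falls_with_x"] = -3 * df["x"] + 20  # perfect downward line
df["u_shaped"] = (df["x"] - 3) ** 2     # depends on x, but not linearly

# Pearson correlation of every column with x.
corr_with_x = df.corr()["x"]
print(corr_with_x.round(3))
```

The upward line gives a coefficient of +1, the downward line gives −1, and the U-shaped column gives a coefficient of 0 even though it is fully determined by `x`, which is why a value near 0 only rules out a linear relationship. On a frame that also contains text columns, recent pandas versions need `df.corr(numeric_only=True)`.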
We can see a strong correlation between loan_amnt and installment (the monthly payment owed by the borrower if the loan originates).
This is an imbalanced classification problem, because far more borrowers fully paid their loans than failed to pay them back. We can expect to do very well in terms of accuracy, so precision and recall are the metrics we should really use to evaluate the model. The loan amount distribution shows spikes at round figures such as ten thousand dollars, which indicates that certain amounts are effectively standard loans.
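The point about accuracy on imbalanced data can be illustrated with a small, dependency-free sketch. The 80/20 split and both sets of predictions are made up, and treating charged-off loans as the class of interest (label 1) is an assumption made for the example.

```python
def charged_off_metrics(y_true, y_pred):
    """Accuracy, plus precision and recall for the charged-off class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # charge-offs caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # paid loans flagged wrongly
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # charge-offs missed
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Precision and recall are reported as 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up labels: 80 fully paid loans (0) and 20 charged-off loans (1).
y_true = [0] * 80 + [1] * 20

# A trivial "model" that predicts every loan is fully paid.
always_paid = [0] * 100

# A model that flags 5 paid loans by mistake and catches 10 of the 20 charge-offs.
some_flags = [1] * 5 + [0] * 75 + [1] * 10 + [0] * 10

print(charged_off_metrics(y_true, always_paid))
print(charged_off_metrics(y_true, some_flags))
```

The trivial model reaches 80% accuracy while catching no charge-offs at all, and the second model is only five points more accurate despite catching half of them. Accuracy alone hides that difference; precision and recall expose it.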
Essentially, this shows the percentage of charged-off loans per grade. It increases as the letter grade moves further down the alphabet, that is, as the grade gets worse. Better grades are bluer and worse grades are redder.
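The percentage of charged-off loans per grade can be computed with a groupby. This is a minimal sketch on made-up rows; the column names `grade` and `loan_status` and the status labels `Fully Paid` and `Charged Off` are assumptions about how the dataset encodes these fields.

```python
import pandas as pd

# Made-up loans: four per grade, with more charge-offs in the worse grades.
loans = pd.DataFrame({
    "grade": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "loan_status": [
        "Fully Paid", "Fully Paid", "Fully Paid", "Fully Paid",
        "Fully Paid", "Fully Paid", "Fully Paid", "Charged Off",
        "Fully Paid", "Fully Paid", "Charged Off", "Charged Off",
    ],
})

# True where the loan was charged off; the mean of a boolean column is a rate.
charged_off = loans["loan_status"] == "Charged Off"
charge_off_pct = charged_off.groupby(loans["grade"]).mean() * 100

print(charge_off_pct)
```

In this toy data the rate rises from 0% for grade A to 50% for grade C, the same direction as the trend described above.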
The main ML technique used is logistic regression, a form of regression analysis that data scientists use to evaluate credit risk, i.e. whether the customer will be able to pay back the loan.
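The overall workflow (train/test split, scaling, fitting, evaluation, and scoring a new customer) might look roughly like the sketch below. It is not the notebook's code: it assumes scikit-learn's `LogisticRegression`, and the two features, their ranges, and the rule that generates the labels are all synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000

# Two synthetic numeric features (hypothetical names, not taken from the dataset).
interest_rate = rng.uniform(5, 30, n)
debt_to_income = rng.uniform(0, 40, n)
X = np.column_stack([interest_rate, debt_to_income])

# Synthetic rule: the chance of repaying falls as both features rise.
logit = 4.0 - 0.15 * interest_rate - 0.05 * debt_to_income
p_repaid = 1.0 / (1.0 + np.exp(-logit))
y = (rng.uniform(size=n) < p_repaid).astype(int)  # 1 = repaid, 0 = charged off

# Hold out 20% of the rows for evaluation, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101, stratify=y
)

# The scaler is fitted on the training rows only, then reused for new data.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Precision and recall per class on the held-out rows.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Score a hypothetical new customer: probability of class 1 (repaid).
new_customer = np.array([[12.5, 18.0]])
prob_repaid = model.predict_proba(new_customer)[0, 1]
print(round(prob_repaid, 3))
```

Wrapping the scaler and the model in one pipeline is a design choice that keeps test rows out of the scaling statistics and applies the same scaling automatically when a new customer is scored.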
The remaining content is presented as code alongside the text in the notebook file (Credit-Risk-Analysis.ipynb).