-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathdata_clean_and_feat_engg.Rmd
218 lines (158 loc) · 8.9 KB
/
data_clean_and_feat_engg.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
### Data Preparation
#### Load necessary packages
```{r setup, include=FALSE}
library(flexdashboard)
library(plotly)
library(dplyr)
library(data.table)
library(broom)
library(scales) # package to convert number to currency
library(stringr) # package to handle strings
library(lubridate) # package to handle dates
library(VIM) # package to report descriptive statistics
library(highcharter) # package to generate plots
```
```{}
Run this section of code if you downloaded data using `download_data.R` script in this project. Else skip directly to line 163 to run load pre-cleaned data.
```
#### Load data
```{r}
filepath <- paste(getwd(),"/Data/",sep = "")
accepted_loan_df <- readRDS(paste(filepath,"accepted_loan_data.Rds",sep = ""))
```
Data dimensions and column types
```{r}
dim(accepted_loan_df)
str(accepted_loan_df)
```
There are a lot of feature that can be dropped and pick only some due to limit scope of analysis in this example.
#### Cleaning
#### Let's filter to a few columns
```{r}
accepted_loan_subset_df<-subset(accepted_loan_df,
select= c(id,
issue_d,
loan_amnt,
annual_inc,
title,
dti,
open_acc,
total_acc,
term,
addr_state,
emp_length,
policy_code,
earliest_cr_line,
grade,
loan_status,
purpose,
int_rate))
rm(list = ls()[!ls() %in% c("accepted_loan_subset_df", "filepath")])
```
#### Check missing values
```{r}
summary(aggr(accepted_loan_subset_df))
```
#### Feature Engineering
I utilized the full data that I downloaded by logging in andcalculated Risk Score from fico score columns.
```{r}
df_from_full_data <- fread(paste(getwd(),"/Downloaded Data/accepted_loan_full_data.csv", sep = ""),
select = c("fico_range_high","fico_range_low"))
# Average of FICO Score Range - Match column of rejected loan
accepted_loan_subset_df$Risk_Score <- (df_from_full_data$fico_range_low + df_from_full_data$fico_range_high)/2
rm(list = ls()[!ls() %in% c("accepted_loan_subset_df", "filepath")])
```
Based on the description provided in Lending Club data dictionary, it appears `loan_status` is going to be an important variable in starting our analysis. So let's take a look at levels in this column.
```{r}
accepted_loan_subset_df %>% group_by(loan_status) %>% summarise(count = n())
```
Let's correct this data further.
```{r}
## Recode Grade to calculate borrower score and add it to new column newGrade
accepted_loan_subset_df$newGrade <- accepted_loan_subset_df$grade
accepted_loan_subset_df$newGrade <- as.numeric(recode(accepted_loan_subset_df$newGrade, "A" = 1, "B" = 0.8, "C" = 0.7, "D" = 0.6, "E" = 0.5, "F" = 0.4, "G" = 0.3))
accepted_loan_subset_df$loan_status[accepted_loan_subset_df$loan_status=="Does not meet the credit policy. Status:Charged Off"]<-"Charged Off"
accepted_loan_subset_df$loan_status[accepted_loan_subset_df$loan_status=="Does not meet the credit policy. Status:Fully Paid"]<-"Fully Paid"
table(accepted_loan_subset_df$loan_status, accepted_loan_subset_df$term)
```
We can create three major categories in this column, mainly loans that are `Default`, `Paid off` or they are `Current`.
`Charged Off`, `Default`, `In Grace Period`, `Late (16-30 days)`, and `Late (31-120 days)` values are categorized to a single category: `Default`. The data is moderately imbalance.
```{r}
accepted_loan_subset_df$loan_status[accepted_loan_subset_df$loan_status=="Charged Off"]<-"Default"
accepted_loan_subset_df$loan_status[accepted_loan_subset_df$loan_status=="In Grace Period"]<-"Default"
accepted_loan_subset_df$loan_status[accepted_loan_subset_df$loan_status=="Late (16-30 days)"]<-"Default"
accepted_loan_subset_df$loan_status[accepted_loan_subset_df$loan_status=="Late (31-120 days)"]<-"Default"
table(accepted_loan_subset_df$loan_status, accepted_loan_subset_df$term)
```
Several features of the data set were inherently related, lending themselves naturally to feature engineering. For example, I used `lubridate` alongside the loan issue date and the date of the borrower’s first credit line to calculate the length of each borrower’s credit history.
```{r}
# Add a day (01) to make it a full date column as its easy to operate later
accepted_loan_subset_df$issue_d <- as.character(accepted_loan_subset_df$issue_d)
accepted_loan_subset_df$issue_d <- paste(accepted_loan_subset_df$issue_d, "-01", sep = "")
accepted_loan_subset_df$issue_d <- parse_date_time(accepted_loan_subset_df$issue_d, "myd")
# accepted_loan_subset_df$month=str_split_fixed(accepted_loan_subset_df$issue_d, "-", 2)[,1] # Extract Month Ch
# accepted_loan_subset_df$year=str_split_fixed(accepted_loan_subset_df$issue_d, "-", 2)[,2] # Extract Year
accepted_loan_subset_df$month <- month(accepted_loan_subset_df$issue_d, label = T)
accepted_loan_subset_df$year <- year(accepted_loan_subset_df$issue_d)
accepted_loan_subset_df$earliest_cr_line <- as.character(accepted_loan_subset_df$earliest_cr_line)
accepted_loan_subset_df$earliest_cr_line <- paste(accepted_loan_subset_df$earliest_cr_line, "-01", sep = "")
accepted_loan_subset_df$earliest_cr_line <- parse_date_time(accepted_loan_subset_df$earliest_cr_line, "myd")
accepted_loan_subset_df$time_since_first_credit <- accepted_loan_subset_df$issue_d - accepted_loan_subset_df$earliest_cr_line
accepted_loan_subset_df$time_since_first_credit <- as.numeric(accepted_loan_subset_df$time_since_first_credit)
accepted_loan_subset_df <- accepted_loan_subset_df %>% filter(time_since_first_credit > 0)
head(accepted_loan_subset_df$time_since_first_credit)
## Remove rows that do not have any Grade
accepted_loan_subset_df <- accepted_loan_subset_df %>%
filter(grade != "")
## Convert int_rate column to Numeric
accepted_loan_subset_df$InterestRate <- substr(unlist(strsplit(accepted_loan_subset_df$int_rate, "/")),1,nchar(unlist(strsplit(accepted_loan_subset_df$int_rate, "/")))-1) %>% as.numeric()
```
Another engineered feature, current.account.ratio, was calculated by dividing open.acc (the number of open credit lines the borrower had at the time of the loan) by total.acc (the total number of credit lines the borrower has had).
```{r}
accepted_loan_subset_df$CurrentAccountRatio <- accepted_loan_subset_df$open_acc/accepted_loan_subset_df$total_acc
```
```{r}
# Save all data in environment to disk
#save.image()
```
```{r}
# Save the pre-selected dataset
save(accepted_loan_subset_df, file = "./Data/accepted_loan_subset_data.RData")
# Load above saved data
# load("~/R/myR/Loan Default Prediction on Lending Club data/Data/accepted_loan_subset_data.RData")
```
### Exploratory Analysis
#### Load already cleanrd up data.
```{r}
load("Data/accepted_loan_subset_data.RData")
```
To start examining the data, I began by investigating the distributions of each numeric feature via histograms, segmented out by loan outcome.
The interest rate analysis was of particular interest. Examining the histogram, fully paid loans are clearly clumped at lower interest rates, while charged off loans have a more even distribution, tending towards higher interest rates more frequently than the fully paid ones do. This result makes intuitive sense, as higher interest rates are assigned to riskier investments.
```{r}
loan_heatmap <- accepted_loan_subset_df %>%
group_by(loan_status, year) %>%
summarize(loan_amount = median(loan_amnt))
hchart(loan_heatmap, "heatmap", hcaes(x = loan_status, y = year, value = loan_amount))
```
```{r}
```
```{r}
hchart(density(accepted_loan_subset_df$loan_amnt), type = "area", color = "#B71C1C", name = "Loan Status Data Distribution")
```
All issued loan data was downloaded from a registered Lending Club account. The data can be downloaded from the Lending Club statistics webpage. A rapid growth of issued loans can be observed in recent years. The data for 2017 is only upto 2017Q3.
```{r}
issued_loans_by_year <- accepted_loan_subset_df %>%
group_by(year) %>%
summarise(TotalLoan = sum(as.numeric(loan_amnt),na.rm = T),
NumberofLoans = n()) %>%
as.data.frame()
issued_loans_by_year$year[1] <- NA
issued_loans_by_year <- issued_loans_by_year[complete.cases(issued_loans_by_year),]
yr <- levels(factor(issued_loans_by_year$year))
plot_ly(issued_loans_by_year,x=~year,
y=~TotalLoan,
text = dollar_format()(issued_loans_by_year$TotalLoan),
type='bar') %>%
layout(title = "All Years")
```
#### Descriptive Statistics