---
title: "Predicting lending rates with Databricks and tidymodels"
format:
html:
css: styles.css
code-fold: true
code-summary: Show the code
code-links:
- text: "GitHub repository"
icon: github
target: blank
href: https://github.com/posit-marketing/finance-app/
- text: "Model pin"
icon: pin
target: blank
href: https://pub.demo.posit.team/public/lend-fit-model-pin/
- text: "Vetiver API"
icon: arrow-left-right
target: blank
href: https://pub.demo.posit.team/public/lend-fit-vetiver-api/
- text: "Shiny app"
icon: window
target: blank
href: https://pub.demo.posit.team/public/lend-rate-prediction-app/
title-block-banner: banner.png
toc: true
filters:
- shinylive
resource_files:
- banner.png
---
Machine learning algorithms are reshaping financial decision-making, changing how the industry understands and manages financial risk. By analyzing vast amounts of data, these advanced algorithms deliver predictive insights that drive informed decisions and expedite client service. One example is in the consumer credit market, where accurately predicting lending rates is critical for customer acquisition and retention.
By analyzing historical data, machine learning makes it possible to offer potential clients personalized lending rates quickly. This reduces the chance of losing customers to faster alternatives while minimizing loan-associated risk. Efficient yet thorough application processing helps maintain a competitive edge in the market.
Financial analysts using Databricks can harness its performance and robust data governance capabilities, particularly when working with common datasets stored in Delta Lake. Analysts can use [ODBC](https://odbc.r-dbi.org/) to link with a SQL warehouse or [sparklyr](https://spark.posit.co/deployment/databricks-connect.html) to interface with a Databricks cluster.
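For illustration only, a sparklyr connection to a Databricks cluster might look like the sketch below (we use ODBC for the rest of this post). The cluster ID is a placeholder, and the sketch assumes the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables are set and that the Databricks Connect backend for sparklyr (the pysparklyr package) is installed.
```{r}
#| label: sparklyr-sketch
#| eval: false
#| code-fold: false
# Illustrative sketch only: connect to a Databricks cluster with sparklyr.
# Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set in the environment.
library(sparklyr)
sc <- spark_connect(
  cluster_id = "your-cluster-id",  # placeholder: your Databricks cluster ID
  method = "databricks_connect"
)
```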
Once the data is accessible, the [tidymodels framework of packages](https://www.tidymodels.org/) offers modeling and machine learning capabilities. Analysts can use these tools in managed environments with their preferred IDE in [Posit Workbench](https://posit.co/products/enterprise/workbench/) and deploy and share their models on [Posit Connect](https://posit.co/products/enterprise/connect/). This integration combines the data governance strengths of Databricks with the powerful tooling offered by Posit, making analysts' work more productive and streamlined.
In this article, we will use publicly accessible loan applicant data from [LendingClub](https://www.lendingclub.com/) to create a machine learning model.^[Thank you to James Andersen for the blog post "Predicting Lending Rates: An Intro to AWS Machine Learning," which served as a starting point for this post.] Our goal is to develop a personalized lending rate prediction model tailored to individual criteria.
## Databricks Delta Lake connection to RStudio
For those with access to Databricks, add the table to your catalog by running the following:
```sql
CREATE TABLE lendingclub USING com.databricks.spark.csv OPTIONS(path 'dbfs:/databricks-datasets/lending-club-loan-stats/LoanStats_2018Q2.csv', header "true");
```
Let's start by loading the packages that will be integral to our workflow.
```{r}
#| label: setup
#| message: false
# Importing data
library(DBI)
# Wrangling packages
library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
# Visualization packages
library(dbplot)
library(ggplot2)
# Modeling packages
library(ranger)
library(rsample)
library(recipes)
library(parsnip)
library(workflows)
library(yardstick)
# MLOps
library(pins)
library(plumber)
library(vetiver)
```
As previously mentioned, analysts have several options for accessing Databricks data. In this walk-through, we'll demonstrate connectivity using the [odbc package](https://odbc.r-dbi.org/). The Posit Solutions site provides detailed guidance on using the new [odbc `databricks()` function](https://solutions.posit.co/connections/db/databases/databricks/index.html#using-the-new-odbcdatabricks-function). Essentially, we need to store a Databricks token and host URL in our R environment. The function itself requires only the HTTP path to the SQL warehouse.
```{r}
#| label: odbc-connection
# Sys.setenv("DATABRICKS_HOST" = "your-databricks-host.com")
# Sys.setenv("DATABRICKS_TOKEN" = "your-databricks-token")
con <-
dbConnect(odbc::databricks(), httpPath = Sys.getenv("HTTP_PATH"))
```
This code snippet establishes a connection and stores it in the `con` object. From there, we can use dplyr and dbplyr to navigate to our table. We retrieve the `lendingclub` table using `tbl()` and store it in an object named `lendingclub_dat`.
```{r}
#| label: get-tbl-data
lendingclub_dat <-
tbl(con,
dbplyr::in_catalog("hive_metastore", "default", "lendingclub"))
```
We can use the [dbplot package](https://edgararuiz.github.io/dbplot/) to generate a ggplot without transferring the data into R. This visualization displays the distribution of the variable we aim to predict, the lending rate. Lending rates are plotted on the x-axis, while the count of loans at each rate appears on the y-axis. The majority of loans have a lending rate of less than 20%.
```{r}
#| label: dbplot-int-rate
#| fig-asp: 0.618
#| warning: false
lendingclub_dat |>
mutate(int_rate = as.numeric(REGEXP_REPLACE(int_rate, "%", ""))) |>
db_compute_bins(int_rate, binwidth = 0.5) |>
ggplot() +
geom_col(
aes(x = int_rate, y = count),
fill = "#1B909E",
color = "#1B909E",
alpha = 0.4
) +
labs(
title = "Distribution of lending rate",
x = "Lending rate",
y = "Count"
) +
theme_minimal()
```
## Data cleaning and feature engineering
The `lendingclub_dat` object resembles a typical data frame, but it is actually a reference to a remote SQL table. With the [dbplyr package](https://dbplyr.tidyverse.org/), we can use dplyr syntax to query and manipulate these SQL tables directly from R.
Certain columns in `lendingclub_dat` are imported as characters instead of their intended numeric type. With dplyr's `mutate()` function, we can convert them to the correct type. The tabs below show the conversion of columns whose names start with `annual`, written in both dplyr and SQL syntax. The `show_query()` function reveals the underlying SQL query, highlighting how concise dplyr's syntax is compared to the more verbose SQL.
::: {.panel-tabset}
## R syntax
```{r}
#| label: dplyr-syntax
#| code-fold: false
#| eval: false
tbl(con,
dbplyr::in_catalog("hive_metastore", "default", "lendingclub")) |>
mutate(across(c(starts_with("annual")), ~ as.numeric(.)))
```
## SQL syntax
```{r}
#| label: sql-syntax
#| code-fold: false
tbl(con,
dbplyr::in_catalog("hive_metastore", "default", "lendingclub")) |>
mutate(across(c(starts_with("annual")), ~ as.numeric(.))) |>
show_query()
```
:::
Now we can use dplyr to clean up the remaining variables and create supplementary variables that could influence lending rates. Running these operations in the database before pulling the data into R with `collect()` is significantly faster.
```{r}
#| label: pre-collect-data-clean
lendingclub_dat <- tbl(
con,
dbplyr::in_catalog("hive_metastore", "default", "lendingclub")
) |>
mutate(
# Convert columns into numeric
across(
c(
starts_with("annual"), starts_with("dti"), starts_with("inq"), starts_with("mo"), starts_with("mths"), starts_with("num"), starts_with("open"), starts_with("percent"), starts_with("pct"), starts_with("revol"), starts_with("tot"), "all_util", "il_util", "tax_liens", "loan_amnt", "installment", "pub_rec_bankruptcies", "num_tl_120dpd_2m", "bc_util", "max_bal_bc", "bc_open_to_buy", "acc_open_past_24mths", "avg_cur_bal", "delinq_2yrs", "pub_rec"
),
~ as.numeric(.)
),
# Calculate loan to income statistic
loan_to_income = if_else(
application_type == "Individual",
loan_amnt / annual_inc,
loan_amnt / annual_inc_joint
),
# Calculate percentage of monthly income the installment payment represents
installment_pct_inc = if_else(
application_type == "Individual",
installment / (annual_inc / 12),
installment / (annual_inc_joint / 12)
),
# Calculate an adjusted debt-to-income ratio that includes the new loan amount
adjusted_dti = if_else(
application_type == "Individual",
(loan_amnt + tot_cur_bal) / (annual_inc),
(loan_amnt + tot_cur_bal) / (annual_inc_joint)
),
# Calculate utilization on installment accounts excluding mortgage balance
il_util_ex_mort = if_else(
total_il_high_credit_limit > 0,
total_bal_ex_mort / total_il_high_credit_limit,
0
),
# Fill debt to income joint with individual debt to income where missing
dti_joint = coalesce(dti_joint, dti),
# Fill annual income joint with individual annual income where missing
annual_inc_joint = coalesce(annual_inc_joint, annual_inc)
) |>
collect()
```
After running `collect()`, the data is a local data frame in R, giving us access to R functions that are not available in SQL. This lets us further refine and clean the data as needed.
```{r}
#| label: post-collect-data-clean
lendingclub_dat_clean <- lendingclub_dat |>
mutate(
# Missing values for these columns seem most appropriate to fill with zero
across(
c(
"inq_fi", "dti", "all_util", "percent_bc_gt_75", "il_util", "avg_cur_bal", "all_util", "il_util", "inq_last_6mths", "inq_last_12m", "num_tl_120dpd_2m", "open_il_12m", "open_il_24m", "open_rv_12m", "open_rv_24m"
),
~ replace_na(., 0)
),
# Missing values for these columns seem most appropriate to fill with column max
across(
c(
"mo_sin_old_il_acct", "mths_since_last_major_derog", "mths_since_last_delinq", "mths_since_recent_bc", "mths_since_last_record", "mths_since_rcnt_il", "mths_since_recent_bc", "mths_since_recent_bc_dlq", "mths_since_recent_inq", "mths_since_recent_revol_delinq", "mths_since_recent_revol_delinq"
),
~ replace_na(., max(., na.rm = TRUE))
),
# Remove percent sign
int_rate = as.numeric(str_remove(int_rate, "%")),
# Remove percent sign
revol_util = as.numeric(str_remove(revol_util, "%")),
# Create variable for earliest line of credit
earliest_cr_line = parse_date_time2(
paste("01", earliest_cr_line, sep = "-"), "dmy",
cutoff_2000 = 50L
),
# Calculate time since earliest line of credit
age_earliest_cr = interval(
as.Date(earliest_cr_line),
as.Date(today())
) %/% days(1),
# Trim whitespace from the loan term
term = trimws(term),
# Convert remaining character columns to factors
across(where(is.character), .fns = as.factor)
)
```
With our data cleaned up, it's time to examine the variables we intend to include in the model. We create vectors for conveniently referencing variable categories later if necessary. Then, we compile all the desired columns into one vector named `all_vars` for use in our model.
```{r}
#| label: create-variable-vectors
applicant_numeric <- c(
"annual_inc", "dti", "age_earliest_cr", "loan_amnt", "installment"
)
applicant_text <- c("emp_title", "title")
applicant_categorical <- c("application_type", "emp_length", "term")
credit_numeric <- c(
"acc_open_past_24mths", "avg_cur_bal", "bc_open_to_buy", "bc_util",
"delinq_2yrs", "open_acc", "pub_rec", "revol_bal", "tot_coll_amt",
"tot_cur_bal", "total_acc", "total_rev_hi_lim", "num_accts_ever_120_pd",
"num_actv_bc_tl", "num_actv_rev_tl", "num_bc_sats", "num_bc_tl", "num_il_tl",
"num_rev_tl_bal_gt_0", "pct_tl_nvr_dlq", "percent_bc_gt_75", "tot_hi_cred_lim",
"total_bal_ex_mort", "total_bc_limit", "total_il_high_credit_limit",
"total_rev_hi_lim", "all_util", "loan_to_income", "installment_pct_inc",
"il_util", "il_util_ex_mort", "total_bal_il", "total_cu_tl"
)
NUMERIC_VARS_QB_20 <- c(
"inq_last_6mths", "mo_sin_old_il_acct", "mo_sin_old_rev_tl_op",
"mo_sin_old_rev_tl_op", "mo_sin_rcnt_tl", "mort_acc", "num_op_rev_tl",
"num_rev_accts", "num_sats", "pub_rec", "pub_rec_bankruptcies", "tax_liens",
"all_util", "loan_to_income"
)
NUMERIC_VARS_QB_5 <- c("num_tl_120dpd_2m")
NUMERIC_VARS_QB_10 <- c(
"mths_since_last_delinq", "mths_since_last_major_derog",
"mths_since_last_record", "mths_since_rcnt_il", "mths_since_recent_bc",
"mths_since_recent_bc_dlq", "mths_since_recent_inq",
"mths_since_recent_revol_delinq", "num_tl_90g_dpd_24m", "num_tl_op_past_12m"
)
NUMERIC_VARS_QB_50 <- c(
"installment", "bc_open_to_buy", "loan_amnt", "total_bc_limit", "percent_bc_gt_75"
)
mean_impute_vals <- c(
"bc_util", "num_rev_accts", "bc_open_to_buy", "percent_bc_gt_75",
"total_bal_il", "total_il_high_credit_limit", "total_cu_tl"
)
all_vars <- c(
applicant_numeric, applicant_categorical, credit_numeric,
NUMERIC_VARS_QB_20, NUMERIC_VARS_QB_5, NUMERIC_VARS_QB_10, NUMERIC_VARS_QB_50
)
```
Next, we select only the relevant columns for our model and drop any rows with a missing `int_rate`.
```{r}
#| label: create-model-dataset
lendingclub_dat_cols <-
lendingclub_dat_clean |>
select(int_rate, all_of(all_vars)) |>
filter(!is.na(int_rate))
```
To see how many missing values remain in each column, we can run the following:
```{r}
#| label: check-nas
#| code-fold: false
colSums(is.na(lendingclub_dat_cols))
```
We can confirm that no factor columns have fewer than two levels by running the following:
```{r}
#| label: check-factors
#| code-fold: false
lendingclub_dat_cols |>
select(where(is.factor)) |>
select(where(~ nlevels(.) < 2))
```
## Model creation
With our dataset prepared, we can begin the modeling process. This involves creating train and test datasets, which we'll use to train and evaluate our model.
```{r}
#| label: create-test-train
set.seed(1234)
train_test_split <- initial_split(lendingclub_dat_cols)
lend_train <- training(train_test_split)
lend_test <- testing(train_test_split)
```
Using tidymodels, we can construct a "recipe" detailing the preprocessing steps we want to apply to our data:
* `step_normalize()`: This recipe step normalizes numeric data to have a standard deviation of one and a mean of zero; the formula after this list shows the exact transformation. Given that our dataset contains numeric values in various units (e.g., dollars, months), normalization ensures that variables with larger value ranges do not disproportionately influence the model.
* `step_impute_mean()`: This recipe step replaces missing values of numeric variables with the mean of those values in the training set. Performing this step now, rather than during the data cleaning phase, prevents information leakage from the testing set into the model. By calculating the mean solely using the training set, we avoid bias when applying it to the testing data during model evaluation.
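Concretely, `step_normalize()` rescales each selected numeric predictor using the mean and standard deviation estimated from the training set:
$$
z = \frac{x - \bar{x}_{\text{train}}}{s_{\text{train}}}
$$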
```{r}
#| label: create-recipe
rec_obj <- recipe(int_rate ~ ., data = lend_train) |>
step_normalize(applicant_numeric, credit_numeric) |>
step_impute_mean(mean_impute_vals)
```
To verify that the recipe functions as intended, we can run `prep(rec_obj, lend_train) |> bake(new_data = NULL)`. This displays the data that the workflow will provide to the model.
```{r}
#| label: display-workflow-data
#| warning: false
#| message: false
prep(rec_obj, lend_train) |>
bake(new_data = NULL)
```
Now, we proceed with a linear model. `linear_reg()` specifies a model capable of predicting numeric values from predictors using a linear function. We construct a workflow incorporating our recipe alongside the linear model.
```{r}
#| label: run-linear-workflow
lend_linear <- linear_reg()
lend_linear_wflow <-
workflow() |>
add_model(lend_linear) |>
add_recipe(rec_obj)
```
Now, we can fit our model using `fit()` along with the training dataset.
```{r}
#| label: fit-linear-workflow
lend_linear_fit <-
lend_linear_wflow |>
fit(data = lend_train)
```
Below are our predicted lending rates:
```{r}
#| label: predict-linear-workflow
#| warning: false
predict(lend_linear_fit, lend_test)
```
Now, let's evaluate our performance. We can calculate the coefficient of determination on the training set, which is about 0.6 in this case. While this indicates some explanatory power, it suggests that the linear model's fit is not especially strong.
```{r}
#| label: rsq-linear-workflow
#| warning: false
lend_linear_results <-
bind_cols(predict(lend_linear_fit, lend_train)) |>
bind_cols(lend_train |>
select(int_rate))
rsq(lend_linear_results, truth = int_rate, estimate = .pred)
```
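For reference, the coefficient of determination can be read as the share of the variance in observed lending rates that the model explains,
$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
$$
so a value around 0.6 means the linear model accounts for roughly 60% of the variation in `int_rate`. (yardstick's `rsq()` reports the squared correlation between observed and predicted values, which coincides with this definition for an ordinary least-squares fit evaluated on its own training data.)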
However, one of the advantages of tidymodels is the ability to quickly pivot with the steps we've already executed. Instead of a linear model, we can now define a random forest model.
```{r}
#| label: set-ranger-workflow
lend_rand <- rand_forest(mode = "regression") |>
set_engine("ranger", importance = "permutation")
```
Then, we can pass that model into our workflow and proceed to fit the model.
<!---The random forest model takes several minutes to run. To expedite document rendering, I've saved the model in a pin and called it from the chunk below. -->
```{r}
#| label: pull-ranger-pin
#| include: false
board <-
board_connect(
auth = "manual",
server = Sys.getenv("CONNECT_SERVER"),
key = Sys.getenv("CONNECT_API_KEY")
)
lend_ranger_fit <-
pin_read(
board = board,
name = "isabella.velasquez@posit.co/lend_ranger_fit"
)
```
```{r}
#| label: run-fit-ranger-workflow
#| eval: false
lend_ranger_wflow <-
workflow() |>
add_model(lend_rand) |>
add_recipe(rec_obj)
lend_ranger_fit <-
lend_ranger_wflow |>
fit(data = lend_train)
```
Recalculating the coefficient of determination, we get:
```{r}
#| label: rsq-ranger-workflow
lend_ranger_results <-
bind_cols(predict(lend_ranger_fit, lend_train)) |>
bind_cols(lend_train |> select(int_rate))
rsq(lend_ranger_results, truth = int_rate, estimate = .pred)
```
Now, that's a significant improvement!
Given the number of variables in our model, it's natural to want to identify the most important ones. We can compute variable importance by running the `vi()` function from the vip package.
```{r}
#| label: vi-ranger-workflow
vip::vi(lend_ranger_fit) |>
arrange(desc(Importance))
```
According to this analysis, the most important variables are `term`, `installment_pct_inc`, `bc_open_to_buy`, `installment`, and `percent_bc_gt_75`.
## Model logging and artifact storage
If we aim to log our model for use in other contexts, like a Shiny app, we can use the [vetiver package](https://vetiver.posit.co/). This tool facilitates deploying and maintaining machine learning models in production and lets us store models as pins for convenient access and reference.
Let's rebuild our model using only the five most important variables.
```{r}
#| label: select-ranger-workflow
#| eval: false
imp_var <- c("term", "installment_pct_inc", "bc_open_to_buy", "installment", "percent_bc_gt_75")
lendingclub_dat_cols_select <-
lendingclub_dat_cols |>
select(int_rate, all_of(imp_var))
train_test_split_select <- initial_split(lendingclub_dat_cols_select)
lend_train_select <- training(train_test_split_select)
lend_test_select <- testing(train_test_split_select)
rec_obj_select <- recipe(int_rate ~ ., data = lend_train_select) |>
step_normalize(c("installment", "installment_pct_inc", "percent_bc_gt_75")) |>
step_impute_mean("bc_open_to_buy") |>
step_other()
lend_select_wflow <-
workflow() |>
add_model(lend_rand) |>
add_recipe(rec_obj_select)
lend_select_fit <-
lend_select_wflow |>
fit(data = lend_train_select)
```
We can create a vetiver object `v` to store the trained model.
```{r}
#| label: create-vetiver-obj
#| eval: false
v <- vetiver_model(lend_select_fit, "lend_fit")
```
Next, we connect to the board where we want to store our model artifact. In our case, this is Posit Connect. We then write the vetiver model to a pin so it can be retrieved later.
```{r}
#| label: write-vetiver-pin
#| eval: false
board <-
board_connect(
auth = "manual",
server = Sys.getenv("CONNECT_SERVER"),
key = Sys.getenv("CONNECT_API_KEY")
)
board |> vetiver_pin_write(v)
```
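As an aside, vetiver can also serve a pinned model locally by building a Plumber router; `vetiver_api()` adds a POST `/predict` endpoint for the model. The sketch below is illustrative only and is not run in this post; the port number is arbitrary.
```{r}
#| label: local-plumber-sketch
#| eval: false
#| code-fold: false
# Illustrative sketch: serve the vetiver model v locally as a Plumber API.
# vetiver_api() adds a POST /predict endpoint to the router.
library(plumber)
pr() |>
  vetiver_api(v) |>
  pr_run(port = 8080)  # arbitrary local port for testing
```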
Now we can deploy the pinned model to Posit Connect with `vetiver_deploy_rsconnect()`, which publishes it as a Plumber API so predictions can be generated on demand.
```{r}
#| label: deploy-vetiver-model
#| eval: false
vetiver_deploy_rsconnect(
board = board,
name = "isabella.velasquez@posit.co/lend_fit",
predict_args = list(debug = TRUE)
)
```
Now that our model is available through an API, we can use it in other places -- such as a Shiny app!
## Interactive Shiny app
Since our model is pinned on Posit Connect and served through an API, it can be used by other applications, such as a Shiny app. Within the `server` section of our Shiny app, we connect to our pin board:
```{r}
#| label: connect-shiny-app
#| eval: false
#| code-fold: false
board <-
board_connect(server = Sys.getenv("CONNECT_SERVER"),
key = Sys.getenv("CONNECT_API_KEY"))
```
Access the endpoint URL with `vetiver_endpoint()`:
```{r}
#| label: access-shiny-endpoint
#| code-fold: false
#| eval: false
url <- "https://pub.demo.posit.team/public/lend-fit-vetiver-api/"
endpoint <- vetiver_endpoint(url)
```
Also in the server portion of the Shiny app, establish a reactive variable (named `predictions_df` here) that generates a tibble from the user's inputs and passes it to the endpoint.
```{r}
#| label: make-shiny-predictions
#| eval: false
#| code-fold: false
predictions_df <- reactive({
req(
input$select_term,
input$input_installment_pct_inc,
input$input_bc_open_to_buy,
input$input_installment,
input$input_percent_bc_gt_75
)
pred_tibble <-
tibble(
term = input$select_term,
installment_pct_inc = input$input_installment_pct_inc,
bc_open_to_buy = input$input_bc_open_to_buy,
installment = input$input_installment,
percent_bc_gt_75 = input$input_percent_bc_gt_75
)
url <- "https://pub.demo.posit.team/public/lend-fit-vetiver-api/predict"
endpoint <- vetiver_endpoint(url)
api_key <- Sys.getenv("CONNECT_API_KEY")
predictions <- predict(
endpoint,
pred_tibble,
httr::add_headers(Authorization = paste("Key", api_key))
)
})
```
Below is our Shiny app for predicting lending rates. Each time you change an input, it recalculates the rate using our random forest model. The design and layout of the Shiny app are built with [bslib](https://rstudio.github.io/bslib/).
```{r}
#| label: predict-shiny-app
#| eval: false
library(shiny)
library(bslib)
library(bsicons)
library(dplyr)
library(pins)
library(vetiver)
tidymodels::tidymodels_prefer(quiet = TRUE)
cards <- list(
card(full_screen = TRUE,
card_header(HTML("<br><br><br>Term of loan:")),
card_body(
selectInput(
inputId = "select_term",
choices = c("36 months", "60 months"),
selected = "36 months",
label = HTML("<br>Select 36 or 60 months:")
)
)),
card(full_screen = TRUE,
card_header("Proportion of monthly income represented by the installment payment:"),
card_body(
numericInput(
inputId = "input_installment_pct_inc",
label = HTML("Input a number between<br>0 and 1:"),
value = 0.07,
min = 0,
max = 1,
step = 0.01
)
)),
card(full_screen = TRUE,
card_header(HTML("<br><br>Open to buy on revolving bankcards:")),
card_body(
numericInput(
inputId = "input_bc_open_to_buy",
label = HTML("Input a number between<br>0 and 500,000:"),
value = 14467,
min = 0,
max = 500000,
step = 1000
)
)),
card(full_screen = TRUE,
card_header(HTML("<br><br><br>Installment payment:")),
card_body(
numericInput(
inputId = "input_installment",
label = HTML("Input a number between<br>30 and 2000:"),
value = 463,
min = 30,
max = 2000,
step = 100
)
)),
card(full_screen = TRUE,
card_header(HTML("<br><br>Percentage of all bankcard accounts > 75")),
card_body(
numericInput(
inputId = "input_percent_bc_gt_75",
label = HTML("<br>Input a number<br>between 0 and 100:"),
value = 50,
min = 0,
max = 100,
step = 5
)
))
)
vbs <- list(
value_box(
title = "Predicted interest rate",
value = textOutput("pred_int"),
style = "background-color: #082D46!important; color: #FFFFFF!important",
showcase = bsicons::bs_icon("bank", size = "0.75em"),
showcase_layout = "top right",
full_screen = FALSE,
fill = TRUE,
height = NULL
)
)
foot <-
tags$div(
style = "background-color: #FFFFFF; padding: 0px; text-align: center; bottom: 0; width: 100%;",
HTML(
"Powered by <a href='https://posit.co'><img src='https://www.rstudio.com/assets/img/posit-logo-fullcolor-TM.svg' alt='Posit Logo' style='width:55px;'></a> | Integrated with <a href='https://www.databricks.com'><img src='https://cdn.cookielaw.org/logos/29b588c5-ce77-40e2-8f89-41c4fa03c155/bc546ffe-d1b7-43af-9c0b-9fcf4b9f6e58/1e538bec-8640-4ae9-a0ca-44240b0c1a20/databricks-logo.png' alt='Databricks Logo' style='width:85px;'></a>. For more details, see our <a href='https://posit.co/blog/databricks-and-posit-announce-new-integrations/' target='_blank'>blog post</a> announcing the partnership."
)
)
ui <- bslib::page(
title = "Interest rate prediction app",
layout_columns(width = 1/5,
height = 275,
cards[[2]], cards[[3]], cards[[4]], cards[[5]], cards[[1]]),
layout_columns(vbs[[1]]),
card_footer(foot)
)
server <- function(input, output, session) {
board <-
board_connect(server = Sys.getenv("CONNECT_SERVER"),
key = Sys.getenv("CONNECT_API_KEY"))
predictions_df <- reactive({
req(input$select_term, input$input_installment_pct_inc, input$input_bc_open_to_buy, input$input_installment, input$input_percent_bc_gt_75)
pred_tibble <-
tibble(
term = input$select_term,
installment_pct_inc = input$input_installment_pct_inc,
bc_open_to_buy = input$input_bc_open_to_buy,
installment = input$input_installment,
percent_bc_gt_75 = input$input_percent_bc_gt_75
)
url <-
"https://pub.demo.posit.team/public/lend-fit-vetiver-api/predict"
endpoint <- vetiver_endpoint(url)
apiKey <- Sys.getenv("CONNECT_API_KEY")
predictions <-
predict(endpoint,
pred_tibble,
httr::add_headers(Authorization = paste("Key", apiKey)))
})
output$pred_int <- renderText({
predictions_df()$.pred
})
}
shinyApp(ui, server)
```
:::{.column-page}
<iframe src="https://pub.demo.posit.team/public/lend-rate-prediction-app/" width="100%" height="500px"></iframe>
:::
In summary, we walked through predicting lending rates with machine learning in a financial analysis context. We accessed our data from Databricks, cleaned it, and used tidymodels for modeling and evaluation. We developed two predictive models, compared their performance, and identified the key variables driving lending rate predictions.
With vetiver, we deployed our model so it can be integrated into production environments, such as Shiny apps. This end-to-end approach can help improve decision-making in the financial domain.
## Learn more about the Databricks x Posit partnership
We believe our products are better together. Learn more about our partnership.
* Visit the [Databricks x Posit Solutions Page](https://posit.co/solutions/databricks/).
* View our [co-presented webinar](https://www.youtube.com/watch?v=iShpyDxzMeE) where we talked about improved productivity for your data teams.
* [Schedule a demo](https://posit.co/schedule-a-call/?booking_calendar__c=Databricks).