-
Notifications
You must be signed in to change notification settings - Fork 173
/
inf-model-mlr.qmd
837 lines (709 loc) · 43.3 KB
/
inf-model-mlr.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
# Inference for linear regression with multiple predictors {#sec-inf-model-mlr}
\chaptermark{Inference for regression with multiple predictors}
```{r}
#| include: false
source("_common.R")
```
::: {.chapterintro data-latex=""}
In [Chapter -@sec-model-mlr], the least squares regression method was used to estimate linear models which predicted a particular response variable given more than one explanatory variable.
Here, we discuss whether each of the variables individually is a statistically discernible predictor of the outcome or whether the model might be just as strong without that variable.
That is, as before, we apply inferential methods to ask whether a variable could have come from a population where the particular coefficient at hand was zero.
If one of the linear model coefficients is truly zero (in the population), then the estimate of the coefficient (using least squares) will vary around zero.
The inference task at hand is to decide whether the coefficient's difference from zero is large enough to decide that the data cannot possibly have come from a model where the true population coefficient is zero.
Both the derivations from the mathematical model and the randomization model are beyond the scope of this book, but we are able to calculate p-values using statistical software.
We will discuss interpreting p-values in the multiple regression setting and note some scenarios where careful understanding of the context and the relationship between variables is important.
We use cross-validation as a method for independent assessment of the multiple linear regression model.
:::
```{r}
#| include: false
terms_chp_25 <- c("inference on multiple linear regression")
```
## Multiple regression output from software {#sec-inf-mult-reg-soft}
Recall the `loans` data from [Chapter -@sec-model-mlr].
::: {.data data-latex=""}
The [`loans_full_schema`](http://openintrostat.github.io/openintro/reference/loans_full_schema.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
Based on the data in this dataset we have created two new variables: `credit_util` which is calculated as the total credit utilized divided by the total credit limit and `bankruptcy` which turns the number of bankruptcies to an indicator variable (0 for no bankruptcies and 1 for at least 1 bankruptcies).
We will refer to this modified dataset as `loans`.
:::
\clearpage
Now, our goal is to create a model where `interest_rate` can be predicted using the variables `debt_to_income`, `term`, and `credit_checks`.
As you learned in @sec-model-mlr, least squares can be used to find the coefficient estimates for the linear model.
The unknown population model can be written as:
$$
\begin{aligned}
E[\texttt{interest\_rate}] = \beta_0 &+ \beta_1\times \texttt{debt\_to\_income} \\
&+ \beta_2 \times \texttt{term}\\
&+ \beta_3 \times \texttt{credit\_checks}\\
\end{aligned}
$$
```{r}
#| label: tbl-loansmodel
#| tbl-cap: |
#| Summary of a linear model for predicting interest rate based on
#| `debt_to_income`, `term`, and `credit_checks`. Each of the variables
#| has its own coefficient estimate as well as a p-value.
#| tbl-pos: H
loans <- loans_full_schema |>
mutate(
credit_util = total_credit_utilized / total_credit_limit,
bankruptcy = as.factor(if_else(public_record_bankrupt == 0, 0, 1)),
verified_income = droplevels(verified_income)
) |>
rename(credit_checks = inquiries_last_12m) |>
select(interest_rate, verified_income, debt_to_income, credit_util, bankruptcy, term, credit_checks, issue_month)
lm(interest_rate ~ debt_to_income + term + credit_checks, data = loans) |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
The estimated equation for the regression model may be written as a model with three predictor variables:
$$
\widehat{\texttt{interest\_rate}} = 4.31 + 0.041 \times \texttt{debt\_to\_income} + 0.16 \times \texttt{term} + 0.25 \times \texttt{credit\_checks}
$$
Not only does @tbl-loansmodel provide the estimates for the coefficients, it also provides information on the inference analysis (i.e., hypothesis testing) which is the focus of this chapter.
In @sec-inf-model-slr, you learned that the hypothesis test for a linear model with **one predictor**[^25-inf-model-mlr-1] can be written as: \index{variable!predictor}\index{predictor variable}
[^25-inf-model-mlr-1]: In previous sections, the term **explanatory variable**\index{variable!explanatory}\index{explanatory variable} was used instead of **predictor**.
The words are synonymous and are used separately in the different sections to be consistent with how most analysts use them: explanatory variable for testing, predictor for modeling.
```{r}
#| include: false
terms_chp_25 <- c(terms_chp_25, "predictor")
```
> if only one predictor, $H_0: \beta_1 = 0.$
That is, if the true population slope is zero, the p-value measures how likely it would be to select data which produced the observed slope ($b_1$) value.
With **multiple predictors**\index{variable!multiple predictors}\index{multiple predictors}, the hypothesis is similar, however, it is now conditioned on each of the other variables remaining in the model.
```{r}
#| include: false
terms_chp_25 <- c(terms_chp_25, "multiple predictors")
```
> if multiple predictors, $H_0: \beta_i = 0$ given other variables in the model
Using the example above and focusing on each of the variable p-values (here we won't discuss the p-value associated with the intercept), we can write out the three different hypotheses:
- $H_0: \beta_1 = 0$, given `term` and `credit_checks` are included in the model
- $H_0: \beta_2 = 0$, given `debt_to_income` and `credit_checks` are included in the model
- $H_0: \beta_3 = 0$, given `debt_to_income` and `term` are included in the model
The very low p-values from the software output tell us that each of the variables acts as an important predictor in the model, despite the inclusion of the other two.
Consider the p-value on $H_0: \beta_1 = 0$.
The low p-value says that it would be extremely unlikely to see data that produce a coefficient on `debt_to_income` as large as 0.041 if the true relationship between `debt_to_income`and `interest_rate` was non-existent (i.e., if $\beta_1 = 0$) and the model also included `term` and `credit_checks`.
You might have thought that the value 0.041 is a small number (i.e., close to zero), but in the units of the problem, 0.041 turns out to be far away from zero, it's all about context!
The p-values on `term` and on `credit_checks` are interpreted similarly.
Sometimes a set of predictor variables can impact the model in unusual ways, often due to the predictor variables themselves being correlated.
\clearpage
## Multicollinearity {#sec-inf-mult-reg-collin}
In practice, there will almost always be some degree of correlation between the explanatory variables in a multiple regression model.
For regression models, it is important to understand the entire context of the model, particularly for correlated variables.
Our discussion will focus on interpreting coefficients (and their signs) in relationship to other variables as well as the discernibility (i.e., the p-value) of each coefficient.
Consider an example where we would like to predict how much money is in a coin dish based only on the number of coins in the dish.
We ask 26 students to tell us about their individual coin dishes, collecting data on the total dollar amount, the total number of coins, and the total number of low coins.[^25-inf-model-mlr-2]
The number of low coins is the number of coins minus the number of quarters (a quarter is the largest commonly used US coin, at US\$0.25).
@fig-money illustrates a sample of U.S. coins, their total worth (`total_amount`), the total `number_of_coins`, and the `number_of_low_coins`.
[^25-inf-model-mlr-2]: In all honesty, this particular dataset is fabricated, and the original idea for the problem comes from Jeff Witmer at Oberlin College.
```{r}
#| label: fig-money
#| fig-cap: |
#| A sample of coins with 16 total coins, 10 low coins, and a net worth of \$1.90.
#| fig-alt: |
#| Graphic of five pennies, three nickles, two dimes, and six quarters. A summary
#| table indicates that there are 16 total coins, 10 of which are considered
#| low coins. The total amount of money is one dollar and ninety cents.
#| out-width: 65%
include_graphics("images/money.png")
```
The collected data is given in @fig-coinfig and shows that the `total_amount` of money is more highly correlated with the total `number_of_coins` than it is with the `number_of_low_coins`.
We also note that the total `number_of_coins` and the `number_of_low_coins` are positively correlated.
```{r}
#| label: fig-coinfig
#| fig-cap: |
#| Two plots describing the total amount of money (USD) as a function of the
#| total number of coins or low coins. As you might expect, the total amount
#| of money is more highly postively correlated with the total number of coins
#| than with the number of low coins.
#| fig-subcap:
#| - Total number of coins on the x-axis.
#| - Number of low coins on the x-axis.
#| fig-alt: |
#| Two scatterplots, both with total amount of money on the y-axis. The left
#| plot has total number of coins on the x-axis. The right plot has number of
#| low coins on the x-axis. Both plots show positive correlation, but the total
#| number of coins plot is more strongly correlated than the number of low coins
#| plot.
#| layout-ncol: 2
#| fig-width: 5
#| out-width: 100%
# Data are simulated
money <- tibble(
number_of_coins = c(9, 10, 3, 5, 10, 37, 28, 9, 11, 4, 6, 17, 15, 7, 9, 1, 5, 9, 36, 30, 47, 13, 5, 7, 18, 16),
number_of_low_coins = c(4, 8, 0, 4, 9, 34, 9, 3, 2, 2, 5, 12, 11, 4, 8, 0, 4, 9, 34, 9, 3, 2, 2, 5, 12, 11),
total_amount = c(1.37, 1.01, 1.5, 0.56, 0.61, 3.06, 5.42, 1.75, 5.4, 0.56, 0.34, 2.33, 3.34, 1.3, 1.2, 1.7, 0.86, 0.61, 2.96, 5.52, 8.95, 5.2, 1.56, 0.74, 1.83, 3.74)
)
ggplot(money, aes(y = total_amount, x = number_of_coins)) +
geom_point() +
labs(x = "Number of coins", y = "Total amount (USD)")
ggplot(money, aes(y = total_amount, x = number_of_low_coins)) +
geom_point() +
labs(x = "Number of low coins", y = "Total amount (USD)")
```
\clearpage
Using the total `number_of_coins` as the predictor variable, @tbl-coinhigh provides the least squares estimate of the coefficient is 0.13.
For every additional coin in the dish, we would predict that the student had US\$0.13 more.
The $b_1 = 0.13$ coefficient has a small p-value associated with it, suggesting we would not have seen data like this if `number_of_coins` and `total_amount` of money were not linearly related.
\vspace{-2mm}
$$\widehat{\texttt{total\_amount}} = 0.55 + 0.13 \times \texttt{number\_of\_coins}$$
```{r}
#| label: tbl-coinhigh
#| tbl-cap: |
#| Linear model output predicting the total amount of money based on the
#| total number of coins.
#| tbl-pos: H
lm(total_amount ~ number_of_coins, data = money) |>
tidy() |>
mutate(p.value = ifelse(p.value < .0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
\vspace{-4mm}
Using `number_of_low_coins` as the predictor variable, @tbl-coinlow provides the least squares estimate of the coefficient is 0.02.
For every additional low coin in the dish, we would predict that the student had US\$0.02 more.
The $b_1 = 0.02$ coefficient has a large p-value associated with it, suggesting we could easily have seen data like ours even if the `number_of_low_coins` and `total_amount` of money are not at all linearly related.
\vspace{-2mm}
$$\widehat{\texttt{total\_amount}} = 2.28 + 0.02 \times \texttt{number\_of\_low\_coins}$$
\vspace{-5mm}
```{r}
#| label: tbl-coinlow
#| tbl-cap: |
#| Linear model output predicting the total amount of money based on the
#| number of low coins.
#| tbl-pos: H
lm(total_amount ~ number_of_low_coins, data = money) |>
tidy() |>
mutate(p.value = ifelse(p.value < .0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
\vspace{-5mm}
::: {.workedexample data-latex=""}
Come up with an example of two observations that have the same number of low coins but the number of total coins differs by one.
What is the difference in total amount?
------------------------------------------------------------------------
Two samples of coins with the same number of low coins (3), but a different number of total coins (4 vs 5) and a different total amount (\$0.41 vs \$0.66).
```{r}
#| label: lowsame
#| fig-alt: |
#| A set of four coins is shown: one of each penny, nickle, dime, and quarter.
#| One quarter is then added to create a set of five coins -- one penny, one
#| nickle, one dime, and two quarters. The two sets of coins have the same
#| number of low coins but a different number of total coins.
#| out-width: 45%
#| fig-align: center
knitr::include_graphics("images/lowsame.png")
```
:::
\vspace{-5mm}
::: {.workedexample data-latex=""}
Come up with an example of two observations that have the same total number of coins but a different number of low coins.
What is the difference in total amount?
------------------------------------------------------------------------
Two samples of coins with the same total number of coins (4), but a different number of low coins (3 vs 4) and a different total amount (\$0.41 vs \$0.17).
```{r}
#| label: totalsame
#| fig-alt: |
#| A set of four coins is shown: one of each penny, nickle, dime, and quarter.
#| The quarter is then replaced with a penny to create a new set of four coins --
#| two pennies, one nickle, and one dime. The two sets of coins have the same
#| number of total coins but a different number of low coins.
#| out-width: 45%
#| fig-align: center
knitr::include_graphics("images/totalsame.png")
```
:::
\clearpage
Using both the total `number_of_coins` and the `number_of_low_coins` as predictor variables, @tbl-coinhighlow provides the least squares estimates of both coefficients as 0.21 and -0.16.
Now, with two variables in the model, the interpretation is more nuanced.
- A coefficient interpretation always indicates a change in one variable while keeping the other variable(s) constant.\
For every additional coin in the dish **while** the `number_of_low_coins` stays constant, we would predict that the student had US\$0.21 more. Re-considering the phrase "every additional coin in the dish **while** the number of low coins stays constant" makes us realize that each increase is a single additional quarter (larger samples sizes would have led to a $b_1$ coefficient closer to 0.25 because of the deterministic relationship described here).\
- For every additional low coin in the dish **while** the total `number_of_coins` stays constant, we would predict that the student had US\$0.16 less. Re-considering the phrase "every additional low coin in the dish **while** the number of total coins stays constant" makes us realize that a quarter is being swapped out for a penny, nickel, or dime.
Considering the coefficients across @tbl-coinhigh, @tbl-coinlow, and @tbl-coinhighlow within the context and knowledge we have of US coins allows us to understand the correlation between variables and why the signs of the coefficients would change depending on the model.
Note also, however, that the p-value for the `number_of_low_coins` coefficient changed from @tbl-coinlow to @tbl-coinhighlow.
It makes sense that the variable describing the `number_of_low_coins` provides more information about the `total_amount` of money when it is part of a model which also includes the total `number_of_coins` than it does when it is used as a single variable in a simple linear regression model.
$$
\begin{aligned}
\widehat{\texttt{total\_amount}} = 0.80 &+ 0.21 \times \texttt{number\_of\_coins} \\
&- 0.16 \times \texttt{number\_of\_low\_coins}
\end{aligned}
$$
```{r}
#| label: tbl-coinhighlow
#| tbl-cap: |
#| Linear model output predicting the total amount of money based on both
#| the total number of coins and the number of low coins.
#| tbl-pos: H
lm(total_amount ~ number_of_coins + number_of_low_coins, data = money) |>
tidy() |>
mutate(p.value = ifelse(p.value < .0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
When working with multiple regression models, interpreting the model coefficients is not always as straightforward as it was with the coin example.
However, we encourage you to always think carefully about the variables in the model, consider how they might be correlated among themselves, and work through different models to see how using different sets of variables might produce different relationships for predicting the response variable of interest.
::: {.important data-latex=""}
**Multicollinearity.**
Multicollinearity\index{multicollinearity} happens when the predictor variables are correlated within themselves.
When the predictor variables themselves are correlated, the coefficients in a multiple regression model can be difficult to interpret.
:::
```{r}
#| include: false
terms_chp_25 <- c(terms_chp_25, "multicollinearity")
```
Although diving into the details are beyond the scope of this text, we will provide one more reflection about multicollinearity.
If the predictor variables have some degree of correlation, it can be quite difficult to interpret the value of the coefficient or evaluate whether the variable is a statistically discernible predictor of the outcome.
However, even a model that suffers from high multicollinearity will likely lead to unbiased predictions of the response variable.
So if the task at hand is only to do prediction (and not to interpret coefficients), multicollinearity is likely to not cause you substantial problems.
```{=html}
<!--
Things I didn't talk about but could:
* add R^2 and R^2_adj to the coin model
* For assessing model fit, as is necessary, we will create residual plots of residual vs. fitted (instead of residual vs. X as we saw previously).
* Give another example here with a quadratic relationship and the wonky things that can happen with the p-values.
-->
```
## Cross-validation for prediction error {#sec-inf-mult-reg-cv}
In @sec-inf-mult-reg-soft, p-values were calculated on each of the model coefficients.
The p-value gives a sense of which variables are important to the model; however, a more extensive treatment of variable selection is warranted in a follow-up course or textbook.
Here, we use **cross-validation**\index{cross-validation} **prediction error**\index{prediction error} to focus on which variable(s) are important for predicting the response variable of interest.
In general, linear models are also used to make predictions of individual observations.
In addition to model building, cross-validation provides a method for generating predictions that are not overfit to the particular dataset at hand.
We continue to encourage you to take up further study on the topic of cross-validation, as it is among the most important ideas in modern data analysis, and we are only able to scratch the surface here.
Cross-validation is a computational technique which removes some observations before a model is run, then assesses the model accuracy on the held-out sample.
By removing some observations, we provide ourselves with an independent evaluation of the model (that is, the removed observations do not contribute to finding the parameters which minimize the least squares equation).
Cross-validation can be used in many different ways (as an independent assessment), and here we will just scratch the surface with respect to one way the technique can be used to compare models.
See @fig-cv for a visual representation of the cross-validation process.
```{r}
#| label: fig-cv
#| fig-cap: |
#| The dataset is broken into k folds (here k = 4). One at a time, a model
#| is built using k-1 of the folds, and predictions are calculated on the
#| single held out sample which will be completely independent of the model
#| estimation.
#| fig-alt: |
#| A square represents the original observed data which has been partitioned
#| into four triangular segments. From the original partition, four different
#| settings are considered. The first setting is such that the red, green, and
#| yellow triangular representations of the data are used to build the model;
#| the blue triangular representation of the data is heldout and used for
#| independent model prediction. The second setting is such that the blue,
#| green, and yellow triangular representations of the data are used to build
#| the model; the red triangular representation of the data is heldout and used
#| for independent model prediction. The third setting is such that the red,
#| blue, and yellow triangular representations of the data are used to build
#| the model; the green triangular representation of the data is heldout and
#| used for independent model prediction. The fourth setting is such that the
#| red, green, and blue triangular representations of the data are used to
#| build the model; the yellow triangular representation of the data is
#| held out and used for independent model prediction.
#| fig-width: 10.0
include_graphics("images/CV.png")
```
```{r}
#| include: false
terms_chp_25 <- c(terms_chp_25, "cross-validation", "prediction error")
```
::: {.data data-latex=""}
The [`penguins`](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) data can be found in the [**palmerpenguings**](https://github.com/allisonhorst/palmerpenguins) R package.
:::
Our goal in this section is to compare two different regression models which both seek to predict the mass of an individual penguin in grams.
The observations of three different penguin species include measurements on body size and sex.
The data were collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/) as part of the [Long Term Ecological Research Network](https://lternet.edu/).
[@Gorman:2014] Although not exactly aligned with this research project, you might be able to imagine a setting where the dimensions of the penguin are known (through, for example, aerial photographs) but the mass is not known.
The first model predicts `body_mass_g` by using only the `bill_length_mm`, a variable denoting the length of a penguin's bill, in mm.
The second model predicts `body_mass_g` by using `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species`.
::: {.important data-latex=""}
**Prediction error.**
The predicted error (also previously called the **residual**) is the difference between the observed value and the predicted value (from the regression model).
$$\text{prediction error}_i = e_i = y_i - \hat{y}_i$$
:::
The presentation below (see the comparison of @fig-peng-mass1 and @fig-peng-mass2) shows that the model with more variables predicts `body_mass_g` with much smaller errors (predicted minus actual body mass) than the model which uses only `bill_length_g`.
We have deliberately used a model that intuitively makes sense (the more body measurements, the more predictable mass is).
However, in many settings, it is not obvious which variables or which models contribute most to accurate predictions.
Cross-validation is one way to get accurate independent predictions with which to compare different models.
### Comparing two models to predict body mass in penguins
The question we will seek to answer is whether the predictions of `body_mass_g` are substantially better when `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species` are used in the model, as compared with a model on `bill_length_mm` only.
We refer to the model given with only `bill_lengh_mm` as the **smaller** model.
It is seen in @tbl-peng-lm-bill with coefficient estimates of the parameters as well as standard errors and p-values.
We refer to the model given with `bill_lengh_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species` as the **larger** model.
It is seen in @tbl-peng-lm-all with coefficient estimates of the parameters as well as standard errors and p-values.
Given what we know about high correlations between body measurements, it is somewhat unsurprising that all of the variables have low p-values, suggesting that each variable is a statistically discernible predictor of `body_mass_g`, given all other variables in the model.
However, in this section, we will go beyond the use of p-values to consider independent predictions of `body_mass_g` as a way to compare the smaller and larger models.
\vspace{10mm}
**The smaller model:**
$$
\begin{aligned}
E[\texttt{body\_mass\_g}] &= \ \beta_0 + \beta_1 \times \texttt{bill\_length\_mm}\\
\widehat{\texttt{body\_mass\_g}} &= \ 362.31 + 87.42 \times \texttt{bill\_length\_mm}\\
\end{aligned}
$$
```{r}
#| label: tbl-peng-lm-bill
#| tbl-cap: |
#| Least squares estimates for the smaller regression model
#| predicting `body_mass_g` from `bill_length_mm`.
#| tbl-pos: H
lm(body_mass_g ~ bill_length_mm, data = penguins) |>
tidy() |>
mutate(p.value = ifelse(p.value < .0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
**The larger model:**
$$
\begin{aligned}
E[\texttt{body\_mass\_g}] = \beta_0 &+ \beta_1 \times \texttt{bill\_length\_mm} \\
&+ \beta_2 \times \texttt{bill\_depth\_mm} \\
&+ \beta_3 \times \texttt{flipper\_length\_mm} \\
&+ \beta_4 \times \texttt{sex}\_{male} \\
&+ \beta_5 \times \texttt{species}\_{Chinstrap} \\
&+ \beta_6 \times \texttt{species}\_{Gentoo}\\
\widehat{\texttt{body\_mass\_g}} = -1460.99 &+ 18.20 \times \texttt{bill\_length\_mm} \\
&+ 67.22 \times \texttt{bill\_depth\_mm} \\
&+ 15.95 \times \texttt{flipper\_length\_mm} \\
&+ 389.89 \times \texttt{sex}\_{male} \\
&- 251.48 \times \texttt{species}\_{Chinstrap} \\
&+ 1014.63 \times \texttt{species}\_{Gentoo}\\
\end{aligned}
$$
```{r}
#| label: tbl-peng-lm-all
#| tbl-cap: |
#| Least squares estimates of the larger regression model
#| predicting `body_mass_g` from `bill_length_mm`, `bill_depth_mm`,
#| `flipper_length_mm`, `sex`, and `species`.
#| tbl-pos: H
lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + sex + species, data = penguins) |>
tidy() |>
mutate(p.value = ifelse(p.value < .0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
In order to compare the smaller and larger models in terms of their **ability to predict penguin mass**, we need to build models that can provide independent predictions based on the penguins in the holdout samples created by cross-validation.
To reiterate, each of the predictions that (when combined together) will allow us to distinguish between the smaller and larger are independent of the data which were used to build the model.
In this example, using cross-validation, we remove one quarter of the data before running the least squares calculations.
Then the least squares model is used to predict the `body_mass_g` of the penguins in the holdout sample.
Here we use a 4-fold cross-validation (meaning that one quarter of the data is removed each time) to produce four different versions of each model (other times it might be more appropriate to use 2-fold or 10-fold or even run the model separately after removing each individual data point one at a time).
@fig-massCV1 displays how a model is fit to 3/4 of the data (note the slight differences in coefficients as compared to @tbl-peng-lm-bill), and then predictions are made on the holdout sample.
```{r}
#| label: fig-massCV1
#| fig-cap: |
#| The smaller model. The coefficients are estimated using the least squares
#| model on 3/4 of the dataset with only a single predictor variable.
#| Predictions are made on the remaining 1/4 of the observations. The y-axis
#| in the scatterplot represents the residual, true observed value minus the
#| predicted value. Note that the predictions are independent of the
#| estimated model coefficients.
#| fig-alt: |
#| The left panel shows the linear model predicting body mass in grams using
#| bill length in mm; the model was built using the red, green, and yellow
#| triangular sections of the observed data. The right panel shows a
#| scatterplot of the prediction error versus the fitted values for the set
#| of observations in the blue triangular section of the observed data.
#| The prediction errors range from roughly -1,000 grams to +1,000 grams.
#| out-width: 90%
include_graphics("images/massCV1.png")
```
By repeating the process for each holdout quarter sample, the residuals from the model can be plotted against the predicted values.
We see that the predictions are scattered which shows a good model fit but that the prediction errors vary $\pm$ 1,000 g of the true body mass.
```{r}
#| label: fig-peng-mass1
#| fig-cap: |
#| The smaller model. One quarter at a time, the data were removed from the
#| model building, and the body mass of the removed penguins was predicted.
#| The least squares regression model was fit independently of the removed
#| penguins. The predictions of body mass are based on bill length only.
#| The x-axis represents the predicted value, the y-axis represents the error,
#| the difference between predicted value and actual value.
#| fig-alt: |
#| One quarter at a time, the data were removed from the model building,
#| and the body mass of the removed penguins was predicted. The least squares
#| regression model was fit independently of the removed penguins. The
#| predictions of body mass are based on bill length only. The x-axis
#| represents the predicted value, the y-axis represents the error, the
#| difference between predicted value and actual value). Four different
#| scatterplots are provided, one for each set of predictions (on a quarter
#| of the dataset). All four sets of error values are centered around zero
#| with most observations between -500 and + 500. Virtually all errors are
#| between -1,000 and +1,000.
#| out-width: 100%
#| fig-asp: 0.6
set.seed(47)
penguins <- penguins |> na.omit()
penguinfolds <- caret::createFolds(penguins$body_mass_g, k = 4)
data.frame(
predicted = lm(body_mass_g ~ bill_length_mm, data = penguins[-penguinfolds$Fold1, ]) |>
predict(newdata = penguins[penguinfolds$Fold1, c("bill_length_mm")]),
obs = penguins[penguinfolds$Fold1, ]$body_mass_g,
fold = rep("1st quarter", length(penguinfolds$Fold1))
) |>
rbind(data.frame(
predicted = lm(body_mass_g ~ bill_length_mm, data = penguins[-penguinfolds$Fold2, ]) |>
predict(newdata = penguins[penguinfolds$Fold2, c("bill_length_mm")]),
obs = penguins[penguinfolds$Fold2, ]$body_mass_g,
fold = rep("2nd quarter", length(penguinfolds$Fold2))
)) |>
rbind(data.frame(
predicted = lm(body_mass_g ~ bill_length_mm, data = penguins[-penguinfolds$Fold3, ]) |>
predict(newdata = penguins[penguinfolds$Fold3, c("bill_length_mm")]),
obs = penguins[penguinfolds$Fold3, ]$body_mass_g,
fold = rep("3rd quarter", length(penguinfolds$Fold3))
)) |>
rbind(data.frame(
predicted = lm(body_mass_g ~ bill_length_mm, data = penguins[-penguinfolds$Fold4, ]) |>
predict(newdata = penguins[penguinfolds$Fold4, c("bill_length_mm")]),
obs = penguins[penguinfolds$Fold4, ]$body_mass_g,
fold = rep("4th quarter", length(penguinfolds$Fold4))
)) |>
mutate(error = obs - predicted) |>
# mutate(error2 = error^2) |> summarize(sum(error2)) |> pull() |> format(scientific = FALSE)
ggplot(aes(x = predicted, y = error)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_hline(yintercept = 500, lty = 2, color = "grey") +
geom_hline(yintercept = -500, lty = 2, color = "grey") +
facet_wrap(~fold) +
ylim(c(-1000, 1000)) +
labs(
x = "Predicted body mass (g)",
y = "Prediction error (g)"
)
```
```{r}
#| label: fig-peng-mass1-output
#| eval: false
# this code is used to create the powerpoint images above
massCV1 <- data.frame(
predicted = lm(body_mass_g ~ bill_length_mm, data = penguins[-penguinfolds$Fold1, ]) |>
predict(newdata = penguins[penguinfolds$Fold1, c("bill_length_mm")]),
obs = penguins[penguinfolds$Fold1, ]$body_mass_g,
fold = rep("1st quarter", length(penguinfolds$Fold1))
)
lm(body_mass_g ~ bill_length_mm, data = penguins[-penguinfolds$Fold1, ]) |>
tidy() |>
select(term, estimate) |>
mutate(estimate = round(estimate, 2))
massCV1 |>
mutate(error = obs - predicted) |>
ggplot(aes(x = predicted, y = error)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_hline(yintercept = 500, lty = 2, color = "grey") +
geom_hline(yintercept = -500, lty = 2, color = "grey") +
facet_wrap(~fold) +
ylim(c(-1000, 1000)) +
labs(
x = "Predicted body mass (g)",
y = "Prediction rrror (g)"
)
```
The cross-validation SSE is the sum of squared error associated with the predictions.
Let $\hat{y}_{cv,i}$ be the prediction for the $i^{th}$ observation where the $i^{th}$ observation was in the hold-out fold and the other three folds were used to create the linear model.
For the model using only `bill_length_mm` to predict `body_mass_g`, the CV SSE is 141,552,822.
::: {.important data-latex=""}
**Cross-validation SSE.**
The prediction error from the cross-validated model can be used to calculate a single numerical summary of the model.
The cross-validation SSE is the sum of squared cross-validation prediction errors.
$$\mbox{CV SSE} = \sum_{i=1}^n (\hat{y}_{cv,i} - y_i)^2$$
:::
The same process is repeated for the larger number of explanatory variables.
Note that the coefficients estimated for the first cross-validation model (in @fig-massCV2) are slightly different from the estimates computed on the entire dataset (seen in @tbl-peng-lm-all).
@fig-massCV2 displays the cross-validation process for the multivariable model with a full set of residual plots given in @fig-peng-mass2.
Note that the residuals are mostly within $\pm$ 500g, providing much more precise predictions for the independent body mass values of the individual penguins.
```{r}
#| label: fig-massCV2
#| fig-cap: |
#| The larger model. The coefficients are estimated using the least squares
#| model on 3/4 of the dataset with the five specified predictor variables.
#| Predictions are made on the remaining 1/4 of the observations. The y-axis
#| in the scatterplot represents the residual, true observed value minus the
#| predicted value. Note that the predictions are independent of the estimated
#| model coefficients.
#| fig-alt: |
#| The left panel shows the linear model predicting body mass in grams using
#| bill length, bill depth, flipper length, sex, and species; the model was
#| built using the red, green, and yellow triangular sections of the observed
#| data. The right panel shows a scatterplot of the prediction error versus
#| the fitted values for the set of observations in the blue triangular
#| section of the observed data. The prediction errors range from roughly
#| -500 grams to +500 grams.
#| out-width: 90%
include_graphics("images/massCV2.png")
```
```{r}
#| label: fig-peng-mass2
#| fig-cap: |
#| The larger model. One quarter at a time, the data were removed from the
#| model building, and the body mass of the removed penguins was predicted.
#| The least squares regression model was fit independently of the removed
#| penguins. The predictions of body mass are based on the set of five
#| variables described in @tbl-peng-lm-all. The x-axis represents the
#| predicted value, the y-axis represents the error, the difference between
#| predicted value and actual value.
#| fig-alt: |
#| One quarter at a time, the data were removed from the model building, and
#| the body mass of the removed penguins was predicted. The least squares
#| regression model was fit independently of the removed penguins. The
#| predictions of body mass are based on the set of five variables described
#| in the larger model. The x-axis represents the predicted value, the y-axis
#| represents the error, the difference between predicted value and actual
#| value. Four different scatterplots are provided, one for each set of
#| predictions (on a quarter of the dataset). All four sets of error values
#| are centered around zero virtually all errors are between -500 and + 500.
#| out-width: 100%
#| fig-asp: 0.6
data.frame(
predicted = lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + sex + species, data = penguins[-penguinfolds$Fold1, ]) |>
predict(newdata = penguins[penguinfolds$Fold1, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex", "species")]),
obs = penguins[penguinfolds$Fold1, ]$body_mass_g,
fold = rep("1st quarter", length(penguinfolds$Fold1))
) |>
rbind(data.frame(
predicted = lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + sex + species, data = penguins[-penguinfolds$Fold2, ]) |>
predict(newdata = penguins[penguinfolds$Fold2, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex", "species")]),
obs = penguins[penguinfolds$Fold2, ]$body_mass_g,
fold = rep("2nd quarter", length(penguinfolds$Fold2))
)) |>
rbind(data.frame(
predicted = lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + sex + species, data = penguins[-penguinfolds$Fold3, ]) |>
predict(newdata = penguins[penguinfolds$Fold3, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex", "species")]),
obs = penguins[penguinfolds$Fold3, ]$body_mass_g,
fold = rep("3rd quarter", length(penguinfolds$Fold3))
)) |>
rbind(data.frame(
predicted = lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + sex + species, data = penguins[-penguinfolds$Fold4, ]) |>
predict(newdata = penguins[penguinfolds$Fold4, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex", "species")]),
obs = penguins[penguinfolds$Fold4, ]$body_mass_g,
fold = rep("4th quarter", length(penguinfolds$Fold4))
)) |>
mutate(error = obs - predicted) |>
# mutate(error2 = error^2) |> summarize(sum(error2)) |> pull() |> format(scientific = FALSE)
ggplot(aes(x = predicted, y = error)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_hline(yintercept = 500, lty = 2, color = "grey") +
geom_hline(yintercept = -500, lty = 2, color = "grey") +
facet_wrap(~fold) +
ylim(c(-1000, 1000)) +
labs(
x = "Predicted body mass (g)",
y = "Prediction error (g)"
)
```
```{r}
#| label: peng-mass2-output
#| eval: false
# this code is used to create the powerpoint images above
massCV1 <- data.frame(
predicted = lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + sex + species, data = penguins[-penguinfolds$Fold1, ]) |>
predict(newdata = penguins[
penguinfolds$Fold1,
c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex", "species")
]),
obs = penguins[penguinfolds$Fold1, ]$body_mass_g,
fold = rep("1st quarter", length(penguinfolds$Fold1))
)
lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + sex + species, data = penguins[-penguinfolds$Fold1, ]) |>
tidy() |>
select(term, estimate) |>
mutate(estimate = round(estimate, 2))
massCV1 |>
mutate(error = obs - predicted) |>
ggplot(aes(x = predicted, y = error)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_hline(yintercept = 500, lty = 2, color = "grey") +
geom_hline(yintercept = -500, lty = 2, color = "grey") +
facet_wrap(~fold) +
ylim(c(-1000, 1000)) +
labs(
x = "Predicted body mass (g)",
y = "Prediction error (g)"
) +
theme(text = element_text(size = 20))
```
\clearpage
@fig-peng-mass1 shows that the independent predictions are centered around the true values (i.e., errors are centered around zero), but that the predictions can be as much as 1,000 g off when using only `bill_length_mm` to predict `body_mass_g`.
On the other hand, when using `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species` to predict `body_mass_g`, the prediction errors seem to be about half as big, as seen in @fig-peng-mass2.
For the model using `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species` to predict `body_mass_g`, the CV SSE is 27,728,698 (as compared to a CV SSE of 141,552,822 for the smaller model).
Consistent with visually comparing the two sets of residual plots, the sum of squared prediction errors is smaller for the model which uses more predictor variables.
The model with more predictor variables seems like the better model (according to the cross-validated prediction errors criteria).
We have provided a very brief overview to and example of using cross-validation.
Cross-validation is a computational approach to model building and model validation as an alternative to reliance on p-values.
While p-values have a role to play in understanding model coefficients, throughout this text, we have continued to present computational methods that broaden statistical approaches to data analysis.
Cross-validation will be used again in @sec-inf-model-logistic with logistic regression.
We encourage you to consider both standard inferential methods (such as p-values) and computational approaches (such as cross-validation) as you build and use multivariable models of all varieties.
\clearpage
## Chapter review {#sec-chp25-review}
### Summary
Building on the modeling ideas from @sec-model-mlr, we have now introduced methods for evaluating coefficients (based on p-values) and evaluating models (cross-validation).
There are many important aspects to consider when working with multiple variables in a single model, and we have only glanced at a few topics.
Remember, multicollinearity can make coefficient interpretation difficult.
A topic not covered in this text but important for multiple regression models is interaction, and we hope that you learn more about how variables work together as you continue to build up your modeling skills.
### Terms
The terms introduced in this chapter are presented in @tbl-terms-chp-25.
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
You should be able to easily spot them as **bolded text**.
```{r}
#| label: tbl-terms-chp-25
#| tbl-cap: Terms introduced in this chapter.
#| tbl-pos: H
make_terms_table(terms_chp_25)
```
\clearpage
## Exercises {#sec-chp25-exercises}
Answers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-25].
::: {.exercises data-latex=""}
{{< include exercises/_25-ex-inf-model-mlr.qmd >}}
:::