-
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathmultiple-regression.qmd
406 lines (309 loc) · 22 KB
/
multiple-regression.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
# Multiple Regression {#sec-multipleRegression}
## Getting Started {#sec-multipleRegressionGettingStarted}
### Load Packages {#sec-multipleRegressionLoadPackages}
```{r}
library("petersenlab")
library("tidyverse")
library("knitr")
```
## Overview of Multiple Regression {#sec-multipleRegressionOverview}
Multiple regression examines the association between multiple [predictor variables](#sec-correlationalStudy) and one [outcome variable](#sec-correlationalStudy).
It allows obtaining a more accurate estimate of the unique contribution of a given [predictor variable](#sec-correlationalStudy), by controlling for other variables ([covariates](#sec-covariates)).
Regression with one [predictor variable](#sec-correlationalStudy) takes the form of @eq-regression:
$$
y = \beta_0 + \beta_1x_1 + \epsilon
$$ {#eq-regression}
where $y$ is the [outcome variable](#sec-correlationalStudy), $\beta_0$ is the intercept, $\beta_1$ is the slope, $x_1$ is the [predictor variable](#sec-correlationalStudy), and $\epsilon$ is the error term.
A regression line is depicted in @fig-regression.
```{r}
#| include: false
#| eval: false
set.seed(52242)
regression <- data.frame(outcome = rnorm(40, mean = 5, sd = 2))
regression$predictor <- complement(regression$outcome, .5)
regression$predictor <- regression$predictor + abs(min(regression$predictor))
lm(
outcome ~ predictor,
data = regression)
ggplot2::ggplot(
data = regression,
aes(
x = predictor,
y = outcome,
)
) +
geom_point() +
geom_smooth(
method = "lm",
linewidth = 2,
se = FALSE,
fullrange = TRUE) +
scale_x_continuous(
lim = c(0,8),
breaks = seq(from = 0, to = 8, by = 2),
expand = c(0,0)
) +
scale_y_continuous(
lim = c(0,8),
breaks = seq(from = 0, to = 8, by = 2),
expand = c(0,0)
) +
labs(
x = "Predictor Variable",
y = "Outcome Variable",
title = "Regression Best-Fit Line"
) +
theme_classic(
base_size = 16) +
theme(legend.title = element_blank())
ggsave("./images/regression.pdf", width = 6, height = 6)
```
::: {#fig-regression}
![](images/regression.png){fig-alt="A Regression Best-Fit Line."}
A Regression Best-Fit Line.
:::
Regression with multiple predictors—i.e., multiple regression—takes the form of @eq-multipleRegression:
$$
y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon
$$ {#eq-multipleRegression}
where $p$ is the number of [predictor variables](#sec-correlationalStudy).
## Components {#sec-multipleRegressionComponents}
- $B$ = unstandardized coefficient: direction and magnitude of the estimate (original scale)
- $\beta$ (beta) = standardized coefficient: direction and magnitude of the estimate (standard deviation scale)
- $SE$ = standard error: uncertainty of unstandardized estimate
The unstandardized regression coefficient ($B$) is interpreted such that, for every unit change in the [predictor variable](#sec-correlationalStudy), there is a __ unit change in the [outcome variable](#sec-correlationalStudy).
For instance, when examining the association between age and fantasy points, if the unstandardized regression coefficient is 2.3, players score on average 2.3 more points for each additional year of age.
(In reality, we might expect a nonlinear, inverted-U-shaped association between age and fantasy points such that players tend to reach their peak in the middle of their careers.)
Unstandardized regression coefficients are tied to the metric of the raw data.
Thus, a large unstandardized regression coefficient for two variables may mean completely different things.
Holding the strength of the association constant, you tend to see larger unstandardized regression coefficients for variables with smaller units and smaller unstandardized regression coefficients for variables with larger units.
Standardized regression coefficients can be obtained by standardizing the variables to [*z*-scores](#sec-zScores) so they all have a mean of zero and standard deviation of one.
The standardized regression coefficient ($\beta$) is interpreted such that, for every standard deviation change in the [predictor variable](#sec-correlationalStudy), there is a __ standard deviation change in the [outcome variable](#sec-correlationalStudy).
For instance, when examining the association between age and fantasy points, if the standardized regression coefficient is 0.1, players score on average 0.1 standard deviation more points for each additional standard deviation of their year of age.
Standardized regression coefficients—though not the case in all instances—tend to fall between [−1, 1].
Thus, standardized regression coefficients tend to be more comparable across variables and models compared to unstandardized regression coefficients.
In this way, standardized regression coefficients provide a meaningful index of [effect size](#sec-practicalSignificance).
## Assumptions of Multiple Regression {#sec-assumptionsRegression}
Linear regression models make the following assumptions:
- there is a linear association between the predictor variables and the outcome variable
- there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variables or as a function of the outcome variable
- the residuals are independent; they are uncorrelated with each other
- the residuals are normally distributed
## Coefficient of Determination ($R^2$) {#sec-multipleRegressionRSquared}
The coefficient of determination ($R^2$) reflects the proportion of variance in the [outcome (dependent) variable](#sec-correlationalStudy) that is explained by the model predictions: $R^2 = \frac{\text{variance explained in }Y}{\text{total variance in }Y}$.
Various formulas for $R^2$ are in @eq-rSquared.
Larger $R^2$ values indicate greater accuracy.
Multiple regression can be conceptualized with overlapping circles (similar to a venn diagram), where the non-overlapping portions of the circles reflect nonshared variance and the overlapping portions of the circles reflect shared variance, as in @fig-regression.
::: {#fig-regression}
![](images/multipleRegressionRSquared.png){width=50% fig-alt="Conceptual Depiction of Proportion of Variance Explained ($R^2$) in an Outcome Variable ($Y$) by Multiple Predictors ($X1$ and $X2$) in Multiple Regression. The size of each circle represents the variable's variance. The proportion of variance in $Y$ that is explained by the predictors is depicted by the areas in orange. The dark orange space ($G$) is where multiple predictors explain overlapping variance in the outcome. Overlapping variance that is explained in the outcome ($G$) will not be recovered in the regression coefficients when both predictors are included in the regression model. From @Petersen2024a and @PetersenPrinciplesPsychAssessment."}
Conceptual Depiction of Proportion of Variance Explained ($R^2$) in an Outcome Variable ($Y$) by Multiple Predictors ($X1$ and $X2$) in Multiple Regression. The size of each circle represents the variable's variance. The proportion of variance in $Y$ that is explained by the predictors is depicted by the areas in orange. The dark orange space ($G$) is where multiple predictors explain overlapping variance in the outcome. Overlapping variance that is explained in the outcome ($G$) will not be recovered in the regression coefficients when both predictors are included in the regression model. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
One issue with $R^2$ is that it increases as the number of predictors increases, which can lead to [overfitting](#sec-overfitting) if using $R^2$ as an index to compare models for purposes of selecting the "best-fitting" model.
Consider the following example (adapted from @PetersenPrinciplesPsychAssessment) in which you have one [predictor variable](#sec-correlationalStudy) and one [outcome variable](#sec-correlationalStudy), as shown in @tbl-regression1.
```{r}
#| echo: false
regression1 <- data.frame(
"y" = c(7, 13, 29, 10),
"x1" = c(1, 2, 7, 2))
regression2 <- data.frame(
"y" = c(7, 13, 29, 10),
"x1" = c(1, 2, 7, 2),
"x2" = c(3, 5, 1, 2))
regression1_model <- lm(
y ~ x1,
data = regression1)
regression1_intercept <- apa(regression1_model$coefficients[[1]], decimals = 2)
regression1_slope <- apa(regression1_model$coefficients[[2]], decimals = 2)
regression1_rsquare <- apa(summary(regression1_model)$r.squared, decimals = 2)
regression2_model <- lm(y ~ x1 + x2, data = regression2)
regression2_intercept <- apa(regression2_model$coefficients[[1]], decimals = 2)
regression2_slope1 <- apa(regression2_model$coefficients[[2]], decimals = 2)
regression2_slope2 <- apa(regression2_model$coefficients[[3]], decimals = 2)
regression2_rsquare <- apa(summary(regression2_model)$r.squared, decimals = 2)
```
```{r}
#| label: tbl-regression1
#| tbl-cap: "Example Data of Predictor (x1) and Outcome (y) Used for Regression Model."
#| echo: false
kable(
regression1,
col.names = c("y","x1"),
booktabs = TRUE)
```
Using the data, the best fitting regression model is: $y =$ `{r} regression1_intercept` $+$ `{r} regression1_slope` $\cdot x_1$.
In this example, the $R^2$ is `{r} regression1_rsquare`.
The equation is not a perfect prediction, but with a single [predictor variable](#sec-correlationalStudy), it captures the majority of the variance in the outcome.
Now consider the following example where you add a second [predictor variable](#sec-correlationalStudy) to the data above, as shown in @tbl-regression2.
```{r}
#| label: tbl-regression2
#| tbl-cap: "Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model."
#| echo: false
kable(
regression2,
col.names = c("y","x1","x2"),
booktabs = TRUE)
```
With the second [predictor variable](#sec-correlationalStudy), the best fitting regression model is: $y =$ `{r} regression2_intercept` + `{r} regression2_slope1` $\cdot x_1 +$ `{r} regression2_slope2` $\cdot x_2$.
In this example, the $R^2$ is `{r} regression2_rsquare`.
The equation with the second [predictor variable](#sec-correlationalStudy) provides a perfect prediction of the outcome.
Providing perfect prediction with the right set of [predictor variable](#sec-correlationalStudy)s is the dream of multiple regression.
So, using multiple regression, we often add [predictor variables](#sec-correlationalStudy) to incrementally improve prediction.
Knowing how much variance would be accounted for by random chance follows @eq-predictionByChance:
$$
E(R^2) = \frac{K}{n-1}
$$ {#eq-predictionByChance}
where $E(R^2)$ is the expected value of $R^2$ (the proportion of variance explained), $K$ is the number of [predictor variables](#sec-correlationalStudy), and $n$ is the sample size.
The formula demonstrates that the more [predictor variables](#sec-correlationalStudy) in the regression model, the more variance will be accounted for by chance.
With many [predictor variables](#sec-correlationalStudy) and a small sample, you can account for a large share of the variance merely by chance.
As an example, consider that we have 13 [predictor variables](#sec-correlationalStudy) to predict fantasy performance for 43 players.
Assume that, with 13 [predictor variables](#sec-correlationalStudy), we explain 38% of the variance ($R^2 = .38; r = .62$).
We explained a lot of the variance in the outcome, but it is important to consider how much variance could have been explained by random chance: $E(R^2) = \frac{K}{n-1} = \frac{13}{43 - 1} = .31$.
We expect to explain 31% of the variance, by chance, in the outcome.
So, 82% of the variance explained was likely spurious (i.e., $\frac{.31}{.38} = .82$).
As the sample size increases, the spuriousness decreases.
To account for the number of [predictor variables](#sec-correlationalStudy) in the model, we can use a modified version of $R^2$ called adjusted $R^2$ ($R^2_{adj}$).
Adjusted $R^2$ ($R^2_{adj}$) accounts for the number of [predictor variables](#sec-correlationalStudy) in the model, based on how much would be expected to be accounted for by chance to penalize [overfitting](#sec-overfitting).
Adjusted $R^2$ ($R^2_{adj}$) reflects the proportion of variance in the [outcome (dependent) variable](#sec-correlationalStudy) that is explained by the model predictions over and above what would be expected to be accounted for by chance, given the number of [predictor variables](#sec-correlationalStudy) in the model.
The formula for adjusted $R^2$ ($R^2_{adj}$) is in @eq-adjustedRSquared:
$$
R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}
$$ {#eq-adjustedRSquared}
where $p$ is the number of [predictor variables](#sec-correlationalStudy) in the model, and $n$ is the sample size.
## Overfitting {#sec-overfitting}
Statistical models applied to big data (e.g., data with many [predictor variables](#sec-correlationalStudy)) can *overfit* the data, which means that the statistical model accounts for error variance, which will not generalize to future samples.
So, even though an overfitting statistical model appears to be accurate because it is accounting for more variance, it is not actually that accurate—it will predict new data less accurately than how accurately it accounts for the data with which the model was built.
In the case of fantasy football analytics, this is especially relevant because there are hundreds if not thousands of variables we could consider for inclusion and many, many players when considering historical data.
Consider an example where you develop an algorithm to predict players' fantasy performance based on 2023 data using hundreds of [predictor variables](#sec-correlationalStudy).
To some extent, these [predictor variables](#sec-correlationalStudy) will likely account for true variance (i.e., signal) and error variance (i.e., noise).
If we were to apply the same algorithm based on the 2023 prediction model to 2024 data, the prediction model would likely predict less accurately than with 2023 data.
The regression coefficients in the
In @fig-overfittingModel, the blue line represents the true distribution of the data, and the red line is an overfitting model:
```{r}
#| label: fig-overfittingModel
#| fig-cap: "Over-fitting Model in Red Relative to the True Distribution of the Data in Blue. From @Petersen2024a and @PetersenPrinciplesPsychAssessment."
#| fig-alt: "Over-fitting Model in Red Relative to the True Distribution of the Data in Blue. From @Petersen2024a and @PetersenPrinciplesPsychAssessment."
#| code-fold: true
set.seed(52242)
sampleSize <- 200
quadraticX <- rnorm(sampleSize)
quadraticY <- quadraticX ^ 2 + rnorm(sampleSize)
quadraticData <- cbind(quadraticX, quadraticY) %>%
data.frame %>%
arrange(quadraticX)
quadraticModel <- lm(
quadraticY ~ quadraticX + I(quadraticX ^ 2),
data = quadraticData)
quadraticNewData <- data.frame(
quadraticX = seq(
from = min(quadraticData$quadraticX),
to = max(quadraticData$quadraticY),
length.out = sampleSize))
quadraticNewData$quadraticY <- predict(
quadraticModel,
newdata = quadraticNewData)
loessFit <- loess(
quadraticY ~ quadraticX,
data = quadraticData,
span = 0.01,
degree = 1)
loessNewData <- data.frame(
quadraticX = seq(
from = min(quadraticData$quadraticX),
to = max(quadraticData$quadraticY),
length.out = sampleSize))
quadraticNewData$loessY <- predict(
loessFit,
newdata = quadraticNewData)
plot(
x = quadraticData$quadraticX,
y = quadraticData$quadraticY,
xlab = "",
ylab = "")
lines(
quadraticNewData$quadraticY ~ quadraticNewData$quadraticX,
lwd = 2,
col = "blue")
lines(
quadraticNewData$loessY ~ quadraticNewData$quadraticX,
lwd = 2,
col = "red")
```
## Covariates {#sec-covariates}
Covariates are variables that you include in the statistical model to try to control for them so you can better isolate the unique contribution of the [predictor variable](#sec-correlationalStudy)(s) in relation to the [outcome variable](#sec-correlationalStudy).
Use of covariates examines the association between the [predictor variable](#sec-correlationalStudy) and the [outcome variable](#sec-correlationalStudy) when holding people's level constant on the covariates.
Inclusion of confounds as covariates allows potentially gaining a more accurate estimate of the causal effect of the [predictor variable](#sec-correlationalStudy) on the [outcome variable](#sec-correlationalStudy).
Ideally, you want to include any and all confounds as covariates.
As described in @sec-correlationCausation, confounds are third variables that influence both the [predictor variable](#sec-correlationalStudy) and the [outcome variable](#sec-correlationalStudy) and explain their association.
Covariates are potentially (but not necessarily) confounds.
For instance, you might include the player's age as a covariate in a model that examines whether a player's 40-yard dash time at the NFL Combine predicts their fantasy points in their rookie year, but it may not be a confound.
## Multicollinearity {#sec-multipleRegressionMulticollinearity}
*Multicollinearity* occurs when two or more [predictor variables](#sec-correlationalStudy) in a regression model are highly correlated.
The problem of having multiple [predictor variables](#sec-correlationalStudy) that are highly correlated is that it makes it challenging to estimate the regression coefficients accurately.
Multicollinearity in multiple regression is depicted conceptually in @fig-regression.
::: {#fig-regression}
![](images/multipleRegressionMulticollinearity.png){width=50% fig-alt="Conceptual Depiction of Multicollinearity in Multiple Regression. From @Petersen2024a and @PetersenPrinciplesPsychAssessment."}
Conceptual Depiction of Multicollinearity in Multiple Regression. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
Consider the following example adapted from @PetersenPrinciplesPsychAssessment where you have two [predictor variables](#sec-correlationalStudy) and one [outcome variable](#sec-correlationalStudy), as shown in @tbl-regression3.
```{r}
#| echo: false
regression3 <- data.frame(
"y" = c(9, 11, 17, 3, 21, 13),
"x1" = c(2, 3, 4, 1, 5, 3.5),
"x2" = c(4, 6, 8, 2, 10, 7))
```
```{r}
#| label: tbl-regression3
#| tbl-cap: "Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model."
#| echo: false
kable(
regression3,
col.names = c("y","x1","x2"),
booktabs = TRUE)
```
The second [predictor variable](#sec-correlationalStudy) is not very good—it is exactly twice the value of the first [predictor variable](#sec-correlationalStudy); thus, the two [predictor variables](#sec-correlationalStudy) are perfectly correlated (i.e., $r = 1.0$).
This means that there are different prediction equation possibilities that are equally good—see Equations in @eq-multicollinearity:
$$
\begin{aligned}
2x_2 &= y \\
0x_1 + 2x_2 &= y \\
4x_1 &= y \\
4x_1 + 0x_2 &= y \\
2x_1 + 1x_2 &= y \\
5x_1 - 0.5x_2 &= y \\
...
&= y
\end{aligned}
$$ {#eq-multicollinearity}
Then, what are the regression coefficients?
We do not know what are the correct regression coefficients because each of the possibilities fits the data equally well.
Thus, when estimating the regression model, we could obtain arbitrary estimates of the regression coefficients with an enormous standard error around each estimate.
In general, multicollinearity increases the uncertainty (i.e., standard errors and confidence intervals) around the parameter estimates.
Any [predictor variables](#sec-correlationalStudy) that have a correlation above ~ $r = .30$ with each other could have an impact on the confidence interval of the regression coefficient.
As the correlations among the [predictor variables](#sec-correlationalStudy) increase, the chance of getting an arbitrary answer increases, sometimes called "bouncing betas."
So, it is important to examine a correlation matrix of the [predictor variables](#sec-correlationalStudy) before putting them in the same regression model.
You can also examine indices such as variance inflation factor (VIF).
To address multicollinearity, you can drop a redundant predictor or you can also use principal component analysis or factor analysis of the predictors to reduce the predictors down to a smaller number of meaningful predictors.
For a meaningful answer in a regression framework that is precise and confident, you need a low level of intercorrelation among predictors, unless you have a very large sample size.
## Impact of Oultiers {#sec-multipleRegressionOutliers}
[As with correlation](#sec-correlationOutliers), multiple regression can be strongly impacted by outliers.
## Moderated Multiple Regression {#sec-moderatedMultipleRegression}
When examining moderation in multiple regression, several steps are important:
- When computing the interaction term, first mean-center the predictor variables.
Calculate the interaction term as the multiplication of the mean-centered predictor variables.
Mean-centering the predictor variables when computing the interaction term is important for addressing issues regarding [multicollinearity](#sec-multipleRegressionMulticollinearity) [@Iacobucci2016].
- When including an interaction term in the model, make sure also to include the main effects.
## Mediation {#sec-multipleRegressionMediation}
## Bayesian Multiple Regression {#sec-multipleRegressionBayesian}
## Conclusion {#sec-multipleRegressionConclusion}
Multiple regression allows examining the association between multiple [predictor variables](#sec-correlationalStudy) and one [outcome variable](#sec-correlationalStudy).
Inclusion of multiple predictors in the model allows for potentially greater predictive accuracy and identification of the extent to which each variable uniquely contributes to the outcome variable.
As with [correlation](#sec-correlation), an association does not imply causation.
However, identifying associations is important because associations are a necessary (but insufficient) condition for causality.
When developing a multiple regression model, it is important to pay attention for potential [multicollinearity](#sec-multipleRegressionMulticollinearity)—it may become difficult to detect a given [predictor variable](#sec-correlationalStudy) as [statistically significant](#sec-statisticalSignificance) due to the greater uncertainty around the parameter estimates.
::: {.content-visible when-format="html"}
## Session Info {#sec-multipleRegressionSessionInfo}
```{r}
sessionInfo()
```
:::