# Inference for logistic regression {#sec-inf-model-logistic}
```{r}
#| include: false
source("_common.R")
```
::: {.chapterintro data-latex=""}
Combining ideas from [Chapter -@sec-model-logistic] on logistic regression, [Chapter -@sec-foundations-mathematical] on inference with mathematical models, and [Chapter -@sec-inf-model-slr] and [Chapter -@sec-inf-model-mlr], which apply inferential techniques to the linear model, we wrap up the book by considering inferential methods applied to a logistic regression model.
Additionally, we use cross-validation as a method for independent assessment of the logistic regression model.
:::
```{r}
#| include: false
terms_chp_26 <- c("inference on logistic regression")
```
As with multiple linear regression, the inference aspect for logistic regression will focus on interpretation of coefficients and relationships between explanatory variables.
Both p-values and cross-validation will be used for assessing a logistic regression model.
Consider the `email` data, which describes characteristics of emails that can be used to predict whether a particular incoming email is spam (unsolicited bulk email).
Without reading every incoming message, it might be nice to have an automated way to identify spam emails.
Which of the variables describing each email are important for predicting the status of the email?
::: {.data data-latex=""}
The [`email`](http://openintrostat.github.io/openintro/reference/email.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
```{r}
#| label: tbl-email-variables
#| tbl-cap: |
#| Variables and their descriptions for the `email` dataset. Many of the
#| variables are indicator variables, meaning they take the value 1 if the
#| specified characteristic is present and 0 otherwise.
email_variables <- tribble(
~variable, ~description,
"spam", "Indicator for whether the email was spam.",
"to_multiple", "Indicator for whether the email was addressed to more than one recipient.",
"from", "Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).",
"cc", "Number of people cc'ed.",
"sent_email", "Indicator for whether the sender had been sent an email in the last 30 days.",
"attach", "The number of attached files.",
"dollar", "The number of times a dollar sign or the word “dollar” appeared in the email.",
"winner", "Indicates whether “winner” appeared in the email.",
"format", "Indicates whether the email was written using HTML (e.g., may have included bolding or active links).",
"re_subj", "Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”",
"exclaim_subj", "Whether there was an exclamation point in the subject.",
"urgent_subj", "Whether the word “urgent” was in the email subject.",
"exclaim_mess", "The number of exclamation points in the email message.",
"number", "Factor variable saying whether there was no number, a small number (under 1 million), or a big number."
)
email_variables |>
kbl(
linesep = "", booktabs = TRUE,
col.names = c("Variable", "Description")
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped"), full_width = TRUE
) |>
column_spec(1, monospace = TRUE) |>
column_spec(2, width = "30em")
```
## Model diagnostics
Before looking at the hypothesis tests associated with the coefficients (it turns out they are very similar to those in linear regression!), it is valuable to understand the technical conditions that underlie the inference applied to the logistic regression model.
Generally, as you've seen in the logistic regression modeling examples, it is imperative that the response variable is binary.
Additionally, the key technical condition for logistic regression has to do with the relationship between the predictor variables (the $x_i$ values) and the probability the outcome will be a success.
It turns out that the relationship takes a specific functional form called the logit function, where ${\rm logit}(p) = \log_e\big(\frac{p}{1-p}\big).$ The function may feel complicated, and memorizing the formula of the logit is not necessary for understanding logistic regression.
What you do need to remember is that the probability of the outcome being a success is a function of a linear combination of the explanatory variables.
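To get a feel for this, here is a minimal numeric sketch of the logit and its inverse (illustrative arithmetic only, with values rounded; this is not output from the models in this chapter):

```r
# The logit maps probabilities in (0, 1) to the entire real line.
logit <- function(p) log(p / (1 - p))
logit(c(0.1, 0.5, 0.9))
#> [1] -2.197  0.000  2.197

# Its inverse maps any linear combination of predictors back to a probability.
inv_logit <- function(x) exp(x) / (1 + exp(x))
inv_logit(c(-2.197, 0, 2.197))
#> [1] 0.1 0.5 0.9
```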
::: {.important data-latex=""}
**Logistic regression conditions.**
\index{technical conditions logistic regression}
There are two key conditions for fitting a logistic regression model:
1. Each outcome $Y_i$ is independent of the other outcomes.
2. Each predictor $x_i$ is linearly related to logit$(p_i)$ if all other predictors are held constant.
:::
```{r}
#| include: false
terms_chp_26 <- c(terms_chp_26, "technical conditions")
```
The first logistic regression model condition --- independence of the outcomes --- is reasonable if we can assume that the emails that arrive in an inbox within a few months are independent of each other with respect to whether they're spam or not.
The second condition of the logistic regression model is not easily checked without a fairly sizable amount of data.
Luckily, we have `r nrow(email)` emails in the dataset!
Let's first visualize these data by plotting the true classification of the emails against the model's fitted probabilities, as shown in @fig-spam-predict.
```{r}
#| label: fig-spam-predict
#| fig-cap: |
#| The predicted probability that each of the 3921 emails is spam. Points
#| have been jittered so that those with nearly identical values aren’t plotted
#| exactly on top of one another.
#| fig-alt: |
#| Scatterplot with predicted probability of being spam on the x-axis and
#| original class, either spam or not spam, on the y-axis. We can see that
#| the predictions do not perfectly classify the emails; however, the not spam
#| emails generally have lower prediction probabilities than the spam emails.
#| fig-asp: 0.43
#| fig-width: 7
spam_fit <- logistic_reg() |>
set_engine("glm") |>
fit(spam ~ to_multiple + cc + dollar + urgent_subj, data = email)
spam_pred <- predict(spam_fit, new_data = email, type = "prob") |>
bind_cols(email |> select(spam))
ggplot(spam_pred, aes(x = .pred_1, y = spam)) +
geom_jitter(alpha = 0.2) +
coord_cartesian(xlim = c(0, 1)) +
labs(
y = NULL,
x = "Predicted probability that email is spam"
) +
scale_y_discrete(labels = c("Not spam (0)", "Spam (1)"))
```
We'd like to assess the quality of the model.
For example, we might ask: if we look at emails that we modeled as having a 10% chance of being spam, do we find that about 10% of them actually are spam?
We can check this for groups of the data by constructing a plot as follows:
1. Bucket the observations into groups based on their predicted probabilities.
2. Compute the average predicted probability for each group.
3. Compute the observed probability for each group, along with a 95% confidence interval for the true probability of success for those individuals.
4. Plot the observed probabilities (with 95% confidence intervals) against the average predicted probabilities for each group.
If the model does a good job describing the data, the plotted points should fall close to the line $y = x$, since the predicted probabilities should be similar to the observed probabilities.
We can use the confidence intervals to roughly gauge whether anything might be amiss.
Such a plot is shown in @fig-logisticModelBucketDiag.
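Here is a compact tidyverse sketch of the four steps above. It is a simplified version (20 equal-size buckets rather than the finer quantile-based scheme used for the figure below) and assumes, as in the **openintro** package, that `spam` is a factor with levels `"0"` and `"1"`:

```r
library(openintro) # email data
library(tidyverse)

m <- glm(spam ~ to_multiple + cc + dollar + urgent_subj,
  data = email, family = binomial
)

buckets <- email |>
  mutate(
    p_hat = predict(m, type = "response"),
    bucket = ntile(p_hat, 20) # step 1: group by predicted probability
  ) |>
  group_by(bucket) |>
  summarize(
    avg_pred = mean(p_hat), # step 2: average predicted probability
    obs_prop = mean(spam == "1"), # step 3: observed probability ...
    se = sqrt(obs_prop * (1 - obs_prop) / n()),
    lower = obs_prop - 1.96 * se, # ... with a 95% confidence interval
    upper = obs_prop + 1.96 * se
  )

# step 4: observed vs. average predicted probabilities, with y = x overlaid
ggplot(buckets, aes(x = avg_pred, y = obs_prop)) +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")
```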
```{r}
#| label: fig-logisticModelBucketDiag
#| fig-cap: |
#| A reconfiguration of @fig-spam-predict. Again, the predicted probabilities
#| are on the x-axis and the truth is on the y-axis for each email. After data
#| have been bucketed into predicted probability groups, the proportion of spam
#| emails (i.e., the observed probability) is given by the black circles. The
#| dashed line falls within the 95% confidence intervals for many of the
#| buckets, suggesting the logistic fit is reasonable.
#| fig-alt: |
#| Scatterplot with predicted probability of being spam on the x-axis and
#| original class, either spam or not spam, on the y-axis. Superimposed are
#| the observed probabilities of being spam, calculated by the proportion of
#| spam emails in each bucket of predicted probabilities. The observed
#| probabilities and predicted probabilities line up reasonably well.
#| out-width: 90%
#| fig-asp: 0.6
par_og <- par(no.readonly = TRUE) # save original par
par(mar = c(5.1, 7, 0, 0))
m <- glm(spam ~ to_multiple + cc + dollar + urgent_subj, data = email, family = binomial)
p <- predict(m, type = "response")
set.seed(1)
noise <- rnorm(nrow(email), sd = 0.08)
ns1 <- 4
plot(p, as.numeric(email$spam) - 1 + noise / 5,
type = "n",
xlim = 0:1,
ylim = c(-0.07, 1.07),
axes = FALSE,
xlab = "Predicted Probability",
ylab = ""
)
par(las = 0)
mtext("Truth", 2, 5.5)
par(las = 1)
rect(0, 0, 1, 1,
border = IMSCOL["gray", "full"],
col = "#00000000",
lwd = 1.5
)
lines(0:1, 0:1,
lty = 2,
col = IMSCOL["gray", "full"],
lwd = 1.5
)
points(p, as.numeric(email$spam) - 1 + noise / 5,
col = IMSCOL["blue", "f5"],
pch = 20
)
axis(1)
at <- seq(0, 1, length.out = 6)
labels <- c(
"0 (Not spam)",
"0.2 ",
"0.4 ",
"0.6 ",
"0.8 ",
"1 (Spam)"
)
axis(2, at, labels)
eps <- 1e-4
bucket_breaks <- quantile(p, seq(0, 1, 0.01))
bucket_breaks[1] <- bucket_breaks[1] - eps
n_buckets <- length(bucket_breaks) - 1
bucket_breaks[n_buckets + 1] <- bucket_breaks[n_buckets + 1] + 1e3 * eps # bump the top break so the maximum prediction is captured
bucket_breaks. <- bucket_breaks
k <- 1
for (i in 1:n_buckets) {
if (abs(bucket_breaks.[i] - bucket_breaks[k]) >= 0.01) {
k <- k + 1
bucket_breaks[k] <- bucket_breaks.[i]
}
}
bucket_breaks <- bucket_breaks[1:k]
n_buckets <- length(bucket_breaks)
xp <- rep(NA, n_buckets)
yp <- rep(NA, n_buckets)
yp_lower <- rep(NA, n_buckets)
yp_upper <- rep(NA, n_buckets)
zs <- qnorm(0.975)
for (i in 1:(n_buckets - 1)) { # n_buckets breaks define n_buckets - 1 buckets
  these <- bucket_breaks[i] < p & p <= bucket_breaks[i + 1]
  xp[i] <- mean(p[these]) # average predicted probability in the bucket
  y <- (as.numeric(email$spam) - 1)[these]
  yp[i] <- mean(y) # observed proportion of spam in the bucket
  yp_lower[i] <- yp[i] - zs * sqrt(yp[i] * (1 - yp[i]) / length(y)) # 95% CI
  yp_upper[i] <- yp[i] + zs * sqrt(yp[i] * (1 - yp[i]) / length(y))
}
points(xp, yp, pch = 19)
segments(xp, yp_lower, xp, yp_upper)
arrows(0.3, 0.17,
0.24, 0.22,
length = 0.07
)
text(0.3, 0.15,
paste("Observations are bucketed, then we compute",
"the observed probability in each bucket (y)",
"against the average predicted probability (x)",
"for each of the buckets with 95% confidence intervals.",
sep = "\n"
),
cex = 0.85, pos = 4
)
par(par_og) # restore original par
```
A plot like @fig-logisticModelBucketDiag helps to better understand the deviations.
Additional diagnostics may be created that are similar to those featured in @sec-tech-cond-linmod.
For instance, we could compute residuals as the observed outcome minus the expected outcome ($e_i = Y_i - \hat{p}_i$), and then we could create plots of these residuals against each predictor.
\index{logistic regression}
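A minimal sketch of that residual computation, assuming the model object `m` fit in the sketch above:

```r
# Raw residuals e_i = Y_i - p_hat_i, plotted against one of the predictors.
email_resid <- email |>
  mutate(
    p_hat = predict(m, type = "response"),
    e = (as.numeric(spam) - 1) - p_hat # observed (0/1) minus expected
  )

ggplot(email_resid, aes(x = dollar, y = e)) +
  geom_jitter(alpha = 0.2, height = 0)
```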
## Multiple logistic regression output from software {#sec-inf-log-reg-soft}
As you learned in @sec-model-logistic, optimization can be used to find the coefficient estimates for the logistic model.
The unknown population model can be written as:
$$
\begin{aligned}
\log_e\bigg(\frac{p}{1-p}\bigg) = \beta_0 &+ \beta_1 \times \texttt{to\_multiple}\\
&+ \beta_2 \times \texttt{cc} \\
&+ \beta_3 \times \texttt{dollar}\\
&+ \beta_4 \times \texttt{urgent\_subj}
\end{aligned}
$$
The estimated equation for the regression model may be written as a model with four predictor variables, where $\hat{p}$ is the estimated probability of being a spam email message:
$$
\begin{aligned}
\log_e\bigg(\frac{\hat{p}}{1-\hat{p}}\bigg) = -2.05 &-1.91 \times \texttt{to\_multiple}\\
&+ 0.02 \times \texttt{cc} \\
&- 0.07 \times \texttt{dollar}\\
&+ 2.66 \times \texttt{urgent\_subj}
\end{aligned}
$$
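For instance, consider a hypothetical email addressed to a single recipient with no cc's, no dollar mentions, and no "urgent" subject, so all four predictors equal zero. A quick arithmetic check (using the rounded coefficients above, so the result is approximate):

```r
# With all four predictors at zero, the log-odds reduce to the intercept.
log_odds <- -2.05
exp(log_odds) / (1 + exp(log_odds)) # predicted probability of spam, about 0.11
```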
```{r}
#| label: tbl-emaillogmodel
#| tbl-cap: |
#| Summary of a logistic model for predicting whether an email is spam based
#| on the variables to_multiple, cc, dollar, and urgent_subj. Each of the
#| variables has its own coefficient estimate and p-value.
#| tbl-pos: H
glm(spam ~ to_multiple + cc + dollar + urgent_subj, data = email, family = "binomial") |>
tidy() |>
mutate(p.value = ifelse(p.value < .0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "15em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
Not only does @tbl-emaillogmodel provide the estimates for the coefficients, it also provides information on the inference analysis (i.e., hypothesis testing) which is the focus of this chapter.
As in @sec-inf-model-mlr, with **multiple predictors**\index{variable!multiple predictors}\index{multiple predictors}, each hypothesis test (for each of the explanatory variables) is conditioned on each of the other variables remaining in the model.
```{r}
#| include: false
terms_chp_26 <- c(terms_chp_26, "multiple predictors")
```
> When there are multiple predictors, each hypothesis test is $H_0: \beta_i = 0$, given that the other variables are included in the model.
Using the example above and focusing on each of the variable p-values (we won't discuss the p-value associated with the intercept here), we can write out the four hypotheses, one for each of the coefficient rows in @tbl-emaillogmodel:
- $H_0: \beta_1 = 0$ given `cc`, `dollar`, and `urgent_subj` are included in the model
- $H_0: \beta_2 = 0$ given `to_multiple`, `dollar`, and `urgent_subj` are included in the model
- $H_0: \beta_3 = 0$ given `to_multiple`, `cc`, and `urgent_subj` are included in the model
- $H_0: \beta_4 = 0$ given `to_multiple`, `cc`, and `dollar` are included in the model
The very low p-values from the software output tell us that three of the variables (that is, all but `cc`) act as statistically discernible predictors in the model at the 0.05 discernibility level, even with the other variables included in the model.
Consider the p-value on $H_0: \beta_1 = 0$.
The low p-value says that it would be extremely unlikely to observe data that yield a coefficient on `to_multiple` at least as far from 0 as -1.91 (i.e., $|b_1| > 1.91$) if the true relationship between `to_multiple` and `spam` was non-existent (i.e., if $\beta_1 = 0$) **and** the model also included `cc` and `dollar` and `urgent_subj`.
Note also that the coefficient on `dollar` has a small associated p-value, but the magnitude of the coefficient is also seemingly small (0.07).
It turns out that in units of standard errors (0.02 here), 0.07 is actually quite far from zero; it's all about context!
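As a quick check of that claim, the test statistic behind the p-value is the estimate divided by its standard error (using the rounded values reported in @tbl-emaillogmodel, so the result is approximate):

```r
z <- -0.07 / 0.02 # about 3.5 standard errors below zero
2 * pnorm(-abs(z)) # two-sided p-value, roughly 0.0005
```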
The p-values on the remaining variables are interpreted similarly.
From the initial output (p-values) in @tbl-emaillogmodel, it seems as though `to_multiple`, `dollar`, and `urgent_subj` are important variables for modeling whether an email is `spam`.
We remind you that although p-values provide some information about the importance of each of the predictors in the model, there are many, arguably more important, aspects to consider when choosing the best model.
As with linear regression (see @sec-inf-mult-reg-collin), the existence of predictors that are correlated with each other can affect both the coefficient estimates and the associated p-values.
However, investigating multicollinearity in a logistic regression model is saved for a text which provides more detail about logistic regression.
Next, as a model building alternative (or enhancement) to p-values, we revisit cross-validation within the context of predicting status for each of the individual emails.
## Cross-validation for prediction error {#sec-inf-log-reg-cv}
The p-value is a probability measure computed under a setting of no relationship.
That p-value provides information about the strength of evidence for a relationship (e.g., above we measured the relationship between `spam` and `to_multiple` using a p-value), but the p-value does not measure how well the model will predict the individual emails (e.g., the accuracy of the model).
Depending on the goal of the research project, you might be inclined to focus on variable importance (through p-values) or you might be inclined to focus on prediction accuracy (through cross-validation).
Here we present a method for using **cross-validation**\index{cross-validation} **accuracy**\index{accuracy} to determine which variables (if any) should be used in a model which predicts whether an email is spam.
A full treatment of cross-validation and logistic regression models is beyond the scope of this text.
Using $k$-fold cross-validation, we can build $k$ different models which are used to predict the observations in each of the $k$ holdout samples.
As with linear regression (see @sec-inf-mult-reg-cv), we compare a smaller logistic regression model to a larger logistic regression model.
The smaller model uses only the `to_multiple` variable; see the complete dataset (not cross-validated) model output in @tbl-emaillogmodel1.
The logistic regression model can be written as follows, where $\hat{p}$ is the estimated probability of being a spam email message.
```{r}
#| include: false
terms_chp_26 <- c(terms_chp_26, "cross-validation", "accuracy")
```
**The smaller model:**
\vspace{-5mm}
$$\log_e\bigg(\frac{\hat{p}}{1-\hat{p}}\bigg) = -2.12 - 1.81 \times \texttt{to\_multiple}$$
```{r}
#| label: tbl-emaillogmodel1
#| tbl-cap: |
#| The smaller model. Summary of a logistic model for predicting whether an
#| email is spam based on only the predictor variable to_multiple. Each of
#| the variables has its own coefficient estimate and p-value.
#| tbl-pos: H
glm(spam ~ to_multiple, data = email, family = "binomial") |>
tidy() |>
mutate(p.value = ifelse(p.value < .0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "15em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
For each cross-validated model, the coefficients change slightly, and the model is used to make independent predictions on the holdout sample.
The model from the first cross-validation sample is given in @fig-emailCV1 and can be compared to the coefficients in @tbl-emaillogmodel1.
```{r}
#| label: fig-emailCV1
#| fig-cap: |
#| The smaller model. The coefficients are estimated by fitting a logistic
#| regression model to 3/4 of the data with a single predictor variable.
#| Predictions are made on the remaining 1/4 of the observations. Note that
#| the prediction error rate is quite high.
#| fig-alt: |
#| The left panel shows the logistic model predicting the probability of an
#| email being spam as a function of whether the email was sent to multiple
#| individuals; the model was built using the red, green, and yellow triangular
#| sections of the observed data. The right panel shows a confusion matrix of
#| the predicted spam label crossed with the observed spam label for the set of
#| observations in the blue triangular section of the observed data. Of the 83
#| spam emails, 80 were correctly classified as spam. Of the 897 non-spam
#| emails, 143 were correctly classified as not spam.
#| out-width: 90%
include_graphics("images/emailCV1.png")
```
```{r}
#| echo: false
set.seed(470)
data(email)
emailfolds <- caret::createFolds(email$spam, k = 4)
logCV1 <- data.frame(
predicted = glm(spam ~ to_multiple, data = email[-emailfolds$Fold1, ], family = "binomial") |>
predict(newdata = email[emailfolds$Fold1, c("to_multiple")], type = "response"),
obs = email[emailfolds$Fold1, ]$spam,
fold = rep("1st quarter", length(emailfolds$Fold1))
) |>
rbind(data.frame(
predicted = glm(spam ~ to_multiple, data = email[-emailfolds$Fold2, ], family = "binomial") |>
predict(newdata = email[emailfolds$Fold2, c("to_multiple")], type = "response"),
obs = email[emailfolds$Fold2, ]$spam,
fold = rep("2nd quarter", length(emailfolds$Fold2))
)) |>
rbind(data.frame(
predicted = glm(spam ~ to_multiple, data = email[-emailfolds$Fold3, ], family = "binomial") |>
predict(newdata = email[emailfolds$Fold3, c("to_multiple")], type = "response"),
obs = email[emailfolds$Fold3, ]$spam,
fold = rep("3rd quarter", length(emailfolds$Fold3))
)) |>
rbind(data.frame(
predicted = glm(spam ~ to_multiple, data = email[-emailfolds$Fold4, ], family = "binomial") |>
predict(newdata = email[emailfolds$Fold4, c("to_multiple")], type = "response"),
obs = email[emailfolds$Fold4, ]$spam,
fold = rep("4th quarter", length(emailfolds$Fold4))
)) |>
mutate(predspam = ifelse(predicted <= 0.1, 0, 1))
```
```{r}
#| label: tbl-email-spam
#| tbl-cap: |
#| The smaller model. One quarter at a time, the data were removed from the
#| model building, and whether the email was spam (TRUE) or not (FALSE) was
#| predicted. The logistic regression model was fit independently of the
#| removed emails. Only `to_multiple` is used to predict whether the email is
#| spam. Because we used a cutoff designed to identify spam emails, the
#| accuracy of the non-spam email predictions is very low. `spamTP` is the
#| proportion of true spam emails that were predicted to be spam. `notspamTP`
#| is the proportion of true not spam emails that were predicted to be not spam.
#| tbl-pos: H
logCV1 |>
group_by(fold) |>
summarize(
count = n(),
accuracy = sum(obs == predspam) / n(),
notspamTP = sum(obs == 0 & predspam == 0) / sum(obs == 0),
spamTP = sum(obs == 1 & predspam == 1) / sum(obs == 1)
) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped"), full_width = FALSE
)
```
Because the `email` dataset has a ratio of roughly 90% non-spam and 10% spam emails, a model which simply predicted every email to be non-spam would have an overall accuracy of 90%!
Clearly, we would like to capture the information in the spam emails, so our interest is in the percent of spam emails that are identified as spam (see @tbl-email-spam).
Additionally, in the logistic regression model, we use a 10% cutoff to predict whether the email is spam.
Fortunately, we have done a great job of predicting the spam emails!
However, the trade-off was that most of the non-spam emails are now also predicted to be spam, which is not acceptable for a prediction algorithm.
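To make the mechanics concrete, here is a single-fold sketch of the computation behind @tbl-email-spam, assuming the **caret** package for constructing the folds (the table itself aggregates all four folds):

```r
library(caret)
library(openintro)

set.seed(470)
folds <- createFolds(email$spam, k = 4)

# Fit on 3/4 of the data, predict on the held-out quarter.
fit <- glm(spam ~ to_multiple,
  data = email[-folds$Fold1, ], family = binomial
)
p_hat <- predict(fit, newdata = email[folds$Fold1, ], type = "response")
pred <- ifelse(p_hat <= 0.1, 0, 1) # 10% cutoff for calling an email spam
obs <- as.numeric(email[folds$Fold1, ]$spam) - 1

mean(obs == pred) # overall accuracy
mean(pred[obs == 1] == 1) # spamTP: true spam predicted as spam
mean(pred[obs == 0] == 0) # notspamTP: true non-spam predicted as non-spam
```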
Adding more variables to the model may help with both the spam and non-spam predictions.
The larger model uses `to_multiple`, `attach`, `winner`, `format`, `re_subj`, `exclaim_mess`, and `number` as the set of seven predictor variables; see the complete dataset (not cross-validated) model output in @tbl-emaillogmodel2.
The logistic regression model can be written as follows, where $\hat{p}$ is the estimated probability of being a spam email message.
\vspace{5mm}
**The larger model:**
$$
\begin{aligned}
\log_e\bigg(\frac{\hat{p}}{1-\hat{p}}\bigg) = -0.34 &- 2.56 \times \texttt{to\_multiple} + 0.20 \times \texttt{attach} + 1.73 \times \texttt{winner}_{yes} \\
&- 1.28 \times \texttt{format} - 2.86 \times \texttt{re\_subj} + 0.00 \times \texttt{exclaim\_mess} \\
&- 1.07 \times \texttt{number}_{small} - 0.42 \times \texttt{number}_{big}
\end{aligned}
$$
```{r}
#| label: tbl-emaillogmodel2
#| tbl-cap: |
#| The larger model. Summary of a logistic model for predicting whether an
#| email is spam based on the variables `to_multiple`, `attach`, `winner`,
#| `format`, `re_subj`, `exclaim_mess`, and `number`. Each of the variables
#| has its own coefficient estimate and p-value.
#| tbl-pos: H
glm(spam ~ to_multiple + attach + winner + format + re_subj + exclaim_mess + number, data = email, family = "binomial") |>
tidy() |>
mutate(p.value = ifelse(p.value < .0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "15em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
```{r}
#| label: fig-emailCV2
#| fig-cap: |
#| The larger model. The coefficients are estimated by fitting a logistic
#| regression model to 3/4 of the dataset with the seven specified predictor
#| variables. Predictions are made on the remaining 1/4 of the observations,
#| which are independent of the data used to estimate the model coefficients.
#| The predictions are now much better for both the spam and the non-spam
#| emails (than they were with a single predictor variable).
#| fig-alt: |
#| The left panel shows the logistic model predicting the probability of an
#| email being spam as a function of the email being sent to multiple
#| recipients, number of attachments, using the word winner, format of HTML,
#| RE in the subject line, number of exclamation points in the message, and
#| the existence of a number in the email; the model was built using the red,
#| green, and yellow triangular sections of the observed data. The right panel
#| shows a confusion matrix of the predicted spam label crossed with the
#| observed spam label for the set of observations in the blue triangular
#| section of the observed data. Of the 97 spam emails, 73 were correctly
#| classified as spam. Of the 883 non-spam emails, 688 were correctly classified
#| as not spam.
#| out-width: 100%
include_graphics("images/emailCV2.png")
```
```{r}
set.seed(470)
emailfolds <- caret::createFolds(email$spam, k = 4)
logCV2 <- data.frame(
predicted = glm(spam ~ to_multiple + attach + winner + format + re_subj + exclaim_mess + number, data = email[-emailfolds$Fold1, ], family = "binomial") |>
predict(newdata = email[emailfolds$Fold1, c("to_multiple", "attach", "winner", "format", "re_subj", "exclaim_mess", "number")], type = "response"),
obs = email[emailfolds$Fold1, ]$spam,
fold = rep("1st quarter", length(emailfolds$Fold1))
) |>
rbind(data.frame(
predicted = glm(spam ~ to_multiple + attach + winner + format + re_subj + exclaim_mess + number, data = email[-emailfolds$Fold2, ], family = "binomial") |>
predict(newdata = email[emailfolds$Fold2, c("to_multiple", "attach", "winner", "format", "re_subj", "exclaim_mess", "number")], type = "response"),
obs = email[emailfolds$Fold2, ]$spam,
fold = rep("2nd quarter", length(emailfolds$Fold2))
)) |>
rbind(data.frame(
predicted = glm(spam ~ to_multiple + attach + winner + format + re_subj + exclaim_mess + number, data = email[-emailfolds$Fold3, ], family = "binomial") |>
predict(newdata = email[emailfolds$Fold3, c("to_multiple", "attach", "winner", "format", "re_subj", "exclaim_mess", "number")], type = "response"),
obs = email[emailfolds$Fold3, ]$spam,
fold = rep("3rd quarter", length(emailfolds$Fold3))
)) |>
rbind(data.frame(
predicted = glm(spam ~ to_multiple + attach + winner + format + re_subj + exclaim_mess + number, data = email[-emailfolds$Fold4, ], family = "binomial") |>
predict(newdata = email[emailfolds$Fold4, c("to_multiple", "attach", "winner", "format", "re_subj", "exclaim_mess", "number")], type = "response"),
obs = email[emailfolds$Fold4, ]$spam,
fold = rep("4th quarter", length(emailfolds$Fold4))
)) |>
mutate(predspam = ifelse(predicted <= 0.1, 0, 1))
```
```{r}
#| label: tbl-email-spam2
#| tbl-cap: |
#| The larger model. One quarter at a time, the data were removed from the
#| model building, and whether the email was spam (TRUE) or not (FALSE) was
#| predicted. The logistic regression model was fit independently of the
#| removed emails. Now, the variables `to_multiple`, `attach`, `winner`,
#| `format`, `re_subj`, `exclaim_mess`, and `number` are used to predict
#| whether the email is spam. `spamTP` is the proportion of true spam emails
#| that were predicted to be spam. `notspamTP` is the proportion of true not
#| spam emails that were predicted to be not spam.
#| tbl-pos: H
logCV2 |>
group_by(fold) |>
summarize(
count = n(),
accuracy = sum(obs == predspam) / n(),
notspamTP = sum(obs == 0 & predspam == 0) / sum(obs == 0),
spamTP = sum(obs == 1 & predspam == 1) / sum(obs == 1)
) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped"), full_width = FALSE
)
```
As might be expected, the larger model (see @tbl-email-spam2) was able to capture more nuance in the emails, which led to better predictions.
However, it is not true that adding variables will always lead to better predictions, as correlated or noise variables may dampen the signal from the set of variables that truly predict the status.
We encourage you to learn more about multiple variable models and cross-validation in your future exploration of statistical topics.
\clearpage
## Chapter review {#sec-chp26-review}
### Summary
Throughout the text, we have presented a modern view of introductory statistics.
Earlier, we presented graphical techniques which communicated relationships across multiple variables.
We also used modeling to formalize the relationships.
In @sec-inf-model-logistic we considered inferential claims on models which include many variables used to predict the probability of the outcome being a success.
We continue to emphasize the importance of experimental design in making conclusions about research claims.
In particular, recall that variability can come from different sources (e.g., random sampling vs. random allocation, see @fig-randsampValloc).
As you might guess, this text has only scratched the surface of the world of statistical analyses that can be applied to different datasets.
In particular, to do justice to the topic, the linear models and generalized linear models we have introduced could each be covered in their own course or book.
Hierarchical models, alternative methods for fitting parameters (e.g., Ridge Regression or LASSO), and advanced computational methods applied to multivariable models (e.g., permuting the response variable? one explanatory variable? all the explanatory variables?) are all beyond the scope of this book.
However, your successful understanding of the ideas we have covered has set you up perfectly to move on to a higher level of statistical modeling and inference.
Enjoy!
### Terms
The terms introduced in this chapter are presented in @tbl-terms-chp-26.
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
You should be able to easily spot them as **bolded text**.
```{r}
#| label: tbl-terms-chp-26
#| tbl-cap: Terms introduced in this chapter.
#| tbl-pos: H
make_terms_table(terms_chp_26)
```
\clearpage
## Exercises {#sec-chp26-exercises}
Answers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-26].
::: {.exercises data-latex=""}
{{< include exercises/_26-ex-inf-model-logistic.qmd >}}
:::