# Visualizing Multivariate General Linear Models {#vizmvglms}
```{r, include=FALSE}
knitr::opts_chunk$set(
comment = "#>", echo = TRUE, message=FALSE, note=FALSE, warning=FALSE, cache=TRUE
)
```
Years ago I was chit-chatting with my dad. I don't remember how it came up, but we were talking about the death penalty. I am vehemently opposed to the death penalty. My dad, on the other hand, is in favor. As we debated the pros and cons, I mentioned certain races were disproportionately affected.
"Hogwash," he said. "I've seen the statistics. They say that black men are no more likely than white men to receive the death penalty."
"True," I said. I was familiar with the statistic he was quoting.
But that's not the whole story. If a black man kills another black person, he is just as likely to get the death penalty as a white man who kills a white person. However, if a black man kills a *white* person, he is *much* more likely to receive the death penalty than a white man who kills a black person.
That there is unfair, you see. (Not to mention the fact that it's government-sponsored killing, and all that).
My dad's response?
"Well son," he said, "That just goes to show you can use statistics to prove anything."
That pissed me off, I must admit. It was a dismissal of my argument--a protestation that said my statistics were no better than his. But, as we'll learn, simple bivariate effects (in this case, race and death penalty) are always less complete than multivariate effects (in this case, race of perpetrator *and* race of victim).
## What is a Multivariate Relationship?
That there is a tricky question. And, you'll probably get different answers, depending on who you ask. Technically, multivariate means multiple variables. So, I suppose, any analysis that has more than one variable is a multivariate analysis.
But, I've actually seen people refer to multivariate relationships in one of two ways:
1. *When one performs an analysis of multiple **dependent** variables.* When statisticians use the word multivariate, they tend to be referring to this sort of analysis, where multiple dependent variables are investigated. These situations call for using Multivariate Analysis of Variance (MANOVA), factor analysis, principal component analysis, structural equation modeling, and canonical correlations. I personally find many (but not all) of these sorts of analyses archaic and quite useless. They are largely atheoretical, they confuse exploratory and confirmatory research, and they're hard to make sense of. That is *not* what this chapter is about.
2. *When one performs an analysis of multiple **independent** variables.* Most everybody, with the exception of heritage-breed statisticians, uses the term multivariate to refer to situations where multiple independent variables are used. This is what *I* mean when I say multivariate in this textbook. Does that make me a mutt-breed statistician? Maybe. But, sometimes you have to sacrifice authenticity for practicality.
So, yes, this chapter (and those that follow) are referring to situations where we try to predict a single outcome variable from multiple predictor variables. Yay.
## Reasons to use multivariate GLMs
The world is rarely simple. Rarely does one variable *alone* cause another variable. Rather, variables work together to produce effects. Or, one variable's effect is contingent upon another's. We use multivariate GLMs to gain a more complete picture of what is influencing our outcome variable.
But, that's a bit vague and stuff. Let me be a little more specific. **There are three reasons you would want to do a multivariate analysis**:
1. *Study interaction effects.* Sometimes variables "interact," which means their influence depends on other variable(s). For example, the annoyance I feel (outcome variable) about having department meetings (predictor variable #1) *depends* on whether there's food there (predictor variable #2); I'm happy to attend department meetings if they have baklava and pizza. Otherwise, boooo.
2. You want to *control* for things you're (likely) not interested in. For example, you might know depressed people tend to have crappy social lives, but you don't care about studying social functioning. You really care about depression's unique influence on, say, health. If that's the case, we want to "control" for social functioning.
3. To improve predictions. Long story short, the more variables you have, the better your predictions. So, naturally, if you want to predict who will be the next world wood-chopping champion, you can add more predictors (e.g., bicep circumference, years of experience, beard length).
Before I confabulate about each of these reasons, let's take a mini-break to look at some pictures. Why not, eh? Pictures are pretty.
Though, so as not to bother my mythical editor, I'm going to embed these pretty pictures within the process of data analysis.
## Visualizing Multivariate Relationships in Flexplot
In the previous section, I said there are three primary reasons for doing a multivariate analysis: to study interaction effects, to control for things, and to improve predictions. I'm going to temporarily put the third reason (prediction) aside and focus on interactions and conditioning. These two approaches have very different visualization strategies. But, before I can share those strategies, I need to show you some tools.
Tool time!
### Encoding Additional Dimensions Using Colors/Lines/Symbols or Panels
As a reminder, flexplot is essentially God-like in its ability to visualize data. For univariate data, remember all we have to do is write the variable name, followed by `~1`, then the dataset. For example...
```{r mvglmuniv}
require(flexplot)
data(avengers)
flexplot(speed~1, data=avengers)
flexplot(superpower~1, data=avengers)
```
For bivariate data, the pattern is similar, except we now put `outcome~predictor, data=dataset`. In the background, flexplot is going to figure out whether to plot a scatterplot or bee swarm plot:
```{r mvglmbivar}
flexplot(speed~superpower, data=avengers)
flexplot(speed~agility, data=avengers)
```
Yes, flexplot can read your mind. (Just make sure your computer is password protected. Flexplot hasn't yet figured out how to steal your secret recipes from your documents folder, but it's also pretty smart...so, yeah.)
For multivariate relationships, flexplot is even more powerful. In fact, when I originally conceptualized it, my whole purpose was to develop an easy interface for multivariate visualizations. So, that's where it truly shines.
When trying to visualize interactions, it is important to see all the variables simultaneously. But how exactly do you visualize multivariate data? Alas, we humans can really only see in 3-4 dimensions (width, height, depth, and time). But that's not even true. We don't actually see depth. Instead, we rely on sharpness, size, and shading to mimic a third dimension. Even 3D movies aren't truly 3D--they just trick our vision into believing we're actually seeing in 3D.
We *could* use the same tricks for visualization to see three-dimensional data. For example,
```{r mvglm3d}
require(scatterplot3d)
data(avengers)
scatterplot3d(avengers$speed, avengers$agility, avengers$superpower)
```
(By the way, I wouldn't bother installing that package. We won't use it again.)
I just find these sorts of plots hard to understand. If I were able to view them as a hologram projection, I might find them useful. (Help us Obi Wan, you're our only hope!). But as a static picture, it's a bit tough to tell what's going on. And, 3D plots are limited to three variables; good luck visualizing four dimensions!
So what do we do instead? We can embed the third dimension in the same two-dimensional plot using a different mode of representation.
Lemme show you what I mean. Let's visualize the same data using flexplot:
```{r mvglmflex1}
flexplot(agility~speed + superpower, data=avengers, method="lm")
```
Notice how I've embedded a third variable (superpower) in the same plot as the two-dimensional data. From this, it's much easier to see the multivariate relationship: superheroes tend to decrease slightly in agility when speed increases, while non-superheroes tend to increase.
Nice.
Let's look at another example, but this time with two numeric variables:
```{r mvglmnumerictwo}
flexplot(agility~strength + speed, data=avengers, method="lm")
```
What did Flexplot do? Well, you done handed it a numeric variable and asked it to plot three variables at the same time. Since Flexplot refuses to resort to 3D plots, you gave it no other option. It decided to break speed into three equal segments (think of it as slow, average, and fast). Then it plotted the datapoints for the slow folks in red, the average folks in green, and the fast folks in blue. It also represented these levels as separate lines (dotted, dashed, and dot-dashed) and symbols (circles, squares, triangles).
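If you're curious what that binning might look like under the hood, here's a minimal sketch using base R's `cut()`. (This is an illustration only; I'm assuming equal-width slices here, and flexplot's actual internals may bin differently, e.g., by quantiles.)
```{r mvglmbinsketch, eval=FALSE}
# a rough sketch of slicing a numeric variable into three groups
# (hypothetical; flexplot's actual binning may differ, e.g., using quantiles)
bins = cut(avengers$speed, breaks=3, labels=c("slow", "average", "fast"))
table(bins)
```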
But now it's hard to see what's going on. Sure, we can see the lines alright, but the raw data are such a jumbled mess it's impossible to tell whether the regression lines actually fit the data.
Quite a conundrum, that.
So, what to do? Well, let's investigate another technique called paneling. Paneling means that, rather than plotting the three levels of the speed variable as separate lines/colors/symbols, it plots each of the three levels in its own scatterplot (called a panel). To specify a panel, we use the code `y~x | z`. In other words, we replace the `+` with a `|`. That button's right above the return button.
Let's take a look, see:
```{r mvglmpanel}
flexplot(agility~strength | speed, data=avengers)
```
Now we're talking! Now there's no overlap, and we can see that some of these relationships are actually quite curvilinear (particularly the strength/agility relationship for the slow folks).
Maybe we'll model that curvilinearity with some quadratic lines:
```{r mvglmquadratic}
flexplot(agility~strength | speed, data=avengers, method="quadratic")
```
That's better.
But now we have a new problem: when the third variable was in the same panel, it was easy to make comparisons across its levels. Now, it's much harder. For example, in the image above, which curved line is higher on agility? The average speed folks or the slow folks?
Hard to say.
But, we can use a visual aid called a "ghost line." A ghost line repeats the line from one panel to another panel. This will make it much easier to compare lines across panels. Let's look at an example:
```{r mvglmghost}
flexplot(agility~strength | speed, data=avengers,
method="quadratic",
ghost.line="red")
```
Now we're getting somewhere! As you can see, the ghost line duplicates the relationship from the middle panel across to the other panels. Now it's very clear that the curve from the slow group (right-most plot) is higher in agility than the other two.
```{block, type='rmdnote'}
*Why call it 'Ghost Line'?*
Ghost line is kind of a funny name, isn't it? How did I come up with that name? Well, back when I was a teenager, Nintendo 64 came out. That, my friends, was a brilliant system. One of its hallmark games was Mariokart. In Mariokart, you could choose which character to race with (e.g., Bowser, Mario, Luigi), then you would circle a track, trying to avoid shell-missiles.
Anyhow, as I recall, whenever you were racing and set a speed record, the game would actually record your every move during the entire race. You then had the option to turn on the "ghost racer." This ghost racer was a semi-transparent image of the exact race that holds the record. The basic idea was that you could try to race against the best race ever recorded.
Cool idea and cool concept. So much so, that I decided I would borrow the idea. Like in Mariokart, the ghost line doesn't actually reflect anything in that particular panel, but it is a representation of something outside the panel.
```
Ghost lines, it turns out, are essential for doing multivariate analyses, particularly when we start adding even *more* variables to the analysis. They make it easy to compare fits across panels.
Let's look at another example, but this time we'll panel on two variables:
```{r mvglmghost2}
flexplot(shots.taken~agility | speed + superpower, data=avengers,
method="quadratic",
ghost.line="red")
```
Notice how *nearly every single datapoint* for the superhero plots (the bottom row of panels) is below the fit of the ghost line. In other words, superheroes are taking far fewer shots than non-superheroes. We would not be able to see this without the ghost line.
We'll see several examples in later chapters where the ghost line is essential to helping us figure out what's going on.
By the way, you may have noticed that the ghost line generally picks the middle-most panel as the reference. But, you can tell Flexplot which panel to reference from. Suppose, for example, we want to reference from the fast panel for superheroes. To do so, we would need to specify a list that tells flexplot the speed and superpower values to reference from:
```{r mvglmghostref}
flexplot(shots.taken~agility | speed + superpower, data=avengers,
method="quadratic",
ghost.line="red",
ghost.reference = list(speed=4.5, superpower="yes"))
```
Here, we've chosen a value for speed that is in the range of the left-most panel (4.5 is within the 4.3-4.9 range) and told flexplot to choose "yes" as the superpower value. We could also specify slow non-superheroes:
```{r mvglmghostref2}
flexplot(shots.taken~agility | speed + superpower, data=avengers,
method="quadratic",
ghost.line="orange",
ghost.reference = list(speed=5.2, superpower="no"))
```
Notice I also changed the color of the ghost line to orange.
By the way, you can (realistically) visualize up to four predictor variables. For example:
```{r mvglmfourvars}
flexplot(shots.taken~agility + superpower | speed + strength, data=avengers,
method="lm",
ghost.line="black")
```
Now, Flexplot is representing the second variable (superpower) as different lines/colors/symbols while paneling the remaining two variables (speed and strength). Also, the ghost line is referencing off of the middle-most panel for the superheroes.
Don't worry about interpreting this plot yet; I'll soon give you some cool tricks you can use to figure out what's going on here. The important take-home message for you is understanding how you can control the display of flexplot.
In summary, flexplot visualizes multivariate data by encoding additional variables as either a separate color/symbol/line, or in separate panels. To encode a variable as a color/line/symbol, use `y~x+z`. To encode as separate panels, use `y~x|z`. Paneling makes it hard to compare fits across panels, so we can use "ghost lines" to repeat the pattern from one panel to another.
## What are we looking for when studying a flexplot visual?
So, these graphics are all well and good, but what's the point? What do they tell us? Or, maybe the better question is: what are we looking for when studying a flexplot visual?
That's a fantastic question! Pat yourself on the back and buy yourself some ice cream :)
There are three things we're looking for when looking at a Flexplot graphic:
(1) Trends on the X-axis. This indicates a main effect of the variable on the X-axis.
(2) Nonparallel lines. This indicates there's an interaction between the variable on the X-axis and one of the other variables.
(3) Nonlinear effects. This is why Flexplot defaults to showing loess lines: if there are nonlinear effects, it will detect them.
I know, that was a lot of information. Let's go ahead and talk about each of these buggers, one at a time, shall we?
### Identifying trends in Flexplot
You already know this. Trust me. Remember, back in our [bivariate visualization chapter](#bivariate_visuals), we identified an association between two variables by looking for any sort of slope between X and Y. It's no different for multivariate plots.
The image below shows two plots: the plot on the left has a trend, while the plot on the right has little to no trend.
```{r mvglmtrend}
require(tidyverse)
set.seed(12332224)
x = rnorm(900)
y = .5*x + rnorm(length(x), 0, sqrt(1-.5^2))
z = rnorm(length(x))
d = data.frame(x=x, y=y, z=z)
a = flexplot(y~x | z, data=d, method="lm", se=F) + labs(title="Flexplot with a Trend (Slope)") +
theme(plot.title = element_text(size=18))
b = flexplot(y~z | x, data=d, method="lm", se=F) + labs(title="Flexplot without a Trend (Slope)") +
theme(plot.title = element_text(size=18))
require(patchwork)
a + b
```
### Identifying nonparallel lines in Flexplot
I'm not going to get into too much detail on this one in this chapter. That's the subject of my [chapter on interaction effects](#mvinteractions). But it's helpful for you to know this now, so you know why I'm telling you all this information.
Let's look at a graphic first. The plot on the left shows mostly parallel lines (with a red ghost line). What does that mean? It says that the relationship between X and Y is consistent: it doesn't matter the level of Z, the relationship between X and Y is consistently strong and positive. In the right plot, on the other hand, the relationship between X and Y changes drastically, depending on the level of Z; when Z is low, the relationship is essentially zero. As Z increases, the relationship between X and Y strengthens.
```{r mvglminteraction, echo=FALSE}
set.seed(123324)
n = 800
x = rnorm(n)
z1 = rnorm(n)
z2 = rnorm(n)
y1 = .5*x + .3*z1 + rnorm(length(x), 0, sqrt(1-.5^2))
y2 = .5*x + .3*z2 + .4*z2*x + rnorm(length(x), 0, sqrt(1-.5^2))
d = data.frame(x=x, y1=y1, y2=y2, z1=z1, z2=z2)
require(flexifiers)
a = (flexplot(y1~x | z1, data=d, method="lm", se=F, ghost.line="red") +
labs(title="Flexplot with Parallel Lines", x="X", y="Y") +
theme(plot.title = element_text(size=14))) %>%
modify_labels(row_panels="Z")
b = (flexplot(y2~x | z2, data=d, method="lm", se=F, ghost.line="red") +
labs(title="Flexplot with Nonparallel Lines", x="X", y="Y") +
theme(plot.title = element_text(size=14))) %>%
modify_labels(col_panels="Z")
require(patchwork)
a + b
```
We call this an "interaction effect." An interaction effect says that the effect of one variable on another (e.g., X on Y) depends on the level of a third variable (Z in this case). Interaction effects are pretty freaking important in stats--if they exist and we fail to identify them, it really screws up our model! So we want to be able to identify them. With ghost lines (and Flexplot), we have a way: if the lines are not approximately parallel, there's an interaction in the data that needs to be modeled.
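In GLM terms, by the way, an interaction is typically captured by adding a product term to the model. A minimal sketch of the idea:

$$Y = b_0 + b_1 X + b_2 Z + b_3 (X \times Z) + e$$

If $b_3$ is nonzero, the slope relating $X$ to $Y$ shifts with the level of $Z$: exactly the nonparallel lines we saw in the right-hand plot.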
### Identifying nonlinear effects
Like interaction effects, nonlinear effects are important to detect, if they're there. If we miss them (and don't model them), it really screws up our model and we might not even know it!
Fortunately, it's easy to identify nonlinear effects. We just ask Flexplot to show loess lines. Remember that loess lines bend with the data. So, if there's a bend, it will show that bend. Here's an example below. The plot on the left shows no nonlinear effects. The one on the right shows a nonlinear effect. When looking at a flexplot graphic with multiple panels, it's important to look for evidence of nonlinearity in any panel.
```{r mvglmnonlinear, echo=FALSE}
set.seed(123324)
n = 800
x = rnorm(n)
y1 = .5*x + rnorm(length(x), 0, sqrt(1-.5^2))
y2 = .5*x -.2*x^2 + rnorm(length(x), 0, sqrt(1-.5^2))
d = data.frame(x=x, y1=y1, y2=y2)
a = flexplot(y1~x , data=d, method="loess", se=F) +
labs(title="Flexplot without nonlinear effects", x="X", y="Y") +
theme(plot.title = element_text(size=14))
b = flexplot(y2~x , data=d, method="loess", se=F) +
labs(title="Flexplot with nonlinear effects", x="X", y="Y") +
theme(plot.title = element_text(size=14))
require(patchwork)
a + b
```
These multipaneled flexplots are kind of tricky to interpret, but they're pretty necessary. They allow us to detect interaction/nonlinear effects. Assuming there are none of these sorts of effects, we can use a *much simpler visualization tool* called Added Variable Plots.
But, that condition is pretty important, so I'm going to say it again, but with bold this time: **added variable plots are an easy way to interpret multivariate data, but you can only use them if there are no interactions/nonlinear effects in your model.**
(This statement is a bit simplified; you can actually use AVPs to visualize models with these sorts of effects, but I'm not going to get into that here. If you're interested, you can read about it [in a paper I wrote](https://psyarxiv.com/avu2n/).)
### Encoding Additional Dimensions Using Added Variable Plots
Earlier I said we use multivariate analyses for two primary purposes (prediction aside): (1) to identify interactions, and (2) to "control" for other variables. The previous section displayed the additional variables as colors/symbols/lines or as separate panels. These plots are great for detecting interactions. (Again, we'll review how this is done later). If there are no interactions, we can simplify the visualization immensely by using added variable plots (or AVPs).
Before we talk about AVPs, though, let me offer some definitions of important terms. Let's say you know that one's social support structure affects health. You also know that depression affects health, but you want to see the unique effect of depression on health, above and beyond social structure. When doing this sort of analysis, we have three types of variables: the outcome variable (you already know what this is), the "control" variable, and the "interest" variable.
**The control variable is the variable you're trying to control for.** Usually, we don't care about this variable. In our example, social support is the control variable.
**The interest variable is the variable whose unique effect you're interested in studying.** In this example, our interest variable is depression, because that's the variable whose influence we're actually interested in studying.
You can have multiple control variables (e.g., what is the effect of depression on health, after controlling for a. social support, and b. exercise frequency). You could also have multiple interest variables, but this is less common.
Let me quickly tell you what AVPs do, before I go into more detail:
**AVPs fit a model between the control variable(s) and the outcome, calculate the residuals, then plot those residuals against the interest variable.**
Again, I'll go over this in detail and explain *why* we do this. But, before I lose your interest or before I lose you in the details, here's an example of an AVP in R:
```{r mvglmavp1}
added.plot(agility~speed + strength, data=avengers)
```
By default, AVPs control for the first variable(s) entered (speed), then plot the last variable entered (strength) on the x-axis.
#### What Do AVPs Do and Why?
This will be easiest to explain with a simple example. (For a fuller explanation/example, see [my YouTube video on AVPs](https://youtu.be/AK1iZyY6lMo).) Let's say we have the following dataset:
```{r mvglmtab, echo=FALSE}
set.seed(121211)
cor.mat = matrix(c(
1, -.6, .9,
-.6, 1, -.2,
.9, -.2, 1
), nrow=3)
require(tidyverse)
d = data.frame(MASS::mvrnorm(n=10, mu=c(0,0,0), cor.mat))
names(d) = c("health", "depression", "support")
d = d %>% mutate(health = rescale(health, 20, 4) %>% floor_ceiling(10,30) %>% round,
depression = rescale(depression, 12,5) %>% floor_ceiling(1, 30) %>% round,
support = rescale(support, 50, 15) %>% floor_ceiling(0) %>% round)
mod = lm(health~support, data=d)
knitr::kable(d)
```
Here, `support` is our control variable. Let's go ahead and look at a scatterplot of that model:
```{r mvglmavpfitmod, echo=FALSE}
flexplot(health~support, data=d, method="lm", se=F)
```
The regression line has the equation: $\text{health} = `r round(coef(mod)[1], digits=1)` + `r round(coef(mod)[2], digits=1)`\times \text{support}$. Let's go ahead and put each person's predicted score back in the table, as well as their residual:
```{r, echo=FALSE}
mod = lm(health~support, data=d)
d$prediction = predict(mod) %>% round(1)
d$residual = residuals(mod) %>% round(1)
knitr::kable(d)
```
Just for fun, go ahead and verify that the column labeled `prediction` is equal to `r round(coef(mod)[1], digits=1)` + `r round(coef(mod)[2], digits=1)`$\times$ support. Also, go ahead and verify that the column labeled `residual` is equal to health - prediction.
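If you'd rather let R do the checking, here's a quick sketch (this assumes the fitted model `mod = lm(health~support, data=d)` from above is sitting in your workspace):
```{r mvglmverify, eval=FALSE}
# recompute both columns by hand; they should match the table up to rounding
coef(mod)[1] + coef(mod)[2]*d$support  # compare to the prediction column
d$health - d$prediction                # compare to the residual column
```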
Now, we have all we need to create an AVP. All we have to do is plot depression on the $X$ axis and the column labeled "residual" on the $Y$ axis:
```{r mvglmresiduals, echo=FALSE}
flexplot(residual~depression, data=d, method="lm", se=F)
```
What is this telling us? Remember what a residual means: *a residual tells us what remains in the outcome variable after removing the effect of whatever we modeled*. Here, we modeled the effect of social support on health. So, this residual represents health, after removing the effect of social support.
And that, my friends, is exactly what we were looking for! Whenever we are trying to "control" or "condition" on some variable, the appropriate visual to use is an AVP.
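To recap the whole recipe in code, here's a minimal by-hand sketch (using the same `d` dataset from the tables above):
```{r mvglmavpbyhand, eval=FALSE}
# build an AVP by hand, in three steps
mod = lm(health~support, data=d)       # 1. fit the control model
d$residual = residuals(mod)            # 2. keep what support can't explain
flexplot(residual~depression, data=d,  # 3. plot the residuals against
  method="lm", se=F)                   #    the interest variable
```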
By the way, I didn't invent AVPs. (I wish I had!). But, I have made a few cool modifications to them.
### Dustin's Cool Modifications to Added Variable Plots
I think most people would be confused looking at added variable plots. Remember, we're plotting *residuals*, which means your outcome variable will now be centered on zero. I think if reviewers see this, they might have a heart attack.
"Wait a minute!" they might say, "Your DV ranges from 10-60 and yet your Y-axis ranges from -10 to 10. Clearly the authors are bozos and have done something wrong! Reject!"
Of course, you, being the wise statistical Jedi that you are, would recognize that it's plotting a *residual*, not a raw score. To avoid confusion from under-educated reviewers, Flexplot actually adds the mean of the DV back into the residuals. This makes sure the Y axis is on the same scale as the original DV. Flexplot also relabels the $Y$-axis to highlight that it's showing *residuals*, not actual outcome scores. (In this case, it would say `health | support`).
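In code, that adjustment amounts to something like the following (a sketch of the idea, not necessarily flexplot's literal internals):
```{r mvglmavprescale, eval=FALSE}
# shift the residuals back onto the original outcome scale
d$residual.rescaled = residuals(mod) + mean(d$health)
```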
So, in short, AVPs literally fit a linear model predicting the outcome from the control variable(s), compute the residuals, then plot the interest variable against these residuals. These plots visually show us the effect of the interest variable, after controlling for the control variable(s).
These were small modifications, I admit. Rescaling the residuals and relabeling the Y-axis make the plot more intuitive, but the next modification is more extreme. We're going to combine the awesomeness of Flexplot with the awesomeness of AVPs.
Remember how Flexplot is able to visualize multiple variables at a time? After a bit of math, I figured out how to extend AVPs to multivariate plots. Let's look at an example:
```{r mvglmmvavp, echo=FALSE}
added.plot(weight.loss~motivation + therapy.type,
lm_formula = weight.loss~health*muscle.gain,
data=exercise_data)
```
The above plot shows the relationship between two predictor variables (therapy.type and motivation) and weight loss, after controlling for health/muscle gain (and their interaction). That's pretty sweet.
#### Using Flexplot To Do AVPs
There are really two ways to use AVPs. One way is easy, but less flexible. The other is harder, but much more flexible. If you only have one interest variable and your control variables have simple relationships with the outcome, you can create AVPs like this:
```{r mvglmavpeasy}
added.plot(ptsd~agility + north_south, data=avengers)
```
The last variable entered (in this case, `north_south`) will be treated as the interest variable. All others will be control variables. Here's an example with multiple control variables:
```{r mvglmavpeasy2}
added.plot(ptsd~agility + speed + superpower + kills + north_south, data=avengers)
```
(These plots seem to say that, even after controlling for agility, speed, superpowers, and # of kills, which battlefield one fought on [north versus south] *still* predicts ptsd).
Sometimes, however, you might want to have multiple interest variables. Or maybe you have quadratic and/or interaction effects you need to model for the control variables. In that case, you'll need to add the `lm_formula` argument to `added.plot`:
```{r mvglmcomplex}
added.plot(ptsd~willpower + north_south, data=avengers,
lm_formula = ptsd~minutes.fighting + I(minutes.fighting^2))
```
The above example fits a quadratic (nonlinear) regression between `minutes.fighting` and `ptsd`, then plots willpower and north_south against these residuals. In other words, `minutes.fighting` (and the nonlinear effect of `minutes.fighting`) are the "control variables" and willpower and north_south are the interest variables.
In short, the `lm_formula` argument simply tells R which predictors to use to create the residuals, while the formula at the beginning is interpreted exactly as it would be in flexplot.
## Summary
For this book, "multivariate" means we have multiple independent variables. We use multivariate analyses for three reasons: 1) to model interactions, 2) to control for things, and/or 3) to predict things.
When visualizing, it's important to consider the role of each variable. In this chapter, I mentioned three roles: the dependent variable, the control variable (which we usually don't care about), and the interest variable.
To visualize multivariate data, we can either encode our third variable as a different color/line/symbol (using the syntax `y~a+b`), or we can panel that variable (using the syntax `y~a|b`). Using separate lines/colors/symbols tends to clutter the display, while paneling makes it harder to compare across panels. To make it easier to compare across panels, we can use ghost lines.
When studying paneled plots, we're looking for trends, curvilinearity, and/or interactions. Ideally, there are no interactions and we won't have to visualize multiple variables together, but can instead use added variable plots. Using added variable plots only makes sense if our variables don't interact with one another.
With AVPs, we can either use the syntax `outcome ~ control_variable + interest_variable`, where the last variable entered is the one displayed on the x-axis (while the other variables are residualized), or we can use this syntax: `added.plot(y~a + b, lm_formula = y~c + d, data=d)`. In the latter case, the formula specified in `lm_formula` tells the AVP which variables to create residuals from, while the other variable(s) (in this case `a` and `b`) are displayed using regular flexplot syntax.
## Practice
1. Describe how each of the following relationships will look. Try drawing a plot that matches your expectations. Then plot the actual relationships. How close were you?
```{r, eval=FALSE}
flexplot(superpower~1, data=avengers)
flexplot(speed~agility + superpower, data=avengers)
flexplot(agility~superpower | speed, data=avengers)
flexplot(agility~speed + strength | superpower, data=avengers)
flexplot(agility~speed | strength + superpower, data=avengers)
```
2. Suppose we wanted to plot the relationship between agility and speed, after controlling for superpower. How would you use an AVP to do that?
3. Suppose our interest variables are speed and agility, our outcome variable is ptsd, and our control variables are kills and injuries (and the interaction between the two). How would you use an AVP to visualize that?