-
Notifications
You must be signed in to change notification settings - Fork 18
/
graphical-models.Rmd
1021 lines (707 loc) · 74.9 KB
/
graphical-models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# (PART\*) Part I: Core SEM {-}
# Graphical Models
```{r setupGM, include=FALSE, cache=FALSE}
knitr::clean_cache(clean=T)
knitr::opts_chunk$set(cache.rebuild=F, cache=F)
library(lavaan); library(DiagrammeR); library(bnlearn); library(visNetwork); library(igraph) # only load mediation as needed
```
A graphical model can be seen as a mathematical or statistical construct connecting *nodes* (or vertices) via *edges* (or links). When pertaining to statistical models, the nodes might represent variables of interest in our data set, and edges specify the relationships among them. Visually they are depicted in the style of the following examples.
<img src="img/graphicalModels.png" style="display:block; margin: 0 auto; width:50%">
```{r graphVizExample, echo=FALSE, eval=FALSE}
# encoding issue with icons means just creating the png instead
tags$div(style="width:50%; margin-left:auto; margin-right:auto; background:transparent; background-color:'#fffff8'",
DiagrammeR("
digraph DAG {
# Intialization of graph attributes
graph [overlap = false fill='#fffff8']
# Initialization of node attributes
node [shape = circle,
fontname = Helvetica,
color = gray80,
type = box,
fixedsize = true]
# Initialization of edge attributes
edge [color = gray25, rel = yields]
# Node statements
X; Y; Z; ☺; 1; 2; 3;
node [fontcolor = 'darkred']
♥; ☻; # Diagrammer/graphviz doesn't recognize this
# Edge statements
X -> Y; Z -> Y;
☺ -> ♥; ♥ -> ☻; ☺->☻;
# Initialization of edge attributes
edge [color = gray25, rel = yields, arrowhead=none]
1 -> 2; 2 -> 3; 1 -> 3;
}
", type='grViz')
)
```
Any statistical model you've conducted can be expressed as a graphical model. As an example, the first graph with nodes X, Y, and Z might represent a regression model in which X and Z predict Y. The emoticon graph shows an indirect effect, and the 123 graph might represent a correlation matrix.
A key idea of a graphical model is that of <span class="emph">conditional independence</span>, something one should keep in mind when constructing their models. The concept can be demonstrated with the following graph.
```{r conditionalIndependence, echo=FALSE, eval=F}
tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%;",
DiagrammeR::DiagrammeR("scripts/condindie.gv", type='grViz', width=500, height=250))
# export_graph() # won't work except on dgr objects of course
# <img src="img/conditionalIndependence.png" style="display:block; margin: 0 auto; width:25%">
```
<br>
In this graph, X is *conditionally independent* of Z given Y- there is no correlation between X and Z once Y is accounted for[^condind]. We will revisit this concept when discussing path analysis and latent variable models. Graphs can be *directed*, *undirected*, or *mixed*. Directed graphs have arrows, sometimes implying a causal flow (a difficult endeavor to demonstrate explicitly) or noting a time component. Undirected graphs merely denote relations among the nodes, while mixed graphs might contain both directional and symmetric relationships. Most of the models discussed in this document will be directed or mixed.
## Directed Graphs
As noted previously, we can represent standard models as graphical models. In most of these cases we'd be dealing with directed or mixed graphs. Almost always we are specifically dealing with directed *acyclic* graphs, where there are no feedback loops.
### Standard linear model
Let's start with the standard linear model (SLiM), i.e. a basic regression we might estimate via ordinary least squares (but not necessarily). In this setting, we want to examine the effect of each potential predictor (x* in the graph) on the target variable (y). The following shows what the graphical model might look like.
```{r SLiMasGraph, echo=FALSE, eval=TRUE}
# library(lavaan)
# X = cbind(1, matrix(rnorm(150), 50))
# y = X %*% c(2,.2,-.2,0) + rnorm(50)
# d = data.frame(X1 = X[,2], X2=X[,3], X3=X[,4], y)
# model = "
# y ~ X1 + X2 + X3
# "
# out = sem(model, data=d)
# psych::lavaan.diagram(out, regression=TRUE)
#
# # X7 <- matrix(c("a","b","c","d","e","f"),nrow=6)
# # structure.diagram(X7,regression=TRUE)
#
#
# d = matrix(c('', '',''), nrow=3)
# psych::structure.diagram(d,regression=TRUE, labels=c('X1', 'X2', 'X3','Y'))
#
#
# library(DiagrammeR)
tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%;",
DiagrammeR::DiagrammeR("scripts/SLiM.gv", type='grViz', width=500, height=250))
```
<br>
<!-- <img src="img/regasgraph.png" style="display:block; margin: 0 auto; width:50%"> -->
While we start simpler, we can use much of the same sort of thinking with more complex models later. In what follows, we'll show that whether we use a standard R modeling approach (via the <span class="func">lm</span> function), or an SEM approach (via the <span class="func">sem</span> function in <span class="pack">lavaan</span>), the results are identical aside from the fact that sem is using maximum likelihood (so the variances will be slightly different). First, we'll with the standard <span class="func">lm</span> approach.
```{r lmvslavaan1, cache=TRUE}
mcclelland = haven::read_dta('data/path_analysis_data.dta')
lmModel = lm(math21 ~ male + math7 + read7 + momed, data=mcclelland)
```
Now we can do the same model using the lavaan package, and while the input form will change a bit, and the output will be presented in a manner consistent with SEM, the estimated parameters are identical. Note that the square of the residual standard error in the <span class="func">lm</span> output is compared to the variance estimate in the lavaan output.
```{r lmvslavaan2, cache=TRUE}
library(lavaan)
model = "
math21 ~ male + math7 + read7 + momed
"
semModel = sem(model, data=mcclelland, meanstructure = TRUE)
summary(lmModel)
summary(semModel, rsq=T)
```
As we will see in more detail later, SEM incorporates more complicated regression models, but at this point the parameters have the exact same understanding as our standard regression, because there is no difference between the two models except conceptually, in that SEM explicitly brings a causal aspect to bear on the interpretation. As we go along, we can see the structural equation models statistically as generalizations of those we are already well acquainted with, and so one can use that prior knowledge as a basis for understanding the newer ones.
### Path Analysis
<span class='emph'>Path Analysis</span>, and thus SEM, while new to some, is in fact a very, very old technique, statistically speaking[^pathOld]. It can be seen as a generalization of the SLiM approach that can allow for indirect effects and multiple target variables. Path analysis also has a long history in the econometrics literature though under different names (e.g. instrumental variable regression, 2-stage least squares etc.), and through the computer science realm through the use of graphical models more generally. As such, there are many tools at your disposal for examining such models, and I'll iterate that much of the SEM perspective on modeling comes largely from specific disciplines (especially psychology and education), while other approaches that are very common in other disciplines may be better for your situation.
#### Types of relationships
The types of potential relationships examined by path analysis can be seen below.
<img src="img/correlationComponents.png" style="display:block; margin: 0 auto; width:50%; height:50%;">
Most models deal only with <span class="emph">direct effects</span>. In this case there are no intervening variables we wish to consider. If such variables do exist, we are then dealing with what is often called a mediation model, and must interpret both <span class="emph">indirect</span> and (potentially) direct effects. When dealing with multiple outcomes, some may have predictors in common, such that the outcomes' correlation can be explained by those <span class="emph">common causes</span> (this will come up in factor analysis later). Often there are <span class="emph">unanalyzed correlations</span>. As an example, every time you run a regression, the correlations among predictor variables are left 'unanalyzed'.
#### Multiple Targets
While relatively seldom used, multivariate linear regression[^mvreg] is actually very straightforward in some programming environments such as R, and conducting models with multivariate outcomes does not require anything specific to SEM, but that is the realm we'll stick with. Using the [McClelland data][McClelland], let's try it for ourselves. First, let's look at the data to get a sense of things.
```{r mcclellanddat, echo=F, cache=T}
# note something has seriously borked with htmlwidgets and bookdown as of May 2018
describe_all_num(mcclelland) %>%
kable(format = 'html') %>%
kable_styling(full_width = F)
numericdat = select(mcclelland, -college, -college_missing, -male, -adopted)
```
<br>
```{r mcclellandCors, echo=F, cache=FALSE}
# this is the only thing working right now
# tags$div(style="width:250px; margin:0 auto;",
# corr_heat(cor(numericdat, use='pair')))
corr_heat(cor(numericdat, use='pair'))
```
<br>
While these are only weakly correlated variables to begin with, one plausible model might try to predict math and reading at age 21 with measures taken at prior years.
```{r paMultipleTargets}
model = "
read21 ~ attention4 + vocab4 + read7
math21 ~ attention4 + math7
read21 ~~ 0*math21
"
mvregModel = sem(model, data=mcclelland, missing='listwise', meanstructure = T)
coef(mvregModel)
```
The last line of the model code clarifies that we are treating **math21** and **read21** as independent, and as such we are simply running two separate regressions simultaneously. Note also that the coefficients in the output with `~1` are the intercepts. We can compare this to standard R regression. A first step is taken to make the data equal to what was used in <span class="pack">lavaan</span>. For that we can use the <span class="pack">dplyr</span> package to select the necessary variables for the model, and then omit rows that have any missing.
```{r regvspath1, cache=TRUE}
mcclellandComplete = select(mcclelland, read21, math21, attention4, vocab4, read7, math7) %>%
na.omit
lm(read21 ~ attention4 + vocab4 + read7, data=mcclellandComplete)
lm(math21 ~ attention4 + math7, data=mcclellandComplete)
```
Note that had the models for both outcomes been identical, we could have run both outcomes simultaneously using <span class="func">cbind</span> on the dependent variables[^notmvreg]. However, we can and probably should estimate the covariance of math and reading skill at age 21. Let's rerun the path analysis removing that <span class="emph">constraint</span>.
```{r paMultipleTargets2, cache=TRUE}
model = "
read21 ~ attention4 + vocab4 + read7
math21 ~ attention4 + math7
"
mvregModel = sem(model, data=mcclelland, missing='listwise', meanstructure = T)
coef(mvregModel)
```
We can see that the coefficients are now slightly different from the SLiM approach. The `read21~~math21` value represents the residual covariance between math and reading at age 21, i.e. after accounting for the other covariate relationships modeled, it tells us how correlated those skills are. Using <span class='func'>summary</span> will show it to be statistically significant. I additionally ask for standardized output to see the correlation along with the covariance. It actually is not very strong even if it is statistically significant.
```{r summaryPaMultipleTargets2, eval=T, cache=TRUE}
summary(mvregModel, standardized=T)
```
Whether or not to take a multivariate/path-analytic approach vs. separate regressions is left to the researcher. It's perhaps easier to explain univariate models in some circumstances. But as the above shows, it doesn't take much to take into account correlated target variables.
#### Indirect Effects
So path analysis allows for multiple target variables, with the same or a mix of covariates for each target. What about <span class='emph'>indirect effects</span>? Standard regression models examine direct effects only, and the regression coefficients reflect that direct effect. However, perhaps we think a particular covariate causes some change in another, which then causes some change in the target variable. This is especially true when some measures are collected at different time points. Note that in SEM, any variable in which an arrow is pointing to it in the graphical depiction is often called an <span class="emph">endogenous</span> variable, while those that only have arrows going out from them are <span class="emph">exogenous</span>. Exogenous variables may still have (unanalyzed) correlations among them. As we will see later, both observed and latent variables may be endogenous or exogenous.
Consider the following model.
```{r mediationDiag, echo=FALSE, eval=TRUE, cache=FALSE, fig.align='center'}
# tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%",
# DiagrammeR::grViz('scripts/mediation.gv', width=500, height=250))
# DiagrammeR::grViz('scripts/mediation.gv')
```
<img src="img/mediation.png" style="display:block; margin: 0 auto;" width=50%>
Here we posit attention span and vocabulary at age 4 as indicative of what to expect for reading skill at age 7, and that is ultimately seen as a precursor to adult reading ability. In this model, attention span and vocabulary at 4 only have an indirect effect on adult reading ability through earlier reading skill. At least temporally it makes sense, so let's code this up.
```{r mediationModel, cache=TRUE}
model = "
read21 ~ read7
read7 ~ attention4 + vocab4
"
mediationModel = sem(model, data=mcclelland)
summary(mediationModel, rsquare=TRUE)
```
What does this tell us? As before, we interpret the results as we would any other regression model, though conceptually there are two sets of models to consider (though they are estimated simultaneously[^oldMediation]), one for reading at age 7 and one for reading at age 21. And indeed, one can think of path analysis as a series of linked regression models[^linkedreg]. Here we have positive relationships between both attention and vocabulary at age 4 and reading at age 7, and a positive effect of reading at age 7 on reading at age 21. Statistically speaking, our model appears to be viable, as there appear to be statistically significant estimates for each path.
However, look at the R^2^ value for reading at age 7. We now see that there is little to no *practical* effect of the age 4 variables at all, as all we are accounting for is < 4% of the variance, and all that we have really discovered is that prior reading ability affects later reading ability.
We can test the indirect effect itself by labeling the paths in the model code. In the following code, I label them based on the first letter of the variables involved (e.g. `vr` refers to the vocab to reading path), but note that these are arbitrary names. I also add the direct effects of the early age variable. While the indirect effect for vocab is statistically significant, as we already know there is not a strong correlation between these two variables, it's is largely driven by the strong relationship between reading at age 7 and reading at age 21, which is probably not all that interesting. A comparison of AIC values, something we'll talk more about later, would favor a model with only direct effects[^medvsdirectModcomp].
```{r mediationModelTestIndir, cache=TRUE}
model = "
read21 ~ rr*read7 + attention4 + vocab4
read7 ~ ar*attention4 + vr*vocab4
# Indirect effects
att4_read21 := ar*rr
vocab4_read21 := vr*rr
"
mediationModel = sem(model, data=mcclelland)
summary(mediationModel, rsquare=TRUE, fit=T, std=T)
```
In the original article, I did not find their description or diagrams of the models detailed enough to know precisely what the model was in the actual study[^mccnodescript], but here is at least one interpretation if you'd like to examine it further.
```{r mcclellandStudy, eval=FALSE, cache=TRUE}
modReading = "
read21 ~ read7 + attention4 + vocab4 + male + adopted + momed
read7 ~ attention4 + vocab4
"
reading = sem(modReading, data=mcclelland, missing='fiml', mimic = 'Mplus', std.ov=TRUE)
summary(reading, rsquare=TRUE)
```
#### More mediation
Here I would like to demonstrate a powerful approach to mediation models using Imai's <span class="pack">mediation</span> package. It is based on a theoretical framework that allows for notably complex models for mediator, outcome or both, including different types of variables for the mediator and outcome. While it is simple to conduct, the results suggest there are more things to think about when it comes to mediation.
To begin, we will run two separate models, one for the mediator and one for the outcome, using the mediate function to use them to estimate the mediation effect. In this case we'll focus on the indirect effect of attention.
```{r imai0, cache=F, echo=F, eval=FALSE}
# due to time and mediation loading everything under the sun, just run this and
# save out as RData; show the next one so you don't have comment signs
library(mediation)
mediator_model = lm(read7 ~ attention4 + vocab4, data=mcclellandComplete)
outcome_model = lm(read21 ~ read7 + attention4 + vocab4, data=mcclellandComplete)
results = mediate(mediator_model, outcome_model, treat="attention4", mediator = 'read7')
save(results, file='data/imai_results.RData')
```
```{r imai, cache=F, eval=FALSE}
library(mediation)
mediator_model = lm(read7 ~ attention4 + vocab4, data=mcclellandComplete)
outcome_model = lm(read21 ~ read7 + attention4 + vocab4, data=mcclellandComplete)
results = mediate(mediator_model, outcome_model, treat="attention4", mediator = 'read7')
summary(results)
```
```{r printImai, echo=FALSE}
load('data/imai_results.RData')
mediation::summary.mediate(results)
```
```{r removeMASS, include=FALSE, eval=F}
# sigh; if you do run the previous chunk
detach(package:mediation); detach(package:MASS)
```
The first thing to note is that there are multiple results to consider, we have the average causal mediated effect, the average direct effect, the total effect and the proportion mediated. What is an <span class="emph">average causal mediated effect</span>? First, think about a standard experimental design in which participants are assigned to control and treatment groups. Ideally, if we really wanted to know about true effects, we'd see the outcome when an individual was a control and when they were were in the treatment. The true *causal* effect for an individual would be the difference between their scores when in the control and when in the treatment. Thus there are two hypothetical situations of different conditions, or <span class="emph">counterfactuals</span>, of interest- what would be the *potential* outcome of a control case if they treated, and vice versa for those in the treated group[^potout_vs_counterfact].
This <span class="emph">potential outcomes framework</span> is the essence of the <span class="func">mediate</span> function. In the presence of a mediator variable, we can start by defining the <span class="emph">average direct effect</span> (ζ) as follows, where $Y_i$ is the potential outcome for unit $i$ (e.g. a person), $M_i$ the potential outcome of the mediator, and $T_i$ the treatment status. From the mediate function help file[^imai_links]:
> The expected difference in the potential outcome when the treatment is changed but the mediator is held constant at the value that would realize if the treatment equals $t$
$$ζ(t) = E[Y(t_1, M(t)) - Y(t_0, M(t))]$$
The **ACME** meanwhile is:
> The expected difference (δ) in the *potential outcome* when the mediator took the value that would realize under the treatment condition as opposed to the control condition, while the treatment status itself is held constant.
$$δ(t) = E[Y(t, M(t_1)) - Y(t, M(t_0))]$$
> where $t$, $t_1$, $t_0$ are particular values of the treatment T such that $t_1 \neq t_0$, M(t) is the potential mediator, and Y(t,m) is the potential outcome variable.
Adding these two gives us the <span class="emph">total effect</span>. In the case of continuous variables, the values for 'control' and 'treatment' are fixed to specific values of interest, but by default represent a unit difference in the values of the variable. The results of the <span class="func">mediate</span> function are based on a simulation approach, and averaged over those simulations. You'll note there is also a 'proportion of effect mediated'. This is the size of the ACME to the the total effect. It too is averaged over the simulations (and thus is not simply the ACME/Total Effect reported values). It's not clear to me how useful of a notion this is, and as noted by the boundaries, it's actually a ratio as opposed to a true proportion (as calculated), such that under some simulations the value can be negative or greater than 1.
You may compare the result with the previous <span class="pack">lavaan</span> results. I've rerun it with bootstrap estimates (set `se='boot'`) to be more in keeping with the simulation approach of Imai[^bootmed]. You could also use the <span class="pack">blavaan</span> package to get true Bayesian estimates.
```{r imai_vs_lavaan, echo=F, cache=T}
mediationModel = sem(model, data=mcclelland, se='boot')
# lavaan estimate
parameterestimates(mediationModel) %>%
filter(label=='att4_read21' | (lhs =='read21' & rhs=='attention4')) %>%
mutate_if(is.numeric, round, digits=3) %>%
arrange(lhs)
# comparison sims
# coefs = mvrnorm(1000, mu=coef(outcome_model), Sigma=vcov(outcome_model))
# de = matrix(c(1,0,0,0,1,0,1,0), ncol=4, byrow=T) %*% t(coefs)
# mean(apply(de, 2, diff))
```
As expected, the results of <span class="pack">lavaan</span> and <span class="pack">mediation</span> packages largely agree with one another. However, the potential outcomes framework has the *product-of-paths* approach as a special case, and thus generalizes to different models and types of variables for the mediator and outcome.
##### Explicit Example
While we have shown how to use the tool and compared the outputs, let's make things more simple and explicit to be sure that we understand what's going on. The following closely follows [Pearl](http://ftp.cs.ucla.edu/pub/stat_ser/r370.pdf) (section 4.4)[^docalculus]. Consider the following three variable setup with estimated parameters. We have a treatment variable that is dummy coded with 0 = 'Control group' and 1 = 'Treatment group', a continuous Mediator and Target variable.
```{r cfgraph, echo=FALSE}
tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%",
DiagrammeR::DiagrammeR("scripts/counterfact.gv", type='grViz', width=500, height=250))
```
<br>
In addition, we'll consider just a single individual, whose observed values are 'control', i.e. 0, 3 for the mediator, and 2 for the target. We define the unique values $\epsilon_*$ that represent invariant characteristics that exist regardless of changes to interventions or counterfactual hypotheses and set them to the following:
$\epsilon_1 = 0$
$\epsilon_2 = 3 - 0*.3 = 3$
$\epsilon_3 = 2 - 0*.5 - 3*.2 = 1.4$
If a person is in the control group, this setup suggests the predicted value for the Target would be 3\*.2 + 1.4 = 2. Let's now figure out what the value would be if this person was in the treatment. The Mediator value would now be 3 + 1\*.3 = 3.3. This would in turn result in a predicted Target value as follows: 1.4 + 3.3\*.2 + 1\*.5 = 2.56. And we could take this approach for other situations. We could, for example, ask what the expected value be if the Treatment level is 0, but the Mediator was at whatever value it would have been, had the person received the treatment? The value is 3.3\*.2 + 1.4 = 2.06, which if we subtract from when they are are fully in the control group gives us the mediated effect 2.06 - 2 = .06. In this linear model it is equivalent to indirect effect via the product-of-paths approach, .3\*.2 = .06. But now with the estimate based on the target variable rather than coefficients, we don't have to be restricted to linear models.
```{r simmed, echo=F, eval=F}
set.seed(123)
N = 10000
Treatment = rep(0:1, length.out = N)
Mediator = 3 + .3*Treatment + rnorm(N, sd=.1)
Target = 0 + .5*Treatment + .2*Mediator + rnorm(N, sd=.1)
d = data_frame(Treatment, Mediator, Target)
cf = '
Mediator ~ Treatment
Target ~ Mediator + Treatment
'
cfmod = sem(cf, data=d)
# summary(cfmod)
# semPlot::semPaths(cfmod, what = 'est', style = 'lisrel', nDigits = 1, layout = 'tree2', rotation = 2)
library(mediation)
mediator_model = lm(Mediator ~ Treatment, data=d)
outcome_model = lm(Target ~ Treatment + Mediator, data=d)
med_result = mediate(model.m = mediator_model, model.y = outcome_model, treat = 'Treatment', mediator = 'Mediator')
summary(med_result)
dmed = data_frame(d0_sim=c(med_result$d0.sims), d1_sim=c(med_result$d0.sims))
dmed %>%
gather(key=situation, value=effect) %>%
ggplot(aes(x=effect, group=situation)) +
geom_density(aes(fill=situation))
```
One thing to note about the above, despite not having a data set or statistical result, we engaged in attempting to understand causal relations among the variables under theoretical consideration. Even though we may use the tools of statistical analysis, the key things that separates SEM and common regression approaches is the *explicit* assumption of causal relations and the ability to test some causal claims[^bollenpearl].
#### Cavets about indirect effects
One should think very hard about positing an indirect effect, especially if there is no time component or obvious pseudo-causal path. If the effect isn't immediately obvious, or you are dealing with cross-sectional data, then you should probably just examine the direct effects. Unlike other standard explorations one might do with models (e.g. look for nonlinear relationships), the underlying causal connection and associated assumptions are more explicit in this context. Many models I've seen in consulting struck me as arbitrary as far as which covariate served as the mediator, required a fairly convoluted explanation for its justification, or ignored other relevant variables because the reasoning would have to include a plethora of indirect effects if it were to be logically consistent. Furthermore, I can often ask one or two questions and will discover that researchers are actually interested in interactions (i.e. moderation), rather than indirect effects.
This document will not get into models that have *moderated mediation* and *mediated moderation*. In my experience these are often associated with models that are difficult to interpret at best, or are otherwise not grounded very strongly in substantive concerns[^modmed]. However, there are times, e.g. in experimental settings, which surprisingly few SEM are applied to for reasons unknown, where perhaps it would be very much appropriate. It is left to the reader to investigate those particular complexities when the time comes.
Too often investigators will merely throw in these indirect effects and simply report whether the effect is significant or not. If you hide their results and ask if the indirect effect is positive or negative, I suspect they might not even know. Even if not used, I think the perspective of the potential outcomes framework can help one think appropriately about the situation, and avoid unnecessary complications. If one wants to *explore* possible indirect paths, there are better [approaches][Bayesian Networks] which are still rooted within the structural causal model framework, which we'll see shortly. In addition, a mediation model should always be compared to a simpler model with no mediation (e.g. via AIC).
#### Terminology issues
Some refer to models with indirect effects as <span class="emph">mediation</span> models[^kline_indvmed], and that terminology appears commonly in the SEM (and esp. psychology) literature[^buffer]. Many applied researchers just starting out with SEM often confuse the term with <span class="emph">moderation</span>, which is called an <span class="emph">interaction</span> in every other statistical modeling context. As you begin your SEM journey, referring to <span class="emph">indirect effects</span> and <span class="emph">interactions</span> will likely keep you clear on what you're modeling, and perhaps be clearer to those who may not be as familiar with SEM when you present your work.
In models with moderation, one will often see a specific variable denoted as the *moderator*. However, this is completely arbitrary in the vast majority of situations in which it is used. In a typical model that includes an interaction, it makes just as much sense to say that the $\mathcal{A} \rightarrow \mathcal{Y}$ relationship varies as a function of $\mathcal{B}$ as it does the $\mathcal{B} \rightarrow \mathcal{Y}$ relationship varies as a function of $\mathcal{A}$. Some suggest using the term *moderation* when only when there is a potential causal relation, almost always temporal in nature, such that the moderator variable precedes the variable in question. In addition, the moderator and the variable moderated must be *uncorrelated*[^wuzumbo]. While such restrictions would render almost every model I've ever seen that uses the term moderation as an ordinary interaction instead, the idea is that the moderator has a causal effect on the path between the variable in question and its target. This is why you see graphs like the one to the left below, which has no tie to any statistical model and generally confuses people more often than it illuminates (the *correct* graph is to the right). Again though, the actual model is statistically indistinguishable from a standard model including an interaction, so if you want to understand moderation, you simply need to understand what an interaction is.
<!-- <img src="img/moderation.png" style="display:block; margin: 0 auto; width:75%"> -->
```{r modpic, echo=FALSE, eval=F, cache=F, fig.align='right', out.width='50%'}
# whatever code diagrammer is producing for this will just screw any number of things up, including subsequent graphs; even running just one of them separately it will display a previous graph instead of the correct one. Have given up
# tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%",
# DiagrammeR::DiagrammeR("scripts/moderation0.gv", type='grViz', width=400, height=200)
# )
#
# tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%",
# DiagrammeR::DiagrammeR("scripts/moderation.gv", type='grViz', width=400, height=200)
# )
```
<div class='col2' text-align='center'>
<img src="img/mod_problematic.png" style="display:block; margin: 0 auto; float:left;" width=75%>
<img src="img/mod_better.png" style="display:block; margin: 0 auto; float:left;" width=75%>
</div>
<br>
Like SEM relative to standard modeling techniques, so too is mediation and moderation to indirect effects and interactions. Statistically they are indistinguishable models. Theoretically however, there is an underlying causal framework, and in some cases specific causal claims can be tested (e.g. via simulation).
Recently I have also heard the phrase 'spotlight analysis' to refer to a categorical by continuous variable interaction. Please don't apply new terms to very old techniques in an attempt to make a model sound more interesting[^psychlit]. The work should speak for itself and be as clearly expressed as possible.
The gist is that if you want to talk about your relations using the terms 'mediation' and 'moderation' you'll want to have specific and explicit causal relations among the variables. If you are unsure, then the answer is clearly that you don't have that situation. However, you can still incorporate interactions as usual, and explore potential indirect effects by modeling the correlations, e.g. leaving the indirect path undirected.
#### Aside: Instrumental variables
Path analysis of a different sort has a long history in econometrics. A very common approach one may see utilized is <span class="emph">instrumental variable</span> analysis. The model can be depicted graphically as follows:
```{r ivmodel, echo=FALSE, eval=F, cache=FALSE, fig.align='center'}
tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%",
DiagrammeR::grViz('scripts/iv.gv', width=500, height=250))
```
<img src="img/iv.png" style="display:block; margin: 0 auto;" width=40%>
In the graph, $\mathcal{Y}$ is our target variable, $\mathcal{X}$ is the causal variable of interest, and $\mathcal{Z}$ the instrument. The instrumental variable must not be correlated with disturbance of the target variable (later depicted as $\mathcal{U}$), and only has an indirect relation to $\mathcal{Y}$ through $\mathcal{X}$. Neither target nor causal variable are allowed to effect the instrument directly or indirectly. In graph theory, this involves the notion of conditional independence we've mentioned before, where $\mathcal{Z}$ is conditionally independent of Y given X, or in other words, if we remove the $\mathcal{X} \rightarrow \mathcal{Y}$ path, $\mathcal{Z}$ and $\mathcal{Y}$ are independent. Furthermore, $\mathcal{Z}$ is *not* conditionally independent of $\mathcal{X}$ with the removal of that path[^direct].
The motivating factor for such a model is that the simple $\mathcal{X} \rightarrow \mathcal{Y}$ path may not provide a means for determining a causal relationship, particularly if $\mathcal{X}$ is correlated with the error. As an example[^CamTriv], if we look at the relationship of earnings with education, the unspecified causes of earnings might include things, e.g. some general 'ability', that would be correlated with the potential causal variable of education attained. With error $\mathcal{U}$ we can show the following diagram that depicts the problem.
```{r ivxyu, echo=FALSE, eval=TRUE, cache=FALSE, fig.align='center'}
tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%",
DiagrammeR::grViz('scripts/ivxyu.gv', width=500, height=250))
```
<br>
When $\mathcal{X}$ is endogenous, that is not the case. Changes in $\mathcal{X}$ are associated with $\mathcal{Y}$, but also $\mathcal{U}$.
<br>
```{r ivxyu2, echo=FALSE, eval=TRUE, cache=FALSE, fig.align='center'}
tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%",
DiagrammeR::grViz('scripts/ivxyu2.gv', width=500, height=250))
```
<br>
There are now multiple effects of $\mathcal{X}$ on $\mathcal{Y}$, both directly from $\mathcal{X}$ and indirectly from $\mathcal{U}$. We can measure the association between $\mathcal{X}$ and $\mathcal{Y}$, but not the causal relationship.
An attempt can be made to remedy the problem by using an instrument, leading the following graph.
<br>
```{r ivxyzu, echo=FALSE, eval=TRUE, cache=FALSE, fig.align='center'}
tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%",
DiagrammeR::grViz('scripts/ivxyzu.gv', width=500, height=250))
```
<br>
The instrument $\mathcal{Z}$ affects $\mathcal{X}$ and thus is correlated with $\mathcal{Y}$ indirectly. However, unlike $\mathcal{X}$, it is uncorrelated with $\mathcal{U}$. This assumption excludes $\mathcal{Z}$ from having a direct effect in the
model for $\mathcal{Y}$. If instead $\mathcal{Y}$ depended on both $\mathcal{X}$ and $\mathcal{Z}$ but only $\mathcal{X}$ has a direct path to $\mathcal{Y}$, then $\mathcal{Z}$ is basically absorbed into the error so would be correlated with it. With the direct path from $\mathcal{Z}$ to $\mathcal{Y}$ not allowed, we are left with only its influence on $\mathcal{X}$ .
Such models were commonly fit with a technique known two-stage least squares, which is demonstrated here. I use a subset of the <span class="objclass">CigarettesSW</span> data set and <span class="func">ivreg</span> function from the <span class="pack">AER</span> package for comparison. The data concerns cigarette consumption for the 48 continental US States from 1985–1995, though only the last year is used for our purposes. Here we examine the effect on packs per capita (packs) of price (`rprice`), using state level income (`rincome`) as the instrumental variable. See the help file for the <span class="objclass">CigarettesSW</span> for more on this data.
```{r ivreg, echo=-c(1:7, 10:11, 22), cache=TRUE}
# library(AER) # don't use AER directly since AER loads several plus their deps including MASS
data("CigarettesSW", package = "AER")
CigarettesSW$rprice = with(CigarettesSW, price/cpi)
CigarettesSW$rincome = with(CigarettesSW, income/population/cpi)
CigarettesSW$tdiff = with(CigarettesSW, (taxs - tax)/cpi)
Cigarettes = select(CigarettesSW, state, rprice, rincome, tdiff, tax, cpi, packs, year) %>%
filter(year=='1995')
head(Cigarettes) # raw, logged values actually used in models
Ciglog = select(Cigarettes, packs, rprice, rincome) %>%
mutate_all(log)
Cigarettes = Ciglog
# explicit 2sls
mod0 = lm(rprice ~ rincome , data = Cigarettes)
lm(packs ~ fitted(mod0), data = Cigarettes)
## using AER package
ivmod = AER::ivreg(packs ~ rprice | rincome, data = Cigarettes)
summary(ivmod)
ymod = lm(packs ~ rincome, data = Cigarettes)
```
We can see that these both produce the same results, though actually doing the piecemeal regression would lead to problematic standard errors. However, by using the fitted values from the regression of $\mathcal{X}$ on $\mathcal{Z}$, we are now able to essentially get the effect of $\mathcal{X}$ untainted by $\mathcal{U}$. To see more clearly where the effect comes from, the coefficient of regressing `packs` on `income` is `r round(coef(ymod)[2], 3)`, and dividing that by the coefficient of `price` on `income`, `r round(coef(mod0)[2], 3)`, will provide the ultimate effect of price on the pack of cigarettes that we see above.
We can run the path model in <span class="pack">lavaan</span> as well. Note the slight difference between what we'd normally code up for a path model. In order to get the same results as instrumental variable regression, we have to explicitly add an estimate of the packs disturbance with price.
```{r lavaan_ivreg, cache=TRUE}
mod = '
packs ~ rprice
rprice ~ rincome
packs ~~ rprice
'
modiv = sem(mod, data=Ciglog, meanstructure = T, se = 'robust')
summary(modiv)
```
While instrumental variable regression is common, this is perhaps not a typical path model seen in SEM. With SEM, a model comparison approach might be taken to compare models of competing complexity. One issue not discussed is where instruments come from. Generally data is not collected with acquiring them in mind, so it might be very difficult to use such an analysis. Furthermore, the structured causal framework that includes SEM would generally be a more flexible approach.
##### Heckman Selection Model
A different kind of instrumental variable model comes into play with sample selection models. Very common in econometrics or those exposed to it, and practically nowhere else, in this situation, our dependent variable is censored in some fashion, such that we only observe some the population. The classic, and evidently only, textbook example you'll find on this regards the wages of working women. Here is the description from the Stata manual.
> In one classic example, the first equation describes the wages of women. Women choose whether to work, and thus, from our point of view as researchers, whether we observe their wages in our data. If women made this decision randomly, we could ignore that not all wages are observed and use ordinary regression to fit a wage model. Such an assumption of random participation, however, is unlikely to be true; women who would have low wages may be unlikely to choose to work, and thus the sample of observed wages is biased upward. In the jargon of economics, women choose not to work when their personal reservation wage is greater than the wage offered by employers. Thus women who choose not to work might have even higher offer wages than those who do work—they may have high offer wages, but they have even higher reservation wages. We could tell a story that competency is related to wages, but competency is rewarded more at home than in the labor force.
So the primary problem here again is bias. Now you may say- but I'm not interested in the generalizing to all women[^modelmissing], or, if I were really worried about exact inference there are a whole host of other things being ignored, but the economists will have none of it[^vogons]. Conceptually, the model is as follows:
$$ y = X\beta + \epsilon$$
$$\mathrm{observed} = Z\gamma + \nu$$
$$\begin{pmatrix}
\epsilon \\ \nu
\end{pmatrix}
\sim N\begin{bmatrix}
\begin{pmatrix}
0 \\ 0
\end{pmatrix},
\begin{pmatrix}
\sigma & \rho\\
\rho & 1
\end{pmatrix}\end{bmatrix}$$
There are actually two models to consider- one that we care about, and one regarding whether we observe the target variable $y$ or not. As before, we have the correlated errors issue, and, as with instrumental variable regression, we can take a two step approach here, albeit a bit differently. In the one model, we will predict whether the data is observed or not using a probit model. Then we'll use a variable based on the fits from that model in the next step. Note these are not the fitted values, but the <span class="emph">inverse Mills ratio</span>, or in survival model lingo, the nonselection hazard, based on them. I show results using the <span class="pack">sampleSelection</span> package, and how to duplicate the process with standard R model functions.
```{r heckman}
wages = haven::read_dta('data/heckman_women.dta')
library(sampleSelection)
wages$select_outcome = factor(!is.na(wages$wage), labels=c('not', 'observed'))
selection_model = selection(select_outcome ~ married + children + education + age,
wage ~ education + age,
data=wages, method='2step')
summary(selection_model)
# probit model
probit = glm(select_outcome ~ married + children + education + age,
data=wages,
family=binomial(link = 'probit'))
# setup for wage model
probit_lp = predict(probit)
mills0 = dnorm(probit_lp)/pnorm(probit_lp)
imr = mills0[wages$select_outcome=='observed']
lm_select = lm(wage ~ education + age + imr, data=filter(wages, select_outcome=='observed'))
summary(probit)
summary(lm_select)
```
Dividing the coefficient for `imr` by $\sigma$ will give you a rough estimate of $\rho$, but the $\sigma$ from the package results is the <span class="func">lm</span> <span class="objclass">sigma</span> plus an extra component, so the rough estimate of $\rho$ from just the <span class="func">lm</span> results will be a bit larger.
Unfortunately there isn't a real good way to formulate this in SEM. Conceptually it's still an instrumental variable type of model, but the complete separation of the data, where part of the model only exists for one category means that standard approaches won't estimate it. Probably a good thing in the end. You could however estimate the IMR, and do the second step only using robust standard errors. It still won't be exactly the same as the <span class="pack">sampleSelection</span> result, but close. If you'd like to see more detail, check [this out](https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/heckman_selection.R).
```{r heckman_misc, echo=FALSE, eval=FALSE}
heckmodel_sem = '
wage ~ education + age + select_outcome
select_outcome ~ married + children + education + age
wage ~~ select_outcome
select_outcome ~~ 1*select_outcome
'
```
#### Aside: Nonrecursive Models
Recursive models have all unidirectional causal effects and disturbances are not correlated. A model is considered nonrecursive if there is a reciprocal relationship, feedback loop, or correlated disturbance in the model[^nonrecursiveName]. Nonrecursive models are potentially problematic when there is not enough information to estimate the model (unidentified model), which is a common concern with them.
A classic example of a nonrecursive relationship is marital satisfaction: the more satisfied one partner is, the more satisfied the other, and vice versa (at least that's the theory). This can be represented by a simple model (below).
```{r nonRecursive_old, eval=FALSE, echo=F, fig.align='center', fig.height=2, dev='svg'}
WifeSatisfaction = rnorm(100)
HusbandSatisfaction = rnorm(100)
WifeEdu = rnorm(100)
HusbandEdu = rnorm(100)
FamIncome = rnorm(100)
model1 = '
WifeSatisfaction ~ .25*HusbandSatisfaction
HusbandSatisfaction ~ .5*WifeSatisfaction
WifeSatisfaction ~~ HusbandSatisfaction
HusbandSatisfaction ~~ 1*HusbandSatisfaction
WifeSatisfaction ~~ 1*WifeSatisfaction
'
m1 = sem(model1, data=data.frame(WifeSatisfaction, HusbandSatisfaction))
semPlot::semPaths(m1, layout='tree2', style='lisrel',rotation=2, covAtResiduals=F,
# whatLabels = 'par',
sizeMan=10, edge.label.cex=1.5, color='#ff5500', borders=F,
label.color='#ffffff') # simply doesn't plot nonrecursive models correctly
```
```{r nonrecursive, echo=F, cache=FALSE}
tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%",
DiagrammeR::grViz('scripts/nonrecursive.gv', width=500, height=250)
)
```
<br>
Such models are notoriously difficult to specify in terms of *identification*, which we will talk more about later. For now, we can simply say the above model would not even be estimated as there are more parameters to estimate (two paths, two variances) than there is information in the data (two variances and one covariance).
To make this model identified, we use <span class="emph">instrumental variables</span> as we did before. In this scenario, instrumental variables directly influence one of the variables in a recursive relationship, but not the other. For example, a wife’s education can influence her satisfaction directly and a husband’s education can influence his satisfaction directly, but a husband’s education cannot directly impact a wife’s satisfaction and vice versa (at least for this demonstration). These instrumental variables can indirectly impact a spouses’ satisfaction. In the following model family income is also included. The dashed line represents an unanalyzed correlation.
```{r nonRecursive2_old, eval=FALSE, echo=F, fig.align='center', dev='svg'}
model2 = '
WifeSatisfaction + HusbandSatisfaction ~ FamIncome
FamIncome ~ WifeEdu + HusbandEdu
WifeSatisfaction ~ HusbandSatisfaction + WifeEdu
HusbandSatisfaction ~ WifeSatisfaction + HusbandEdu
'
m2 = sem(model2, data=data.frame(WifeSatisfaction, HusbandSatisfaction, WifeEdu, HusbandEdu, FamIncome))
semPlot::semPaths(m2, layout='tree2', rotation=2, style='lisrel', curvePivot=T, nCharNodes=3,
sizeMan=10, edge.label.cex=1.5, color='#ff5500', borders=F,
label.color='#ffffff') # simply doesn't plot res cov
```
```{r nonrecursive2, echo=FALSE, cache=FALSE}
tags$div(style="width:500px; margin:0 auto; font-family:Roboto; font-size:50%",
DiagrammeR::grViz('scripts/nonrecursive2.gv', width=500, height=250)
)
```
<br>
Many instances of nonrecursive models might better be represented by a correlation. One must have a very strong theoretical motivation for such models, which is probably why they aren't seen as often in the SEM literature, though they are actually quite common in some areas such as economics, where theory is assumed to be stronger (by economists).
#### Aside: Tracing rule
In a recursive model, <span class="emph">implied correlations</span> between two variables, X1 and X2, can be found using <span class="emph">tracing rules</span>. Implied correlations between variables in a model are equal to the sum of the product of all standardized coefficients for the paths between them. Valid tracings are all routes between X1 and X2 that a) do not enter the same variable twice, and b) do not enter a variable through an arrowhead and leave through an arrowhead. The following examples assume the variables have been standardized (variance values equal to 1), if standardization has not occurred the variance of variables passed through should be included in the product of tracings.
Consider the following variables, A, B, and C (in a data frame called <span class="objclass">abc</span>) with a model seen in the below diagram. We are interested in identifying the implied correlation between x and z by decomposing the relationship into its different components and using tracing rules.
```{r traceCoefsInit, echo=F}
# because of MASS screwing up other packages;
cormat = lazerhawk::create_corr(c(.2,.3,.7))
abc = MASS::mvrnorm(100, mu=rep(0,3), Sigma=cormat, empirical=TRUE)
colnames(abc) = LETTERS[1:3]
# psych::describe(abc)
```
<!-- something about semplot is screwy too, so don't cache it either -->
```{r traceCoefs, echo=1:6, cache=F, eval=T}
cor(abc)
model = "
C ~ B + A
B ~ A
"
pathMod = sem(model, data=abc)
coef(pathMod)
# semPlot::semPaths(pathMod, whatLabels = 'par', style = 'lisrel',
# sizeMan=10,
# edge.label.cex=1.25, edge.color='gray50',
# color='#1f65b780',
# borders=T, border.color='gray80', border.width=2,
# label.color='#ffffff', bg='transparent')
# starter code if you want to switch to Diagrammer
# lavaanPlot(model = pathMod,
# node_options=list(fontsize=6,
# color='gray80',
# style='filled',
# fontcolor='white',
# fillcolor='"#1f65b780"',
# fixedsize=2),
# edge_options = list(color = "gray50"))
```
<img src="img/trace_plot.png" style="display:block; margin: 0 auto;">
To reproduce the correlation between A and C (sometimes referred to as a 'total effect'):
- Corr = ac + ab * ac
- Corr = `r paste(round(coef(pathMod)[1],3), '+', round(prod(coef(pathMod)[2:3]),3))`
- Corr = `r coef(pathMod)[1] + prod(coef(pathMod)[2:3])`
In SEM models, it's important to consider how well our model-implied correlations correspond to the actual observed correlations. For *over-identified* models, the correlations will not be reproduced exactly, and thus can serve as a measure of how well our model fits the data. We'll discuss this more later. In addition, the tracing of paths is important in understanding the <span class="emph">structural causal models</span> approach of Judea Pearl, of which SEM and the potential outcomes framework are part of.
## Bayesian Networks
In many cases of path analysis, the path model is not strongly supported by prior research or intuition, and in other cases, people are willing to use <span class='emph'>modification indices</span> after the fact to change the paths in their model. This is unfortunate, as their model is generally *overfit* to begin with, and more so if altered in such an ad hoc fashion.
A more exploratory approach to graphical models is available however. <span class="emph">Bayesian Networks</span>, an approach within Pearl's structural causal model framework and which led to its development, are an alternative to graphical modeling of the sort we've been doing. Though they can be used to produce exactly the same results that we obtain with path analysis via maximum likelihood estimation, they can also be used for constrained or wholly exploratory endeavors as well, potentially with regularization in place to help reduce overfitting.
As an example, I use the McClelland data to explore potential paths via the <span class='pack'>bnlearn</span> package. I make the constraints that variables later in time do not effect variables earlier in time[^time_cause], no variables are directed toward background characteristics like sex, and at least for these purposes I keep math and reading at a particular time from having paths to each other. I show some of the so-called *blacklist* of constraints here. Note that we can also require certain paths be in the model via a *whitelist*.
```{r blacklist, echo=F, eval=-c(3:10), cache=T}
mcclellandNoCollege = select(mcclelland, -college, -college_missing) %>%
na.omit %>%
sapply(as.numeric) %>%
data.frame
blacklist = data.frame(from = c(rep(c('read21', 'math21'), each=7), rep(c('read7','math7'), each=5),
rep(c('vocab4', 'attention4'), each=3),
rep(c('adopted', 'male', 'momed'), times=2),
'math21','read21', 'math7','read7'),
to = c(rep(c('read7', 'math7', 'attention4', 'vocab4', 'adopted', 'male', 'momed'), times=2),
rep(c('attention4', 'vocab4', 'adopted', 'male', 'momed'), times=2),
rep(c('adopted', 'male', 'momed'), times=2),
c('male', 'momed', 'adopted'), c('momed', 'adopted', 'male'),
'read21', 'math21','read7','math7'))
blacklist2 = data.frame(from = c(rep(c('read21', 'math21'), each=7), rep(c('read7','math7'), each=5),
rep(c('vocab4', 'attention4'), each=3),
rep(c('adopted', 'male', 'momed'), times=2)),
to = c(rep(c('read7', 'math7', 'attention4', 'vocab4', 'adopted', 'male', 'momed'), times=2),
rep(c('attention4', 'vocab4', 'adopted', 'male', 'momed'), times=2),
rep(c('adopted', 'male', 'momed'), times=2),
c('male', 'momed', 'adopted'), c('momed', 'adopted', 'male')))
# library(bnlearn)
modelgs = bn.cv(mcclellandNoCollege, 'gs', algorithm.args = list(blacklist = blacklist, test='mi-g-sh'))
modeliamb = bn.cv(mcclellandNoCollege, 'iamb', algorithm.args = list(blacklist = blacklist, test='mi-g-sh'))
modelinter = bn.cv(mcclellandNoCollege, 'inter.iamb', algorithm.args = list(blacklist = blacklist, test='mi-g-sh'))
plot(modelgs, modeliamb, modelinter)
head(blacklist)
```
Now we can run the model. In this case we'll use the *Grow-Shrink* algorithm, which is one of the simpler available in the package, though others came to the same conclusion[^mcc_conclusion].
```{r bn, echo =1:2, fig.width=5, fig.height=5, fig.align='center', dev='svg', cache=F}
library(bnlearn)
model = gs(mcclellandNoCollege, blacklist = blacklist, test='mi-g-sh')
# plot(model)
library(visNetwork); library(igraph)
g = graph_from_edgelist(model$arcs)
g = toVisNetworkData(g)
# there is no way to use circle without the label fucking the size up
visNetwork(nodes=data.frame(g$nodes, value=100), edges=g$edges, width='75%') %>%
visNodes(shadow=list(enabled=T, color='rgba(0,0,0,0.25)'),
font=list(size=12, color='#fff'),
shape='circle',
scaling=list(label=list(enabled=T, min=10, max=12), max=5),
color=list(background='#ff5500', highlight=list(background='salmon'))) %>%
visEdges(color='#00aaff', arrows='to', smooth=list(enabled=T, forceDirection='horizontal')) %>%
visLayout(randomSeed=123)
# bn.fit(model, data=mcclellandNoCollege) # extract parameters, not shown
# model = gs(mcclellandNoCollege, blacklist = blacklist, test='mi-g-sh', undirected=TRUE)
# plot(model)
```
The plot of the model results shows that attention span at age 4 has no useful relationship to the other variables, something we'd already suspected based on previous models, and even could guess at the outset given its low correlations. As it has no connections, I've dropped it from the visualization. Furthermore, the remaining paths make conceptual sense. The parameters, fitted values, and residuals can be extracted with the <span class='func'>bn.fit</span> function, and other diagnostic plots, cross-validation and prediction on new data are also available.
We won't get into the details of these models except to say that one should have them in their tool box. And if one really is in a more exploratory situation, they would typically come with methods far better suited for the situation than the SEM tools, none of which I come across consider predictive ability very strongly. The discovery process with Bayesian networks can also be a lot of fun. Even if one has strong theory, nature is always more clever than we are, and you might find something interesting. See also the <span class="pack">bnpa</span>, <span class="pack">autoSEM</span>, and <span class="pack">regsem</span> packages for more.
## Undirected Graphs
So far we have been discussing directed graphs in which the implied causal flow tends toward one direction and there are no feedback loops. However, sometimes the goal is not so much to estimate the paths as it is to find the structure. <span class='emph'>Undirected graphs</span> simply specify the relations of nodes with edges, but without any directed arrows regarding the relationship. Such graphs are sometimes called <span class="emph">Markov Random Fields</span>.
While we could have used the <span class="pack">bnlearn</span> package for an undirected graph by adding the argument `undirected = T`, there are a slew of techniques available for what is often called <span class='emph'>network analysis</span>. Often the focus is on *observations*, rather than variables, and what influences whether one sees a tie/edge between nodes, with modeling techniques available for predicting ties (e.g. Exponential Random Graph models). Often these are undirected graphs, and that is our focus here, but they do not have to be.
### Network analysis
Networks can be seen everywhere. Personal relationships, machines and devices, various business and academic units... we can analyze the connections among any number of things. A starting point for a very common form of network analysis is an <span class='emph'>adjacency matrix</span>, which represents connections among items we wish to analyze. Often it contains just binary 0-1 values where 1 represents a connection between nodes. Any similarity matrix could potentially be used (e.g. a correlation matrix). Here is a simple example of an adjacency matrix:
```{r exampleAdjacency, echo=FALSE}
set.seed(1234)
adj = lazerhawk::create_corr(sample(0:1, 15, replace=TRUE, p=c(.6,.4)))
rownames(adj) = colnames(adj) = sort(c('Lagia','Nancy', 'David', "J'Sean",'Bernadette', 'Mancel'))
pander::pander(adj)
```
Visually, we can see the connections among the nodes.
```{r exampleAdjacencyVisualized, echo=FALSE, cache=FALSE}
# networkd3
# library(networkD3)
# library(scales)
#
# Links = tidyr::gather(data.frame(Source=rownames(adj), adj), key='Target', value='Value', -Source) %>%
# mutate(Target = factor(Target, levels=levels(Source)),
# Source = as.numeric(Source)-1,
# Target = as.numeric(Target)-1) %>%
# filter(Source!=Target) %>%
# # arrange(Source) %>%
# filter(Value==1)
# Nodes = data.frame(ID=rownames(adj), Group=1,Size=12)
# forceNetwork(Links=Links, Nodes=Nodes, Source='Source', Target='Target',
# Value='Value', NodeID='ID', Group='Group', linkColour='#BFBFBF', colourScale='#ff5500',#alpha('gray90',1)
# opacity=.8, fontSize=12, Nodesize='Size')
diag(adj) = 0
g = graph_from_adjacency_matrix(adj, mode='undirected')
visIgraph(g) %>%
visNodes(shadow=list(enabled=T, color='rgba(0,0,0,0.25)'),
font=list(size=12, color='#FF5500'),
shape='text',
scaling=list(label=list(enabled=T, min=10, max=12), max=5),
color=list(background='#ff5500', highlight=list(background='salmon'))) %>%
visEdges(color='#00aaff', smooth=list(enabled=T, forceDirection='horizontal')) %>%
visLayout(randomSeed=123)
```
As an example of a network analysis, let's look at how states might be more or less similar on a few variables. We'll use the <span class='objclass'>state.x77</span> data set in base R. It is readily available, no need for loading. To practice your R skills, use the function <span class='func'>str</span> on the state.x77 object to examine its structure, and <span class='func'>head</span> to see the first 6 rows, and `?` to find out more about it.
Here are the correlations of the variables.
```{r stateCorr, echo=FALSE, cache=FALSE, out.width='10%'}
tags$div(style="margin:0 auto; width:50%;",
corr_heat(cor(state.x77)))
```
<br>
The following depicts a graph of the states based on the variables of Life Expectancy, Median Income, High School Graduation Rate, and Illiteracy. The colors represent the results of a community detection algorithm, and serve to show that the clustering is not merely geographical, though one cluster is clearly geographically oriented ('the South').
```{r stateNetwork, echo=FALSE, cache=FALSE, fig.width=8}
library(magrittr)
d = data.frame(state.x77)
d %<>%
select(Life.Exp, Income, HS.Grad, Illiteracy)
distmat = as.matrix(dist(scale(d), diag=T, upper=T))
adj = 1/distmat
# networkd3
# library(networkD3)
# library(scales)
#
# stateLinks = tidyr::gather(data.frame(Source=rownames(adj), adj), key='Target', value='Value', -Source) %>%
# mutate(Target = stringr::str_replace_all(Target, '\\.', ' '), # since Hadley decided to put . for spaces without being asked
# Target = factor(Target, levels=levels(Source)),
# Source = as.numeric(Source)-1,
# Target = as.numeric(Target)-1) %>% #Value = round(Value*10)
# filter(Value!=Inf, Value>=.6)
# stateNodes = data.frame(ID=rownames(d), Region=as.numeric(factor(state.region)),
# Division=as.numeric(factor(state.division)))
# forceNetwork(Links=stateLinks, Nodes=stateNodes, Source='Source', Target='Target',
# Value='Value', NodeID='ID', Group='Region', linkColour='#BFBFBF', #alpha('gray90',1)
# opacity=.8, colourScale='d3.scale.category10()', fontSize=12, bounded=T, zoom=T, width=750)
diag(adj) = 0
adj[upper.tri(adj)] = 0
# g = graph_from_adjacency_matrix(adj[adj>=.6], mode='undirected', weighted=T)
edges = tidyr::gather(data.frame(Source=rownames(adj), adj), key='Target', value='Value', -Source) %>%
mutate(Target = stringr::str_replace_all(Target, '\\.', ' ')) %>%
filter(Value!=Inf, Value>=.6) %>%
rbind(., data.frame(Source=c('Alaska', 'Hawaii'), Target=c('Alaska', 'Hawaii'), Value=0.00001))
g = graph_from_data_frame(edges, directed=F)
E(g)$weight = edges$Value
comm = as.vector(membership(cluster_fast_greedy(g))) # cluster_edge_betweenness cluster_walktrap
# V(g)$color = c('coral','#1e90ff0', 'rgba(102, 2, 60,.5)', '#ff550080', 'salmon', '#66023C')[comm]
V(g)$group = comm
V(g)$size = rep(20, 50)
V(g)$label.cex = rep(.75, 50)
visIgraph(g, type='full')
# glist = toVisNetworkData(g)
# glist$nodes$group=comm
# visNetwork(nodes=data.frame(glist$nodes), edges=glist$edges) %>%
# visNodes(shadow=list(enabled=T, color='rgba(0,0,0,0.25)'),
# font=list (size=10, color='rgba(0,0,0,0.5)'),
# # shape='text',
# # scaling=list(label=list(enabled=T, min=10, max=12), max=5),
# group='group') %>%
# visEdges(color='#1e90ff') %>%
# visPhysics(stabilization=T, solver='repulsion') %>%
# visLayout(randomSeed=123)
# save(stateLinks, stateNodes, file='data/statesNetwork')
```
#### Understanding Networks
Networks can be investigated in an exploratory fashion or lead to more serious modeling approaches. The following is a brief list of common statistics or modeling techniques.
##### Centrality
- **Degree**: how many links a node has (can also be 'in-degree' or 'out-degree' for directed graphs)
- **Closeness**: how close a node is to other nodes
- **Betweenness**: how often a node serves as a bridge between the shortest path between two other nodes
- **PageRank**: From Google, a measure of node 'importance'
- **Hub**: a measure of the value of a node's links
- **Authority**: another measure of node importance
Characteristics of the network as a whole may also be examined, e.g. degree distribution, 'clusteriness', average path length etc.
##### Cohesion
One may wish to investigate how network members create communities and cliques. This is similar to standard cluster analysis used in other situations. Some nodes may be isolated.
##### Modeling
- ERGM: exponential random graph models, regression modeling for network data
- Other 'link' analysis
##### Comparison
A goal might be to compare multiple networks to see if they differ in significant ways.
##### Dynamics
While many networks are 'static', many others change over time. One might be interested in this structural change by itself, or modeling something like link loss. See the <span class="pack">ndtv</span> package for some nice visualization of such approaches.
## Summary
Graphical models are a general way to formulate and visualize statistical models. *All* statistical models can be developed within this framework. Structured causal models provide a means to posit causal thinking with graphical models, and structural equation models may be seen as a subset of those. Path analysis within SEM is a form of theoretically motivated graphical model involving only observed variables. These models might include indirect effects and multiple outcomes of interest, and can be seen as an extension of more familiar regression models to such settings. However, path analysis is just one of a broad set of graphical modeling tools widely used in many disciplines, any of which might be useful for a particular data situation.
## R packages used
- <span class="pack">lavaan</span>
- <span class="pack">mediation</span>
- <span class="pack">bnlearn</span>
- <span class="pack">network</span>
[^condind]: There are other assumptions at work also, e.g. that the model is correct and there are no other confounders.
[^pathOld]: Sewall Wright first used path analysis almost 100 years ago.
[^mvreg]: I use *multivariate* here to refer to multiple dependent variables, consistent with *multivariate analysis* generally. Some use it to mean multiple predictors, but since you're not going to see single predictor regression outside of an introductory statistical text, there is no reason to distinguish it. Same goes for *multiple* regression.
[^notmvreg]: Note this is just a syntax shortcut to running multiple models, not an actual 'multivariate' analysis.
[^oldMediation]: In the past people would run separate OLS regressions to estimate mediation models, particularly for the most simple, three variable case. One paper that for whatever reason will not stop being cited is Baron & Kenny 1986. *It was 1986*. Please do not do mediation models like this. You will always have more than three variables to consider, and always have access to R or other software that can estimate the model appropriately. While I think it is very helpful to estimate your models in piecemeal fashion for debugging purposes and to facilitate your understanding, use appropriate tools for the model you wish to estimate. Some packages, such as <span class='pack'>mediate</span>, may still require separate models, but there is far more going on 'under the hood' even then. For more recent work in this area, see the efforts of [Pearl](http://bayes.cs.ucla.edu/jp_home.html) and [Imai](http://imai.princeton.edu/research/index.html) especially.
[^linkedreg]: I've seen some balk at this sort of description, but the fact remains that SEM does not have any more tie to causality than any other statistical tool, other than what applied researchers ascribe to their results. As mentioned before, SEM does not discover causality or allow one to make causal statements simply by conducting it. It's just a statistical model.
[^mccnodescript]: There is a statement "All results controlled for background covariates of vocabulary at age 4, gender, adoption status, and maternal education level." and a picture of only the three-variable mediation model. If you are surprised at this lack of information, you may not be familiar with the field of psychological research.
[^psychlit]: Unless you want to signal to people who are statistically knowledgeable that they should stop paying attention to your work. In general, it's probably best to avoid using the applied literature as a basis for how to conduct or report methods. While excellent examples may be found, many fields have typically ignored their own quantitative methodologists and statisticians for decades. Psychology is rather notorious in this regard, while I might point to political science as a counter example, in the sense that the latter has long had one of its most popular journals a methodological one.
[^kline_indvmed]: Kline distinguishes indirect effects vs. mediation. I don't, because I don't believe SEM to have any inherently greater ability to discern causality than any other framework, and it doesn't. I do like his suggestion that mediation be reserved for effects involving time precedence, and defaulting to 'indirect effects' should keep you out of trouble. But even then, and as physics dictates, the arrow of time is inconsequential on a macro level. All else being equal, your model will fit as well with time 2 $\rightarrow$ time 1 as time 2 $\leftarrow$ time 1.
[^bollenpearl]: Bollen & Pearl (2013) list myth #2 about SEM is that SEM and regression are essentially equivalent. While I agree that we can think of them differently, I don't find their exposition very persuasive. Their explanation of a structural model is no different than what you see for standard regression in econometrics and more rigorous statistics books. It's not clear to me where they think a 'residual' in regression comes from if not from 'unexplained causes' not included in the model (I actually prefer stating regression models as $y \sim \mathcal{N}(X\beta, \sigma^2)$ so as to make the data generating process explicit, a point mostly ignored in SEM depictions). Changing to 'equals' (=) to 'assignment' (:=) in the equation doesn't work if you consider that most programming languages use = for assignment. Adding a picture certainly doesn't change things, and as a consultant, I can say that expressing a model graphically first doesn't keep people from positing models that are very problematic from a causal perspective (and may actually encourage it). <br><br> The gist is that causality is something the researcher, not the data or the model, brings to the scientific table in an explicit fashion, something Bollen & Pearl even state clearly in their discussion of the first myth (that SEM aims to establish causal relations from correlations). Practically speaking, an SEM will produce exactly the same output whether you have strong causal claims or not.
[^modmed]: As an example, see if you can think of a situation in which an interaction between two variables on some outcome, which technically represents a *difference* in coefficient values, would only exist indirectly. Also realize that simply positing such a situation for two variables would mean you'd have to rule out several competing models of all possible direct and indirect paths between A, B, and their product term. Just for giggles you might also consider that either A or B might have more than two categories, and so must be represented by multiple dummy coded variables and multiple interaction terms. Have fun!
[^buffer]: Along with the notion of *buffering*, which is a near useless term in my opinion.
[^wuzumbo]: See for example, Wu & Zumbo (2008) Understanding and using mediators and moderators. ([link](http://link.springer.com/article/10.1007/s11205-007-9143-1))
[^potout_vs_counterfact]: The historical reference for this approach is with Rubin and even earlier with Neyman. Some may make the distinction between potential outcomes, which are *potentially* observable for everyone, and counterfactuals that are inherently unobservable.
[^imai_links]: See the package [vignette](ftp://cran.r-project.org/pub/R/web/packages/mediation/vignettes/mediation.pdf) and Imai's [papers](http://imai.princeton.edu/projects/mechanisms.html) for more detail