-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
253 lines (181 loc) · 6.9 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
---
output:
github_document:
pandoc_args: --webtex
bibliography: datancode.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
out.width = "70%",
fig.align = "center",
fig.width = 6,
fig.asp = .618
)
options(digits = 3)
pander::panderOptions("round", 3)
```
# Loglinear Models for Three-way tables
> _Note: This is not an `r` package but just a set of `r` functions aiming at one purpose._
## Overview
`R` codes to select the best models fitting three-way tables.
- `dplyr`
- `broom`
- `plyr`
packages would be used. Install these before use this set.
### `_common.R`
1. define `find_xname()`: used in `fitted_val.R`, `goodness_fit.R`
2. modify `broom::tidy.anova()` function
### `model_threeway.R`
function `model_loglin()`
- fit `glm` for given data
- for _every pair of independence model_
### `goodness_fit.R`
function `good_loglin()`
- for fitted `model_loglin`,
- compute **goodness-of-fit**
- implement with `purrr::map()` and `dplyr::bind_rows()`
### `fitted_val.R`
function `fit_loglin()`
- for fitted `model_loglin`,
- compute **fitted values** for _every pair of independence model_
- implement with `purrr::map()` and `plyr::join_all()`
***
Frow now on, this document will show how these functions can be applied.
1. call `tidyverse` and `broom` libraries, and the other functions with `r/_common.R`
2. fit with `model_loglin()`
3. goodness-of-fit with `good_loglin()`
4. choosing the best model based on the goodness-of-fit statistic
5. compute fitted values with `fitted_val`
## Loading Functions
```{r, message=FALSE, warning=FALSE}
# install.packages("httr")
library(httr)
git_api <- GET("https://api.github.com/repos/ygeunkim/loglinear3/git/trees/master?recursive=1")
stop_for_status(git_api)
repo_list <- unlist(lapply(content(git_api)$tree, "[", "path"), use.names = FALSE)
# R files in r folder -----------------------------
repo_list <- stringr::str_subset(repo_list, pattern = "^r/")
repo_list <- stringr::str_c("https://raw.githubusercontent.com/ygeunkim/loglinear3/master/", repo_list)
# source every file
sapply(repo_list, source)
```
## Beginning
<!-- ```{r} -->
<!-- source("r/_common.R") -->
<!-- ``` -->
`tidyverse` and `broom` are loaded. Here we should modify `tidy` function of @broom. `tidy.anova()` is not defined for `anova.glm`, so we should add more elements in `renamers` in the function. Using the original `tidy` gives `warning message`.
```
renamers <- c(
...
"Ref.df" = "ref.df",
"Deviance" = "deviance", # glm object
"Resid. Df" = "resid.df", # glm object
"Resid. Dev" = "resid.dev" # glm object
)
```
In addition, `find_xname()` function would be used later to construct the various independence models.
## Data: Alcohol, cigarette, and marijuana use {-}
Data from @Agresti:2012aa would be used.
> refers to a survey by the Wright State University School of Medicine and the United Health Services in Dayton, Ohio. The survey asked 2276 students in their final year of high school in a nonurban area near Dayton, Ohio, whether they had ever used alcohol, cigarettes, or maijuana.
```{r, message=FALSE}
(substance <-
read_delim("data/Substance_use.dat", delim = " ") %>%
mutate(alcohol = str_trim(alcohol)) %>%
mutate_if(is.character, factor) %>%
mutate_if(is.factor, fct_rev))
```
Long data format as above is easy to fit loglinear model. Against wide format, i.e. contingency table, try `tidyr::gather()`.
### Change to the long data format
For example, look at the below table.
```{r, message=FALSE}
(aids <- read_table("data/AIDS.dat"))
```
```{r}
aids %>%
gather(yes, no, key = symptom, value = count)
```
Finally, we get the long data.
## Fitting GLMs
### in hand
We can write every formula in hand. However, it is annoying.
```{r}
(subs_hierarchy <-
substance %>%
do(
indep = glm(count ~ alcohol + cigarettes + marijuana, data = ., family = poisson()),
ac_m = glm(count ~ alcohol + cigarettes + marijuana + alcohol:cigarettes,
data = ., family = poisson()),
amcm = glm(count ~ alcohol + cigarettes + marijuana + alcohol:marijuana + cigarettes:marijuana,
data = ., family = poisson()),
acamcm = glm(count ~ alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana,
data = ., family = poisson()),
acm = glm(count ~ alcohol * cigarettes * marijuana,
data = ., family = poisson())
))
```
### Defining function
Function can be defined for more general usage. See `r/model_threeway.R`
<!-- ```{r} -->
<!-- source("r/model_threeway.R") -->
<!-- ``` -->
`model_loglin()` fits loglinear model for every pair of independence model.
- `.data`: data
- `yname`: character, response variable
- `glm_fam`: random component of glm, option for `family` of `glm()`. Since we are fitting loglinear model, `poisson()` is set to be default.
It returns `tibble` with list element.
```{r}
(subs_hierarchy <- model_loglin(substance, yname = "count", glm_fam = poisson()))
```
Each list has `glm` object with `[[1]]`.
```{r}
subs_hierarchy$indep[[1]]
```
## Chi-square Goodness-of-fit tests
### Statistic
$$G^2 = 2\sum n_{ijk}\ln\frac{n_{ijk}}{\hat\mu_{ijk}}$$
$$X^2 = \sum\frac{(n_{ijk} - \hat\mu_{ijk})^2}{\hat\mu_{ijk}}$$
with **residual df = the number of cell count - the number of non-redundant parameters**
<!-- ```{r} -->
<!-- source("r/goodness_fit.R") -->
<!-- ``` -->
`good_loglin()` calculates the above statistic for given option `test`. `test = "LRT"` option of `anova.glm()` produces $G^2$. `test = Chisq` gives $X^2$. As noticed, this function _is designed to be_ applied with `purrr::map()` and `dplyr::bind_rows()`.
```{r}
(subs_good <-
subs_hierarchy %>%
map(good_loglin, test = "LRT") %>%
bind_rows()) %>%
pander::pander()
```
### Choosing the best model
From $G^2$, we compare reduced model to complex model
$$G^2(M_0 \mid M_1) = G^2(M_0) - G^2(M_1) \approx \chi^2\Big(df = df(M_0) - df(M_1)\Big)$$
```{r}
subs_good %>%
rename(alternative = model) %>%
group_by(resid.df) %>%
slice(which.min(resid.dev)) %>%
arrange(resid.df) %>%
mutate(
goodness = c(resid.dev[1], -diff(resid.dev)),
df_good = c(resid.df[1], -diff(resid.df))
) %>%
mutate(p_value = pchisq(goodness, df = df_good, lower.tail = FALSE)) %>%
pander::pander()
```
1. `(AC, AM, CM)` vs saturated model `(ACM)`: cannot reject $M_0$ with p-value `0.5408`, so we choose `(AC, AM, CM)`
2. `(A, CM)` vs `(AC, AM, CM)`: reject $M_0$
Thus, **we use the model `(AC, AM, CM)`**
## Fitted values
<!-- ```{r} -->
<!-- source("r/fitted_val.R") -->
<!-- ``` -->
`fit_loglin()` computes fitted values for each independent model. Use this function with `purrr::map()` and `plyr::join_all()`. In `plyr::join_all()`, every variable name including response is written.
```{r}
subs_hierarchy %>%
map(fit_loglin) %>%
plyr::join_all(by = c("count", "alcohol", "cigarettes", "marijuana")) %>%
pander::pander()
```
## References