-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
322 lines (237 loc) · 10.7 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
---
output:
github_document:
math_method:
engine: webtex
url: https://latex.codecogs.com/gif.image?
pandoc_args: --webtex=https://ibm.codecogs.com/gif.latex?
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# defm: Fit and simulate discrete binary exponential family models
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/defm)](https://CRAN.R-project.org/package=defm)
[![Grand total](https://cranlogs.r-pkg.org/badges/grand-total/defm)](https://CRAN.R-project.org/package=defm)
[![R-CMD-check](https://github.com/UofUEpiBio/defm/actions/workflows/r.yml/badge.svg)](https://github.com/UofUEpiBio/defm/actions/workflows/r.yml)
<!-- badges: end -->
Discrete exponential family models (DEFM) have a long tradition with extensive
development rooted in exponential random graph models (ERGMs.) Applicable to
any form o data that can be represented as binary arrays, DEFMs provide a way
to model jointly distributed binary variables.
This package, built on top of the C++ library [`barry`](https://github.com/USCbiostats/barry),
provides a computationally efficient implementation of this family of models.
# Installation
You can install `defm` using devtools:
``` r
devtools::install_github("UofUEpi/defm")
```
# Examples
## Example 1: Four outcomes
In this example, we will simulate a dataset that contains 1,000 individuals with
four different outcomes. The outcomes, 0/1 vectors, will be modeled
as Markov processes of order one. Future states of this 0/1 vector
are a function of the previous point in time. The following lines
simulate the baseline data:
```{r data-simulation}
library(defm)
# Simulation parameters
set.seed(1231)
n <- 5000L # Count of individuals
n_y <- 3L # Number of dependent variables
n_x <- 2L # Number of independent variables
# Simulating how many observations we will have per individuals
n_reps <- sample(3:10, n, replace = TRUE)
# Final number of rows in the data
n_t <- sum(n_reps)
# Simulating the data
Y <- matrix(as.integer(runif(n_y * n_t) < .1), ncol = n_y)
colnames(Y) <- paste0("y", 1:n_y - 1)
X <- matrix(rnorm(n_x * n_t), ncol = n_x)
colnames(X) <- paste0("x", 1:n_x - 1L)
id <- rep(1L:n, n_reps)
time <- unlist(sapply(n_reps, \(x) 1:x))
```
Here is a brief look at the data structure. Remember, we still have not actually
simulated data **WITH THE MODEL**.
```{r echo = FALSE}
knitr::kable(head(cbind(id, time, Y, X), n = 10), row.names = FALSE, digits = 2)
```
For this example, we will simulate a model with the following features:
- **Ones**: Baseline density (prevalence of ones), $\sum_{itk}y_{itk}$
- **Ones x Attr 2**: Same as before, but weighted by one of the covariates. (simil to fixed effect) $\sum_{itk}y_{itk}x_{it}$
- **Transition** : And a transition structure, in particular `y0 -> (y0, y1)`, $y_{i0}^0y_{i0}^{t+1} y_{i1}^{t+1}$.
In `defm`, transition statistics can be represented using matrices. In this case, the transition can be written as:
$$
\begin{array}{c}t\\t+1\end{array}\left[\begin{array}{ccc}1 & \cdot & \cdot \\ 1 & 1 & \cdot\end{array}\right]
$$
which in `R` is
```r
> matrix(c(1,NA_integer_,NA_integer_,1,1,NA_integer_), nrow = 2, byrow = TRUE)
# [,1] [,2] [,3]
# [1,] 1 NA NA
# [2,] 1 1 NA
```
The `NA` entries in the matrix mean that those can be either zero or one. In other words, only values different from `NA` will be considered for specifying the terms. Here is the factory function:
```{r factory-function}
# Creating the model and adding a couple of terms
build_model <- function(id., Y., X., order. = 1, par. = par.) {
# Mapping the data to the C++ wrapper
d_model. <- new_defm(id., Y., X., order = order.)
# Adding the model terms
term_defm_ones(d_model.)
term_defm_ones(d_model., "x1")
transition <- matrix(NA_integer_, nrow = order. + 1, ncol = ncol(Y.))
transition[c(1,2,4)] <- 1
term_defm_transition(d_model., transition)
# Initializing the model
init_defm(d_model.)
# Returning
d_model.
}
```
With this factory function, we will use it to simulate some data with the same
dimensions of the original dataset. In this case, the parameters used for the simulation
will be:
- **Ones**: -2, i.e., low density,
- **Ones x Attr 2**: 2, yet correlated with covariate \# 2,
- **Transition** : 5, And a high chance of observing the transition `y0 -> (y0, y1)`
```{r simulation}
sim_par <- c(-2, 2, 5)
d_model <- build_model(id, Y, X, order = 1L, par. = sim_par)
simulated_Y <- sim_defm(d_model, sim_par)
head(cbind(id, simulated_Y))
```
Now, let's see if we can recover the parameters using MLE:
```{r mle}
d_model_sim <- build_model(id, simulated_Y, X, order = 1, par. = sim_par)
ans <- defm_mle(d_model_sim)
summary(ans)
```
Or better, we can use `texreg` to generate a pretty output:
```{r mle-out, echo=FALSE, results='asis'}
summary_table(ans, as_texreg=TRUE) |>
texreg::knitreg(single.row = TRUE)
```
We can also see the counts
```{r example1-counts, echo=FALSE}
obs_stats <- get_stats(d_model_sim)
colnames(obs_stats) <- names(d_model_sim)
colnames(simulated_Y) <- paste0("y", 1:n_y - 1)
cbind(id, simulated_Y, X, obs_stats) |>
head(n = 10) |> knitr::kable(row.names = FALSE, digits = 2L)
```
Finally, we can also take a look at the distribution of the log odds. We calculate this by looking at changes in a single entry of the array. For example,
the log-odds of having $y_{12}: 0\to 1$, which are equivalent to
\renewcommand{\Pr}[1]{\mbox{Pr}\left(#1\right)}
\newcommand{\Prcond}[2]{\Pr{\left.#1\vphantom{#2}\;\right|#2\vphantom{#1}}}
$$
\log{\frac{\Prcond{y_{12} = 1}{\mathbf{y}_{-12}}}{\Prcond{y_{12} = 0}{\mathbf{y}_{-12}}}} = \mbox{logit}\left(\Prcond{y_{12} = 1}{\mathbf{y}_{-12}}\right)
$$
We can use the `logodds` function for this:
```{r logodds}
lo <- logodds(d_model_sim, coef(ans), 1, 2)
hist(lo, main = "Distribution of Log-odds for term Y(1,2)",
col="tomato", border = "transparent", breaks = 100)
abline(v=0, lwd = 2, lty = 2, col = "steelblue")
```
## Example 2: A fun model
For fun, imagine that we want to describe a process in which an individual
moves sequentially through a set of states. In this example, there are
ten different `y` variables, but the person can only have one of them as
active (equal to one.) We can simulate such data using DEFM.
We first need to generate the baseline data for the simulation.
This involves creating a matrix of size 20x10 (so we have 20 time points,)
filled with zeros in all but the first entry:
```{r example2-sim}
n <- 20L
n_y <- 10L
id <- rep(1L, n)
Y <- matrix(0L, nrow = n, ncol = n_y)
Y[1] <- 1L
X <- matrix(0.0, nrow = n, ncol = 1)
```
With the data in hand, we can now simulate the process. First, we need
to build the model. The key component of the model will be the transition
matrices:
<!-- $$
\left[\begin{array}{cccc}
1 & 0 & NA & \dots \\
0 & 1 & NA & \dots
\end{array}\right] \dots
\left[\begin{array}{cccccc}
\dots & NA & 1 & 0 & NA & \dots \\
\dots & NA & 0 & 1 & NA & \dots
\end{array}\right] \dots
\left[\begin{array}{cccc}
\dots & NA & 1 & 0 \\
\dots & NA & 0 & 1
\end{array}\right]
$$ -->
<div align="center"><img style="background: white;" src="https://render.githubusercontent.com/render/math?math=%5Cleft%5B%5Cbegin%7Barray%7D%7Bcccc%7D%0A1%20%26%200%20%26%20NA%20%26%20%5Cdots%20%5C%5C%0A0%20%26%201%20%26%20NA%20%26%20%5Cdots%0A%5Cend%7Barray%7D%5Cright%5D%20%5Cdots%20%0A%5Cleft%5B%5Cbegin%7Barray%7D%7Bcccccc%7D%0A%5Cdots%20%26%20NA%20%26%201%20%26%200%20%26%20NA%20%26%20%5Cdots%20%20%5C%5C%0A%5Cdots%20%26%20NA%20%26%200%20%26%201%20%26%20NA%20%26%20%5Cdots%20%20%0A%5Cend%7Barray%7D%5Cright%5D%20%5Cdots%20%0A%5Cleft%5B%5Cbegin%7Barray%7D%7Bcccc%7D%0A%5Cdots%20%26%20NA%20%26%201%20%26%200%20%5C%5C%0A%5Cdots%20%26%20NA%20%26%200%20%26%201%20%0A%5Cend%7Barray%7D%5Cright%5D"></div>
Let's take a look at the process:
```{r}
# Creating a new instance of a DEFM object
d_model <- new_defm(id = id, Y = Y, X = X, order = 1)
# Creating the transition terms, these
for (i in (1:(n_y - 1) - 1)) {
transition <- matrix(NA_integer_, nrow = 2, ncol = n_y)
transition[c(1:4) + 2 * i] <- c(1,0,0,1)
term_defm_transition(d_model, transition)
}
# Here is the last transition term
transition <- matrix(NA_integer_, nrow = 2, ncol = n_y)
transition[c(n_y * 2 - 1, n_y * 2, 1, 2)] <- c(1,0,0,1)
term_defm_transition(d_model, transition)
# Adding a term of ones
term_defm_ones(d_model)
# Initializing and simulating
init_defm(d_model)
Y_sim <-sim_defm(d_model, par = c(rep(100, n_y), -10))
```
The simulation should produce a nice-looking figure:
```{r chain1fig, echo = FALSE}
library(Matrix)
Matrix::image(as(Y_sim, "dgCMatrix"))
```
## Example 3: Using formulas for transitions
In this example, we will redo the previous model, but now use formulas for
specifying the transitions:
```{r}
d_model_formula <- new_defm(id = id, Y = Y, X = X, order = 1)
# We can use text formulas to add transition terms
d_model_formula <- d_model_formula +
"{y0, 0y1} > {0y0, y1}" +
"{y1, 0y2} > {0y1, y2}" +
"{y2, 0y3} > {0y2, y3}" +
"{y3, 0y4} > {0y3, y4}" +
"{y4, 0y5} > {0y4, y5}" +
"{y5, 0y6} > {0y5, y6}" +
"{y6, 0y7} > {0y6, y7}" +
"{y7, 0y8} > {0y7, y8}" +
"{y8, 0y9} > {0y8, y9}" +
"{0y0, y9} > {y0, 0y9}"
d_model_formula |>
term_defm_ones()
init_defm(d_model_formula)
# Inspecting
d_model_formula
# Simulating
(Y_sim_formula <- sim_defm(d_model_formula, par = c(rep(20, n_y), -10)))
```
The new simulation...
```{r chain2fig, echo = FALSE}
image(as(Y_sim_formula, "dgCMatrix"))
```
# Acknowledgement
This work was supported by the Assistant Secretary of Defense for Health Affairs endorsed by the Department of Defense, through the Psychological Health/Traumatic Brain Injury Research Program Long-Term Impact of Military-Relevant Brain Injury Consortium (LIMBIC) Award/W81XWH-18-PH/TBIRP-LIMBIC under Award No. I01 RX003443. The U.S. Army Medical Research Acquisition Activity, 839 Chandler Street, Fort Detrick MD 21702-5014 is the awarding and administering acquisition office. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the U.S. Government, the U.S. Department of Veterans Affairs or the Department of Defense and no official endorsement should be inferred.
# Code of Conduct
Please note that the `defm` project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.