(Thank you @anatomecha for the sweet hex)
The goal of simeta is to make it easier to simulate meta-analysis data.
These are a sometimes-useful, modular set of tools for simulating meta-analysis data, mostly developed during my doctoral project. For technical details, see the associated manuscript (freely accessible arXiv version):
Charles T. Gray. “code::proof: Prepare for Most Weather Conditions”. In: Statistics and Data Science. Communications in Computer and Information Science (2019). Ed. by Hien Nguyen, pp. 22–41. DOI: 10.1007/978-981-15-1960-4_2
In particular, I’m often interested in simulating meta-analysis data for different values of:

- $k$, number of studies
- $\tau^2$, variation between studies
- $\varepsilon^2$, variation within a study
- numbers of trials, say 10, 100, 1000
- distributions and parameters; e.g., $\exp(\lambda = 1)$ and $\exp(\lambda = 2)$.
This readme is not comprehensive; it is updated as collaborators need scripts and examples.
Have ideas? Want to contribute? Want to request a workflow? Create an issue.
You can install simeta from GitHub with:
# install.packages("devtools")
devtools::install_github("softloud/simeta")
# packages
library(simeta)
library(tidyverse)
library(knitr)
library(metafor)
library(targets)
# so these results are reproducible
set.seed(306)
Suppose we are interested in comparing how variation between studies and overall sample size influence the likelihood of significance in meta-analyses with small to large effects.
# set default parameters, useful to store as an object for visualisation
# labelling
# set effect ratios of interest
sim_effect_ratio <- c(1, 1.5)
# set desired variance between studies
sim_tau_sq <- c(0, 0.2)
# set minimum sample size per study
sim_min_n <- 5
# set maximum sample size per study
sim_max_n <- 150
Use sim_df to set up a dataframe of simulation parameters, wherein every row represents one combination of simulation parameters.
# see ?sim_df for what other things can be specified
sim_parameters <- sim_df(
# choose three default distributions to sample from to keep example small
dist_df = default_parameters %>% sample_n(3),
tau_sq = sim_tau_sq,
effect_ratio = sim_effect_ratio,
min_n = sim_min_n,
max_n = sim_max_n
)
Use sim_samples to create a dataframe where each row of the simulation-parameter dataframe is repeated trials times, and a new list-column of meta-analysis samples is generated using the row-level simulation parameters.
# Generate simulated meta-analyses from simulation parameters dataframe
# Number of trials represents number of repeated rows per simulation parameter
# set
samples_df <-
sim_samples(
measure = "mean",
measure_spread = "sd",
sim_dat = sim_parameters,
# small for the purposes of example
trials = 3
)
# take a look at a few rows
samples_df %>% sample_n(5)
#> # A tibble: 5 × 8
#> k tau_sq_true effect_ratio rdist parameters n sim_id sample
#> <dbl> <dbl> <dbl> <chr> <list> <list> <chr> <list>
#> 1 7 0.2 1 pareto <named list> <tibble> sim 13 <tibble>
#> 2 7 0 1.5 pareto <named list> <tibble> sim 22 <tibble>
#> 3 7 0 1 pareto <named list> <tibble> sim 4 <tibble>
#> 4 3 0 1.5 norm <named list> <tibble> sim 21 <tibble>
#> 5 3 0.2 1.5 pareto <named list> <tibble> sim 28 <tibble>
# example simulated meta-analysis dataset
samples_df %>% sample_n(1) %>% pluck("sample") %>% kable()
# Generate meta-analyses and extract a parameter of interest, such as the p-value.
sim_metafor <-
samples_df %>%
# making small for purposes of example (simulations scale fast!)
sample_n(20) %>%
mutate(
rma = map(sample,
function(x) {
metafor::rma(data = x,
measure = "SMD",
m1i = effect_c,
sd1i = effect_spread_c,
n1i = n_c,
m2i = effect_i,
sd2i = effect_spread_i,
n2i = n_i
)}
)
) %>%
# extract pvalues
mutate(
p_val = map_dbl(rma, pluck, "pval")
)
# take a look
sim_metafor %>%
mutate(p_val = round(p_val, 2)) %>%
select(p_val, everything()) %>%
# see first three rows
head(3)
#> # A tibble: 3 × 10
#> p_val k tau_sq_true effect_r…¹ rdist parameters n sim_id sample
#> <dbl> <dbl> <dbl> <dbl> <chr> <list> <list> <chr> <list>
#> 1 0.03 3 0 1 lnorm <named list> <tibble> sim 2 <tibble>
#> 2 0.41 3 0.2 1.5 lnorm <named list> <tibble> sim 29 <tibble>
#> 3 0.02 7 0.2 1.5 pare… <named list> <tibble> sim 31 <tibble>
#> # … with 1 more variable: rma <list>, and abbreviated variable name
#> # ¹effect_ratio
I find I run into memory problems very quickly. The targets package can help. For example, the following setup samples from several distributions, each with different parameter sets, which combinatorially increases the number of simulations.
default_parameters %>%
kable(digits = 2)
dist | par |
---|---|
pareto | 2, 1 |
norm | 50, 17 |
lnorm | 4.0, 0.3 |
exp | 10 |
pareto | 3.576119, 2.745808 |
norm | 75.209383, 6.739041 |
lnorm | 2.3900182, 0.3383603 |
exp | 4.86717 |
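Each step of the workflow above can become a pipeline target, cached on disk, so re-runs only recompute what changed. A minimal _targets.R sketch for this kind of workflow (an illustration, not the package's actual pipeline script; target names are mine):

```r
# _targets.R: a minimal sketch, not simeta's actual pipeline script
library(targets)

# make the simulation packages available to all targets
tar_option_set(packages = c("simeta", "dplyr", "purrr"))

list(
  # one row per combination of simulation parameters
  tar_target(
    sim_parameters,
    sim_df(
      tau_sq = c(0, 0.2),
      effect_ratio = c(1, 1.5)
    )
  ),
  # repeat each parameter set over trials and simulate samples
  tar_target(
    trial_samples,
    sim_samples(
      measure = "mean",
      measure_spread = "sd",
      sim_dat = sim_parameters,
      trials = 3
    )
  )
)
```

Running targets::tar_make() builds the pipeline; tar_load() then retrieves individual targets, as in the example further down.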
See _targets.R for an example script, adapting the workflow above, that produces this visualisation. Each point in this plot represents one p-value from a meta-analysis of a randomly generated dataset.
include_graphics("man/figures/example_sim.png")
include_graphics("man/figures/example_pipeline.png")
tar_load(trial_results)
tar_load(sim_df)
#> ✖ Identified no targets to load.
dim(trial_results)
#> [1] 21596 12
# count trials per simulation and join back to the trial-level results
trial_results %>%
group_by(sim_id) %>%
summarise(
n_trials = n()
) %>%
full_join(trial_results, by = "sim_id", multiple = "all") %>%
select(-p_value_result, -dist_label) %>%
head()
#> # A tibble: 6 × 11
#> sim_id n_trials k tau_sq_true effect_…¹ rdist parameters parti…² study…³
#> <chr> <int> <dbl> <dbl> <dbl> <chr> <list> <int> <fct>
#> 1 sim 1 100 3 0 1 pare… <named list> 90 3 stud…
#> 2 sim 1 100 3 0 1 pare… <named list> 90 3 stud…
#> 3 sim 1 100 3 0 1 pare… <named list> 90 3 stud…
#> 4 sim 1 100 3 0 1 pare… <named list> 90 3 stud…
#> 5 sim 1 100 3 0 1 pare… <named list> 90 3 stud…
#> 6 sim 1 100 3 0 1 pare… <named list> 90 3 stud…
#> # … with 2 more variables: p_value <dbl>, significant <lgl>, and abbreviated
#> # variable names ¹effect_ratio, ²participants, ³study_n_label
Without specification, the function uses the default parameters dataset (?default_parameters).
default_parameters %>% kable()
dist | par |
---|---|
pareto | 2, 1 |
norm | 50, 17 |
lnorm | 4.0, 0.3 |
exp | 10 |
pareto | 3.576119, 2.745808 |
norm | 75.209383, 6.739041 |
lnorm | 2.3900182, 0.3383603 |
exp | 4.86717 |
This dataset also provides a template for how to set up a dataframe specifying the distributions and parameters of interest for sim_df.
The default sampling distributions are designed to provide a mix of
common symmetric and asymmetric families, with both fixed and
randomly-generated parameters.
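For example, a custom dist_df could be set up like the following sketch. The dist and par column structure is mirrored from default_parameters; the parameter names follow base R's conventions for rexp() and rlnorm(), and are assumptions to check against ?default_parameters.

```r
# a sketch of a custom dist_df; structure mirrors default_parameters
# (dist: distribution name, par: list-column of its parameters)
library(tibble)

my_dist_df <- tribble(
  ~dist,   ~par,
  "exp",   list(rate = 1),
  "exp",   list(rate = 2),
  "lnorm", list(meanlog = 4, sdlog = 0.3)
)

# then pass it to sim_df via the dist_df argument:
# sim_df(dist_df = my_dist_df)
```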
sim_dat <-
# defaults to using default_parameters if we do not specify dist_df argument
sim_df(
# different effect sizes
# what is a small, medium, or large effect?
effect_ratio = sim_effect_ratio,
# what is a small, medium, or large variance?
tau_sq = sim_tau_sq,
min_n = sim_min_n,
max_n = sim_max_n
)
# take a look at the top of the dataset
sim_dat %>% head(3)
#> # A tibble: 3 × 7
#> k tau_sq_true effect_ratio rdist parameters n sim_id
#> <dbl> <dbl> <dbl> <chr> <list> <list> <chr>
#> 1 3 0 1 pareto <named list [2]> <tibble [6 × 3]> sim 1
#> 2 3 0 1 norm <named list [2]> <tibble [6 × 3]> sim 2
#> 3 3 0 1 lnorm <named list [2]> <tibble [6 × 3]> sim 3
# the end of the dataset
sim_dat %>% tail(3)
#> # A tibble: 3 × 7
#> k tau_sq_true effect_ratio rdist parameters n sim_id
#> <dbl> <dbl> <dbl> <chr> <list> <list> <chr>
#> 1 20 0.2 1.5 norm <named list [2]> <tibble [40 × 3]> sim 94
#> 2 20 0.2 1.5 lnorm <named list [2]> <tibble [40 × 3]> sim 95
#> 3 20 0.2 1.5 exp <named list [1]> <tibble [40 × 3]> sim 96
# take a look at a random handful of rows
sim_dat %>% sample_n(5)
#> # A tibble: 5 × 7
#> k tau_sq_true effect_ratio rdist parameters n sim_id
#> <dbl> <dbl> <dbl> <chr> <list> <list> <chr>
#> 1 3 0 1.5 pareto <named list [2]> <tibble [6 × 3]> sim 49
#> 2 20 0.2 1 norm <named list [2]> <tibble> sim 46
#> 3 7 0 1.5 exp <named list [1]> <tibble> sim 64
#> 4 3 0.2 1 lnorm <named list [2]> <tibble [6 × 3]> sim 27
#> 5 3 0 1 lnorm <named list [2]> <tibble [6 × 3]> sim 7
sim_df uses sim_n, as explained below, to create each dataset of sample sizes.
This is a function I have often wished I had on hand when simulating meta-analysis data. The thing is, when running, say, 1000 simulations, I want to do so over the same sample sizes. So, I need to generate the sample sizes for each study and for each group (control or intervention).
Given a specific $k$, sim_n generates a sample size for the control and intervention groups of each of the $k$ studies.
# defaults to k = 3
sim_n() %>% kable()
study | group | n |
---|---|---|
Estelmo_1960 | control | 89 |
Snaga_1953 | control | 20 |
Théoden_1959 | control | 76 |
Estelmo_1960 | intervention | 91 |
Snaga_1953 | intervention | 20 |
Théoden_1959 | intervention | 77 |
sim_n(k = 3) %>% kable()
study | group | n |
---|---|---|
Núneth_2002 | control | 68 |
Folcred_1957 | control | 105 |
Boromir_1957 | control | 14 |
Núneth_2002 | intervention | 64 |
Folcred_1957 | intervention | 93 |
Boromir_1957 | intervention | 14 |
# set k to a different value
sim_n(k = 6) %>% kable()
study | group | n |
---|---|---|
Gwaihir_2002 | control | 42 |
Amarië_1964 | control | 80 |
Nori_1964 | control | 42 |
Salmar_1987 | control | 92 |
Anárion_1973 | control | 62 |
Ulwarth_2007 | control | 95 |
Gwaihir_2002 | intervention | 43 |
Amarië_1964 | intervention | 67 |
Nori_1964 | intervention | 33 |
Salmar_1987 | intervention | 99 |
Anárion_1973 | intervention | 59 |
Ulwarth_2007 | intervention | 93 |
# control upper and lower bounds
sim_n(min_n = 3, max_n = 50) %>% kable()
study | group | n |
---|---|---|
Telchar_1972 | control | 14 |
Asgon_1998 | control | 14 |
Vëantur_1955 | control | 18 |
Telchar_1972 | intervention | 16 |
Asgon_1998 | intervention | 11 |
Vëantur_1955 | intervention | 18 |
We expect cohorts from the same study to be roughly the same size, with variation proportional to that size. We can control this proportion with the prop argument.
Suppose we wish to mimic data for which the cohorts are almost exactly the same size (say, because classes of undergrads were split in half, accounting for dropouts).
# small variation between sample sizes of studies
sim_n(k = 2, prop = 0.05, max_n = 50) %>% kable()
study | group | n |
---|---|---|
Beren_1979 | control | 28 |
Azog_2013 | control | 19 |
Beren_1979 | intervention | 2 |
Azog_2013 | intervention | 1 |
It can be useful, for more human-interpretable purposes, to display the sample sizes in wide format, as in the sketch below.
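One way to do this, using tidyr rather than anything in simeta:

```r
# pivot to one row per study, one column per group
library(tidyr)

sim_n(k = 3) %>%
  pivot_wider(names_from = group, values_from = n) %>%
  kable()
```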
I haven’t settled on a way of simulating data, and haven’t found heaps in the way of guidance yet. So, this is all a bit experimental. My guiding star is packaging what I’d use right now.
What I always end up doing, though, is generating a dataset that summarises what I would like to iterate over in simulation.
The sim_df function takes user inputs for distributions, numbers of studies, and between-study error, and returns a dataframe with one row per combination of simulation parameters.
# defaults
sim_df()
#> # A tibble: 216 × 7
#> k tau_sq_true effect_ratio rdist parameters n sim_id
#> <dbl> <dbl> <dbl> <chr> <list> <list> <chr>
#> 1 3 0 1 pareto <named list [2]> <tibble> sim 1
#> 2 3 0 1 norm <named list [2]> <tibble> sim 2
#> 3 3 0 1 lnorm <named list [2]> <tibble> sim 3
#> 4 3 0 1 exp <named list [1]> <tibble> sim 4
#> 5 3 0 1 pareto <named list [2]> <tibble> sim 5
#> 6 3 0 1 norm <named list [2]> <tibble> sim 6
#> 7 3 0 1 lnorm <named list [2]> <tibble> sim 7
#> 8 3 0 1 exp <named list [1]> <tibble> sim 8
#> 9 7 0 1 pareto <named list [2]> <tibble> sim 9
#> 10 7 0 1 norm <named list [2]> <tibble> sim 10
#> # … with 206 more rows
sim_df() %>% str(1)
#> tibble [216 × 7] (S3: tbl_df/tbl/data.frame)
# only consider small values of k
sim_df(k = c(2, 5, 7)) %>% str(1)
#> tibble [216 × 7] (S3: tbl_df/tbl/data.frame)
For the list-column of tibbles n, the sim_df function calls sim_n and generates a set of sample sizes based on the value in the column k.
demo_k <- sim_df()
# the variable n is a list-column of tibbles
demo_k %>% pluck("n") %>% head(3)
#> [[1]]
#> # A tibble: 6 × 3
#> study group n
#> <chr> <chr> <dbl>
#> 1 Mandos_2000 control 23
#> 2 Niënor_1955 control 62
#> 3 Yavanna_2015 control 17
#> 4 Mandos_2000 intervention 25
#> 5 Niënor_1955 intervention 70
#> 6 Yavanna_2015 intervention 19
#>
#> [[2]]
#> # A tibble: 6 × 3
#> study group n
#> <chr> <chr> <dbl>
#> 1 Aravir_1953 control 35
#> 2 Anborn_1982 control 74
#> 3 Elphir_1984 control 17
#> 4 Aravir_1953 intervention 34
#> 5 Anborn_1982 intervention 71
#> 6 Elphir_1984 intervention 17
#>
#> [[3]]
#> # A tibble: 6 × 3
#> study group n
#> <chr> <chr> <dbl>
#> 1 Aragost_1981 control 67
#> 2 Orcobal_1994 control 88
#> 3 Khîm_1954 control 15
#> 4 Aragost_1981 intervention 64
#> 5 Orcobal_1994 intervention 82
#> 6 Khîm_1954 intervention 16
# compare the number of rows in the dataframe in the n column with the k value
# divide by two because there are two rows for each study,
# one for each group, control and intervention
demo_k %>% pluck("n") %>% map_int(nrow) %>% head(3) / 2
#> [1] 3 3 3
demo_k %>% pluck("k") %>% head(3)
#> [1] 3 3 3
Once we have established a set of sample sizes for a given distribution, with parameters, and so forth, I usually want to generate a sample for each of those n. We need to adjust the value of the sampled data based on the median ratio, and whether the n is from a control or intervention group.
A random effect is added to account for the between-study error $\tau^2$.
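As an illustrative sketch of the idea (not simeta's internal code; the multiplicative form of the random effect is one plausible construction assumed here):

```r
# illustration only: one arm's sample, with the intervention arm scaled
# by the effect ratio (scaling a sample scales its median by the same
# factor), and a study-level random effect inducing between-study error
sample_arm <- function(n, study_effect, group = "control",
                       effect_ratio = 1.2) {
  ratio <- if (group == "intervention") effect_ratio else 1
  rexp(n, rate = 1) * ratio * exp(study_effect)
}

# the random effect is shared by both arms of the same study
study_effect <- rnorm(1, mean = 0, sd = sqrt(0.2)) # tau_sq = 0.2
control      <- sample_arm(20, study_effect, group = "control")
intervention <- sample_arm(20, study_effect, group = "intervention")
```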
For meta-analysis data, we work with summary statistics, so we drop the sample and return tabulated summary stats.
sim_stats() %>% kable()
study | effect_c | effect_spread_c | n_c | effect_i | effect_spread_i | n_i |
---|---|---|---|---|---|---|
Bob_2004 | 58.77079 | 0.2146638 | 48 | 51.03713 | 0.1810027 | 42 |
Oromë_1981 | 49.68544 | 0.2232972 | 45 | 60.28628 | 0.1896956 | 43 |
Ufthak_1965 | 68.67402 | 0.2019829 | 42 | 43.72458 | 0.1893396 | 48 |
- facet labels with TeX wrapping
- parameter sets with colour coding
Archived PhD work (see phd-scripts/)
In a trial, we’d first want to simulate some data for a given distribution; for this we use the sim_stats function discussed in the section above.
With the summary statistics, we then calculate an estimate of the effect or the variance of the effect.
- simulate data
- calculate summary statistics
- calculate estimates using summary statistics
- calculate effects using estimates (difference, standardised, log-ratio)[^1]
- meta-analyse
- return simulation results of interest
The first two steps are taken care of by the sim_stats function. The third step will by necessity be bespoke. But the rest could be automated, assuming the results take the same form.
step | input | output |
---|---|---|
calculate estimates | summary statistics as produced by sim_stats | summary stats |
calculate effects | summary stats | effect and effect_se |
meta-analyse | effect and effect_se | rma object |
summary stats | rma object | some kind of broom-ing script |
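For the calculate-effects step, one option is metafor's escalc(), shown here as a sketch on sim_stats output (a suggestion, not necessarily what metatrial does internally):

```r
# sketch: standardised mean differences (yi) and their variances (vi)
# from the summary statistics produced by sim_stats
library(metafor)
library(simeta)

effects <- escalc(
  measure = "SMD",
  m1i = effect_c, sd1i = effect_spread_c, n1i = n_c,
  m2i = effect_i, sd2i = effect_spread_i, n2i = n_i,
  data = sim_stats()
)
```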
metatrial()
So, now we can put together some generic summarisations: things I always want to do, like calculating the coverage probability, confidence interval width, and bias. Most results here are mean values across all trials, the exceptions being the cp_ variables.
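For instance, coverage probability is the proportion of trials whose confidence interval contains the true effect. A sketch, assuming a dataframe of trial-level results with columns ci_lb, ci_ub, estimate, and true_effect (illustrative names, not metatrial's actual output):

```r
# sketch: summarise trial-level results into the metrics above
library(dplyr)

trials_df %>%
  summarise(
    cp = mean(ci_lb <= true_effect & true_effect <= ci_ub),
    ci_width = mean(ci_ub - ci_lb),
    bias = mean(estimate - true_effect)
  )
```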
metasim calls metatrial many times and summarises the results.
metasim()
(sim <- metasims())
sim %>% coverage_plot()
[^1]: Ideally this would be configurable but let’s hardcode it for now.