-
Notifications
You must be signed in to change notification settings - Fork 3
/
4.Beyond.Rmd
executable file
·186 lines (142 loc) · 8.62 KB
/
4.Beyond.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
---
######################################
# Click on "Run Document" in RStudio #
######################################
title: "Beyond model choice and parameter estimation"
author: "Miguel de Navascués"
output: learnr::tutorial
runtime: shiny_prerendered
bibliography: src/references.bib
---
These learning resources are available at [GitHub](https://github.com/mnavascues/ABCRFtutorial) & [Zenodo](http://doi.org/10.5281/zenodo.1435503).
## Setup (tl;dr)
This unit uses R packages `learnr` [@Schloerke2020], `abc` [@Csillery2012] and `weights` [@Pasek2018].
```{r setup, include=TRUE}
library(learnr)
# functions for ABC
suppressMessages(library(abc, quietly=T))
# function for weighted histogram
suppressMessages(library(weights, quietly=T))
# load data and reference table
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
ref_table_2 <- readRDS(file="data/ref_table_2.RDS")
```
## Latent variables
A [latent variable](https://en.wikipedia.org/wiki/Latent_variable) is a variable of the model that is not directly observable, in opposition with variables that are observable (i.e. data). For instance, in the case of the coalescent model, the Time to the Most Recent Common Ancestor (TMRCA) is a random variable: each individual realization of the process with the same parameter values can have a different TMRCA. However, the TMRCA for real data is not observable. Some latent variables can be quantities of interest, so it can be useful to estimate them. Often, these latent variables have a strong impact on the data, so they have a higher correlation with them that some parameters. For instance, the correlation between $\theta$ and $\pi$ in the reference table of the constant model is 0.87 while the correlation between TMRCA and $\pi$ is 0.95 (Pearson correlation coefficient using `cor()`).
### Exercise 1
Estimate the TMRCA for *octomanati* and *rinocaracol*. Which model will you use? Why?
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
ref_table_2 <- readRDS(file="data/ref_table_2.RDS")
```
```{r latent_variables, exercise=TRUE, exercise.lines=30}
octomanati_result <- abc(____)
rinocaracol_result <- abc(____)
wtd.quantile(octomanati_result$____,_____)
wtd.quantile(rinocaracol_result$____,_____)
```
```{r latent_variables-solution}
octomanati_result <- abc(target = target["octomanati",c("S","PI","NH","TD")],
param = ref_table_1[,"tmrca"],
sumstat = ref_table_1[,c("S","PI","NH","TD")],
tol = 0.01,transf = "log",method = "loclinear")
rinocaracol_result <- abc(target = target["rinocaracol",c("S","PI","NH","TD")],
param = ref_table_2[,"tmrca"],
sumstat = ref_table_2[,c("S","PI","NH","TD")],
tol = 0.01,transf = "log",method = "loclinear")
cat("TMRCA estimate (point estimate and 95%CI) for octomanati")
wtd.quantile(octomanati_result$adj.values,
weights = octomanati_result$weights,
probs = c(0.5, 0.025, 0.975))
cat("TMRCA estimate (point estimate and 95%CI) for rinocaracol")
wtd.quantile(rinocaracol_result$adj.values,
weights = rinocaracol_result$weights,
probs = c(0.5, 0.025, 0.975))
```
## Inference from several models
In some cases model choice might not give a clear result of one superior model over the rest of the models considered. If several models explain fairly well the observed data, which one should be used for parameter inference? Common *features* of the models can be inferred by pooling the reference tables from different models. These common features can be parameters (e.g. ancestral $\theta$ in a divergence model with several population), latent variables (e.g. TMRCA) or derived parameters/constants from the primary parameters (e.g. $\theta_{\mathrm{present}}$, which takes the value of $\theta$ for the constant size model and the value of $\theta_0$ for the population size change model).
### Exercise 2
Estimate again the TMRCA for *octomanati* using both models. Does it change substantially from using only one model?
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
ref_table_2 <- readRDS(file="data/ref_table_2.RDS")
```
```{r tmrca_from_2_models, exercise=TRUE, exercise.lines=30}
ref_table <- rbind(ref_table_1[,c("tmrca","S","PI","NH","TD","FLD")],
ref_table_2[,c("tmrca","S","PI","NH","TD","FLD")])
octomanati_result <- abc(______)
```
```{r tmrca_from_2_models-solution}
ref_table <- rbind(ref_table_1[,c("tmrca","S","PI","NH","TD","FLD")],
ref_table_2[,c("tmrca","S","PI","NH","TD","FLD")])
octomanati_result <- abc(target = target["octomanati",c("S","PI","NH","TD","FLD")],
param = ref_table[,"tmrca"],
sumstat = ref_table[,c("S","PI","NH","TD","FLD")],
tol = 0.01,transf = "log",method = "loclinear")
cat("TMRCA estimate (point estimate and 95%CI) for octomanati")
wtd.quantile(octomanati_result$adj.values,
weights = octomanati_result$weights,
probs = c(0.5, 0.025, 0.975))
```
### Exercise 3
In this exercise we are going to use derived parameters common in the two models to build a ABC skyline plot [@Navascues2017]:
* From the primary parameters in the reference table for the two models considered (constant population size and population size change) calculate derived parameters $\theta_t$ for $t\in(0,0.25,0.5,0.75,1,1.25,1.5,1.75,2)$ for each simulation.
* Estimate those derived parameters for *octomanati* and *rinocaracol*.
* Plot the $\hat\theta_t$ against $t$ (use a line plot).
* Do you need to perform model selection to extract conclusions about the demography of these populations? Can we reject population size change for *octomanati*?
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
ref_table_2 <- readRDS(file="data/ref_table_2.RDS")
```
```{r skylineplot, exercise=TRUE, exercise.lines=50}
ref_table <- rbind(ref_table_1[,c("S","PI","NH","TD","FLD")],
ref_table_2[,c("S","PI","NH","TD","FLD")])
tt <- seq(0,2,0.25)
___________
```
```{r skylineplot-solution}
ref_table <- rbind(ref_table_1[,c("S","PI","NH","TD","FLD")],
ref_table_2[,c("S","PI","NH","TD","FLD")])
tt <- seq(0,2,0.25)
for (i in seq_along(tt)){
temp_theta_t <- ref_table_2$theta0
temp_theta_t[ref_table_2$tau<tt[i]] <- ref_table_2$theta1[ref_table_2$tau<tt[i]]
temp_theta_t <- c(ref_table_1$theta,temp_theta_t)
ref_table <- cbind(ref_table,temp_theta_t)
}
colnames(ref_table) <- c("S","PI","NH","TD","FLD","theta_t0.00",
"theta_t0.25","theta_t0.50","theta_t0.75","theta_t1.00",
"theta_t1.25","theta_t1.50","theta_t1.75","theta_t2.00")
octomanati_result <- abc(target = target["octomanati",c("S","PI","NH","TD","FLD")],
param = ref_table[,c("theta_t0.00",
"theta_t0.25","theta_t0.50","theta_t0.75","theta_t1.00",
"theta_t1.25","theta_t1.50","theta_t1.75","theta_t2.00")],
sumstat = ref_table[,c("S","PI","NH","TD","FLD")],
tol = 0.01,transf = "log",method = "loclinear")
rinocaracol_result <- abc(target = target["rinocaracol",c("S","PI","NH","TD","FLD")],
param = ref_table[,c("theta_t0.00",
"theta_t0.25","theta_t0.50","theta_t0.75","theta_t1.00",
"theta_t1.25","theta_t1.50","theta_t1.75","theta_t2.00")],
sumstat = ref_table[,c("S","PI","NH","TD","FLD")],
tol = 0.01,transf = "log",method = "loclinear")
octomanati_skyline <- apply(octomanati_result$adj.values, 2, wtd.quantile,
weights = octomanati_result$weights, probs = c(0.5,0.025,0.975))
rinocaracol_skyline <- apply(rinocaracol_result$adj.values, 2, wtd.quantile,
weights = rinocaracol_result$weights, probs = c(0.5,0.025,0.975))
plot(tt,log10(octomanati_skyline[1,]),
ylim=c(-1,2), # prior limits
type="l",col="blue",lwd=2,
ylab=expression(log[10]*theta),
xlab="time (in number of mutations)")
lines(tt,log10(octomanati_skyline[2,]), col="blue", lty=2)
lines(tt,log10(octomanati_skyline[3,]), col="blue", lty=2)
lines(tt,log10(rinocaracol_skyline[1,]), col="red", lwd=2)
lines(tt,log10(rinocaracol_skyline[2,]), col="red", lty=2)
lines(tt,log10(rinocaracol_skyline[3,]), col="red", lty=2)
```
#
#### References