-
Notifications
You must be signed in to change notification settings - Fork 3
/
3.ClassicalABC.Rmd
executable file
·330 lines (256 loc) · 12.6 KB
/
3.ClassicalABC.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
---
######################################
# Click on "Run Document" in RStudio #
######################################
title: "Classical ABC"
author: "Miguel de Navascués"
output: learnr::tutorial
runtime: shiny_prerendered
bibliography: src/references.bib
---
These learning resources are available at [GitHub](https://github.com/mnavascues/ABCRFtutorial) & [Zenodo](http://doi.org/10.5281/zenodo.1435503).
## Setup (tl;dr)
This unit uses R packages `learnr` [@Schloerke2020], `abc` [@Csillery2012] and `weights` [@Pasek2018].
```{r setup, include=TRUE}
library(learnr)
# functions for ABC
suppressMessages(library(abc, quietly=T))
# function for weighted histogram
suppressMessages(library(weights, quietly=T))
# load data and reference table
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
ref_table_2 <- readRDS(file="data/ref_table_2.RDS")
```
## Regression ABC
As we have seen in unit 1, there is a trade-off from adding some tolerance to the rejection algorithm: with a big tolerance the results tend to the prior, with a small tolerance the is a loss of precision in the posterior distribution estimate. This problem is partially corrected by an additional local regression step [@Beaumont2002]. The local regression is between the parameter values and the summary statistics from the simulations accepted after the rejection step. Parameter values from those simulations are then *adjusted* by projection using the regression. The regression algorithm allows to increase the proportion of simulations accepted correcting to some degree the *widening* of the posterior distribution.
### Exercise 1
Explore the application of the regression algorithm implemented in `abc()`.
* Compare the posterior probability distribution estimated from adjusted and from unadjusted (original) values.
* Explore how this difference changes with the value of tolerance (for instance `tolerance<-0.8`).
* Try also with Tajima's $D$ as summary statistic (`TD`).
* Try using using $\pi$ and Tajima's $D$ (i.e. `SS <-target[1,c("PI","TD")]`) and skip the scatter plot.
* Try using $\pi$ and some completely uninformative summary statistic (i.e. generate a fake random summary statistic using `runif()` for both reference table and target data).
* Do you think a linear regression makes sense in this case? Open the help for `abc()` function and read about the option `method="neuralnet"`.
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
```
```{r abc_regression, exercise=TRUE, exercise.lines=30}
num_of_sims <- 10000 # use a subset of simulation from a total of 1 000 000
SS <- target[1,"PI"]
SS_prime <- ref_table_1[seq_len(num_of_sims),"PI"]
theta_prime <- ref_table_1$theta[seq_len(num_of_sims)]
tolerance <- 0.2
abc_result <- abc(target = SS,
param = theta_prime,
sumstat = SS_prime,
tol = tolerance,
transf = "log", hcorr = F,
method = "loclinear")
theta_prime_accept <- as.vector(abc_result$unadj.values)
theta_prime_accept_reg <- as.vector(abc_result$adj.values)
SS_prime_accept <- abc_result$ss
local_regression <- lm(log10(theta_prime_accept)~SS_prime_accept,
weights=abc_result$weights)
plot(log10(theta_prime), SS_prime,
xlab = expression(log[10](theta*"'")),
ylab = "SS'",
ylim = c(0,30))
abline(h = SS, col = "blue")
points(log10(theta_prime_accept), SS_prime_accept, col = "blue")
abline(a = -local_regression$coefficients[1]/local_regression$coefficients[2],
b = 1/local_regression$coefficients[2],
col = "yellow",
lwd = 3)
points(x = log10(theta_prime_accept)[1:10],
y = SS_prime_accept[1:10],col = "yellow")
arrows(x0 = log10(theta_prime_accept)[1:10],
y0 = SS_prime_accept[1:10],
x1 = log10(theta_prime_accept_reg)[1:10],
y1 = SS,
col = "yellow",
length = 0.1)
hist(x = log10(theta_prime),
breaks = seq(-1,2,0.05),
col = "grey",
freq = FALSE,
ylim = c(0,4),
main = "",
xlab = expression(log[10](theta)),
ylab = "probability density")
hist(x = log10(theta_prime_accept),
breaks = seq(-1,2,0.05),
col = "blue", freq=FALSE, add=TRUE )
wtd.hist(x = log10(theta_prime_accept_reg),
breaks = seq(-1.5,2,0.05),
col = "yellow", freq=FALSE, add=TRUE,
weight = abc_result$weights)
```
```{r abc_regression-solution}
num_of_sims <- 10000 # use a subset of simulation from a total of 1 000 000
SS <- SS <-target[1,c("PI","TD")]
SS_prime <- ref_table_1[seq_len(num_of_sims),c("PI","TD")]
theta_prime <- ref_table_1$theta[seq_len(num_of_sims)]
tolerance <- 0.1
abc_result <- abc(target = SS,
param = theta_prime,
sumstat = SS_prime,
tol = tolerance,
transf = "log", hcorr = F,
method = "loclinear")
theta_prime_accept <- as.vector(abc_result$unadj.values)
theta_prime_accept_reg <- as.vector(abc_result$adj.values)
hist(x = log10(theta_prime),
breaks = seq(-1,2,0.05),
col = "grey",
freq = FALSE,
ylim = c(0,4),
main = "",
xlab = expression(log[10](theta)),
ylab = "probability density")
hist(x = log10(theta_prime_accept),
breaks = seq(-1,2,0.05),
col = "blue", freq=FALSE, add=TRUE )
wtd.hist(x = log10(theta_prime_accept_reg),
breaks = seq(-1.5,2,0.05),
col = "yellow", freq=FALSE, add=TRUE,
weight = abc_result$weights)
```
## Practice classical ABC
### Exercise 2
Using ABC, estimate $\theta$ for *octomanati* and *rinocaracol*. Provide a measure of the uncertainty of the estimate:
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
```
```{r practice_abc, exercise=TRUE, exercise.lines=30}
octomanati_result <- abc(_____)
______(octomanati_result$adj.values,
weights = octomanati_result$weights,
probs = c(0.5, 0.025, 0.975))
```
```{r practice_abc-solution}
octomanati_result <- abc(target = target["octomanati",c("S","PI","NH","TD","FLD")],
param = ref_table_1[,"theta"],
sumstat = ref_table_1[,c("S","PI","NH","TD","FLD")],
tol = 0.01,transf = "log",method = "loclinear")
rinocaracol_result <- abc(target = target["rinocaracol",c("S","PI","NH","TD","FLD")],
param = ref_table_1[,"theta"],
sumstat = ref_table_1[,c("S","PI","NH","TD","FLD")],
tol = 0.01,transf = "log",method = "loclinear")
cat("theta estimate (point estimate and 95%CI) for octomanati")
wtd.quantile(octomanati_result$adj.values,
weights = octomanati_result$weights,
probs = c(0.5, 0.025, 0.975))
cat("theta estimate (point estimate and 95%CI) for rinocaracol")
wtd.quantile(rinocaracol_result$adj.values,
weights = rinocaracol_result$weights,
probs = c(0.5, 0.025, 0.975))
```
## Model choice
We are going to consider two models. A constant size model with parameter $\theta = 2N\mu$ and a population with one instantaneous change of size with parameters $\theta_0 = 2N_0\mu$ (present scaled mutation rate), $\tau = t\mu$ (mutation scaled time of population size change) and $\theta_1 = 2N_1\mu$ (past scaled mutation rate). We already have use the simulations for the first model (reference table in file `ref_table_1.RDS`), the reference table for the second model is available in file `ref_table_2.RDS` which has been generated with the following code:
```{r eval=FALSE, echo=TRUE}
sample_size <- 50
num_of_sims <- 100000
theta0 <- 10^runif(num_of_sims,min=-1,max=2)
theta1 <- 10^runif(num_of_sims,min=-1,max=2)
tau <- runif(num_of_sims,min=0,max=2)
msout <- ms(nsam = sample_size,
opt = "-t tbs -eN tbs tbs -T",
tbs.matrix = cbind(theta0, tau/theta0, theta1/theta0))
tmrca <- theta0*sapply(read.tree(text = msout[grep("//",msout)+1]),
get.rooted.tree.height)
msout <- read.ms.output(txt=msout)
ref_table <- data.frame(theta0=theta0,
theta1=theta1,
tau=tau,
tmrca=tmrca,
S=S(msout),
PI=PI(msout),
NH=NH(msout),
TD=TajimaD(msout),
FLD=FuLiD(msout))
saveRDS(ref_table,file="data/ref_table_2.RDS")
```
### Exercise 3
Explore the goodness of fit for this new model:
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_2 <- readRDS(file="data/ref_table_2.RDS")
```
```{r goodness_of_fit_model2, exercise=TRUE, exercise.lines=30}
num_of_sims <- 10000
# Principal Component Analysis
PCA_result <- princomp(ref_table_2[,c("S","PI","NH","TD","FLD")])
pc<-c(1,2)
plot(PCA_result$scores[seq_len(num_of_sims),pc[1]],
PCA_result$scores[seq_len(num_of_sims),pc[2]],
xlab=paste("PCA",pc[1]),ylab=paste("PCA",pc[2]))
PCA_target <- predict(____)
plot(____)
```
```{r goodness_of_fit_model2-solution}
num_of_sims <- 10000
# Principal Component Analysis
PCA_result <- princomp(ref_table_2[,c("S","PI","NH","TD","FLD")])
pc<-c(1,2)
plot(PCA_result$scores[seq_len(num_of_sims),pc[1]],
PCA_result$scores[seq_len(num_of_sims),pc[2]],
xlab=paste("PCA",pc[1]),ylab=paste("PCA",pc[2]))
# projection of target data
PCA_target <- predict(PCA_result, target[,c("S","PI","NH","TD","FLD")])
points(PCA_target[,pc[1]],PCA_target[,pc[2]],col=c("yellow","orange"))
plot(ref_table_2$TD[seq_len(num_of_sims)],
ref_table_2$FLD[seq_len(num_of_sims)],
xlab="Tajima's D", ylab="Fu and Li's D")
points(target$TD, target$FLD, col=c("yellow","orange"))
```
ABC model choice follows a similar process as parameter inference: references tables are pooled together and rejection step is applied to keep simulations closer to the observed data (based in summary statistics). The regression step is different, model is treated as a qualitative variable and a logistic regression is used to estimate posterior probabilities of the model.
### Exercise 4
Use function `postpr()` from `abc` package to calculate posterior probabilities for each model for *octomanati* and *rinocaracol*:
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
ref_table_2 <- readRDS(file="data/ref_table_2.RDS")
```
```{r model_choice, exercise=TRUE, exercise.lines=30}
num_of_sims <- 10000
model <- c(rep("constant population size",num_of_sims),rep("population size change",num_of_sims))
sumstats <- rbind(ref_table_1[seq_len(num_of_sims),c("S","PI","NH","TD","FLD")],
ref_table_2[seq_len(num_of_sims),c("S","PI","NH","TD","FLD")])
```
```{r model_choice-solution}
num_of_sims <- 10000
model <- c(rep("constant population size",num_of_sims),rep("population size change",num_of_sims))
sumstats <- rbind(ref_table_1[seq_len(num_of_sims),c("S","PI","NH","TD","FLD")],
ref_table_2[seq_len(num_of_sims),c("S","PI","NH","TD","FLD")])
model_choice_octomanati <- postpr(target=target["octomanati",],
index=model,
sumstat=sumstats,
tol=0.1,
method="mnlogistic")
model_choice_rinocaracol <- postpr(target=target["rinocaracol",],
index=model,
sumstat=sumstats,
tol=0.1,
method="mnlogistic")
# Calculate Bayes Factor (with prior probability 0.5 for each model)
octomanati_K <- model_choice_octomanati$pred[2]/model_choice_octomanati$pred[1]
rinocaracol_K <- model_choice_rinocaracol$pred[2]/model_choice_rinocaracol$pred[1]
cat("Model posterior probabilities for octomanati:"); model_choice_octomanati$pred
cat("Corresponding Bayes factor is:"); octomanati_K
cat("Model posterior probabilities for rinocaracol:"); model_choice_rinocaracol$pred
cat("Corresponding Bayes factor is:"); rinocaracol_K
```
In Bayesian statistics the support of one model over the alternative is quantified using the [Bayes factor (K)](https://en.wikipedia.org/wiki/Bayes_factor). The strength of the support is often evaluated following Jeffrey's scale of interpretation:
K | Strength of evidence for Model 1
-----|--------------------------------
$<10^0$ | Support for the Model 2
$10^0$ to $10^{0.5}$ | Weak
$10^{0.5}$ to $10^{1}$ | Substantial
$10^1$ to $10^{1.5}$ | Strong
$10^{1.5}$ to $10^2$ | Very strong
$>10^2$ | Decisive
#
#### References