---
######################################
# Click on "Run Document" in RStudio #
######################################
title: "Good Practices in ABC"
author: "Miguel de Navascués"
output: learnr::tutorial
runtime: shiny_prerendered
bibliography: src/references.bib
---
These learning resources are available at [GitHub](https://github.com/mnavascues/ABCRFtutorial) & [Zenodo](http://doi.org/10.5281/zenodo.1435503).
## Setup (tl;dr)
This unit uses R packages `learnr` [@Schloerke2020], `phyclust` [@Chen2011] and `abc` [@Csillery2012].
```{r setup, include=TRUE}
library(learnr)
# function ms() (Hudson's coalescent simulator)
library(phyclust, quietly=T)
# functions for ABC
suppressMessages(library(abc, quietly=T))
# load data and reference table
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
ref_table_1b <- readRDS(file="data/ref_table_1b.RDS")
```
## Evaluation of the model and prior: Goodness of fit
#### (after generating the reference table & before running ABC)
As discussed in the previous unit, the threshold distance for the rejection algorithm is not set *a priori*. Instead, a proportion of the simulations closest to the observation (the tolerance) is decided *a priori*, and the furthest simulation among those kept defines the threshold distance *a posteriori*. This can be problematic: if the model and priors used for the simulations generate data very different from the observation, the ABC approach will still keep the chosen proportion of simulations that are "least different". It is therefore imperative not to apply the ABC method blindly, and to verify that the simulations produce data comparable to the observation.
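As a minimal sketch of this idea, assuming the `ref_table_1` and `target` objects loaded in the setup chunk, the threshold implied by a given tolerance can be made explicit. The sketch uses plain Euclidean distances; the `abc()` function standardises the statistics internally before computing distances, a step skipped here for brevity.
```{r eval=FALSE, echo=TRUE}
# Sketch: how the tolerance proportion defines the rejection threshold a posteriori.
# Assumes `ref_table_1` and `target` as loaded in the setup chunk.
sumstats <- c("S", "PI", "NH", "TD", "FLD")
# Euclidean distance of every simulation to the observed data (first target row)
dist_to_obs <- sqrt(rowSums(sweep(as.matrix(ref_table_1[, sumstats]), 2,
                                  unlist(target[1, sumstats]))^2))
tolerance <- 0.1
# the threshold is the distance of the furthest simulation among the kept proportion
threshold <- quantile(dist_to_obs, probs = tolerance)
accepted  <- dist_to_obs <= threshold
```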
### Exercise 1
Compare the simulations in the reference table with the data from *octomanati* and *rinocaracol*:
* Use Principal Component Analysis (`princomp()`) on the simulations and plot different pairs of PCs (e.g. change the code to `pc<-c(3,4)`).
* Project the target data onto the plot (with `predict()`). Do the simulations generate data similar to the target data?
* Compare the summary statistic values between the target and the simulations. Which summary statistics do not fit? (See also the marginal check sketched after the solution below.)
* Propose an alternative model and discuss why you think it would solve this problem.
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
```
```{r prior_checking, exercise=TRUE, exercise.lines=20}
num_of_sims <- 10000
# Principal Component Analysis
PCA_result <- princomp(ref_table_1[,c("S","PI","NH","TD","FLD")])
pc<-c(1,2)
plot(PCA_result$scores[seq_len(num_of_sims),pc[1]],
     PCA_result$scores[seq_len(num_of_sims),pc[2]],
     xlab=paste("PCA",pc[1]), ylab=paste("PCA",pc[2]))
# projection of target data
PCA_target <- predict(PCA_result, target[,c("S","PI","NH","TD","FLD")])
points(PCA_target[,pc[1]],PCA_target[,pc[2]],col=c("yellow","orange"))
```
```{r prior_checking-solution}
num_of_sims <- 10000
# Principal Component Analysis
PCA_result <- princomp(ref_table_1[,c("S","PI","NH","TD","FLD")])
pc<-c(3,4)
plot(PCA_result$scores[seq_len(num_of_sims),pc[1]],
     PCA_result$scores[seq_len(num_of_sims),pc[2]],
     xlab=paste("PCA",pc[1]), ylab=paste("PCA",pc[2]))
# projection of target data
PCA_target <- predict(PCA_result, target[,c("S","PI","NH","TD","FLD")])
points(PCA_target[,pc[1]], PCA_target[,pc[2]], col=c("yellow","orange"))
plot(ref_table_1$TD[seq_len(num_of_sims)],
     ref_table_1$FLD[seq_len(num_of_sims)],
     xlab="Tajima's D", ylab="Fu and Li's D",
     xlim=c(-3,4), ylim=c(-5,3))
points(target$TD, target$FLD, col=c("yellow","orange"))
```
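For the third bullet of the exercise, one simple way to compare each summary statistic marginally is sketched below, assuming the `ref_table_1` and `target` objects loaded in the setup chunk. A statistic whose observed values fall far in the tails of the simulated distribution is a candidate for explaining the lack of fit.
```{r eval=FALSE, echo=TRUE}
# Sketch: one histogram per summary statistic, with the two observed values
# (octomanati, rinocaracol) marked as vertical lines.
num_of_sims <- 10000
sumstats <- c("S", "PI", "NH", "TD", "FLD")
par(mfrow = c(2, 3))
for (stat in sumstats) {
  hist(ref_table_1[seq_len(num_of_sims), stat], main = stat, xlab = stat)
  abline(v = target[, stat], col = c("yellow", "orange"), lwd = 2)
}
par(mfrow = c(1, 1))
```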
## Evaluation of the method: Cross-validation
#### (after running ABC & before applying to the data)
The performance of the method can be evaluated by dividing the reference table into a *training set*, used to perform the ABC analysis, and a *testing set*, for which estimates are obtained and compared with the true parameter values. Alternatively, it is possible to do a *leave-one-out* cross-validation: one simulation is taken out of the reference table and the remaining simulations are used to obtain an ABC estimate for it. This is repeated a large number of times, and the estimates are compared with the true values.
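As a rough, hand-written sketch of the leave-one-out idea, assuming the `ref_table_1` object loaded in the setup chunk, the loop below does explicitly what the `cv4abc()` function used in the next exercise packages more conveniently.
```{r eval=FALSE, echo=TRUE}
# Sketch: leave-one-out cross-validation "by hand" with the rejection algorithm.
sumstats  <- c("S", "PI", "NH", "TD")
nval      <- 20     # number of left-out simulations (keep small: this is slow)
tolerance <- 0.1
left_out  <- sample(seq_len(nrow(ref_table_1)), nval)
theta_true <- ref_table_1$theta[left_out]
theta_hat  <- numeric(nval)
for (i in seq_len(nval)) {
  pseudo_target <- unlist(ref_table_1[left_out[i], sumstats])
  training      <- ref_table_1[-left_out[i], ]
  fit <- abc(target  = pseudo_target,
             param   = training$theta,
             sumstat = training[, sumstats],
             tol     = tolerance,
             method  = "rejection")
  # point estimate: median of the accepted parameter values
  theta_hat[i] <- median(fit$unadj.values)
}
plot(log10(theta_true), log10(theta_hat))
abline(a = 0, b = 1)
```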
### Exercise 2
Use the cross-validation function `cv4abc()` from the `abc` package to evaluate the performance of the ABC method. This is a time-consuming task; for the purpose of this exercise, use a low cross-validation sample size (e.g. `nval=20`; note that you would want a much higher number for real analyses):
* Are the estimates OK? Can they be improved?
* Try with a large tolerance value (0.8) or using only Tajima's $D$ as summary statistic.
* How would you provide an objective comparison (rather than a visual one) of the true and estimated values? ([hint](https://en.wikipedia.org/wiki/Estimator); one possibility is sketched after the solution below)
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
```
```{r cross_validation, exercise=TRUE, exercise.lines=25}
num_of_sims <- 10000
cross_validation <- cv4abc(param = ref_table_1[seq_len(num_of_sims),"theta"],
                           sumstat = _______,
                           nval = 20,
                           tol = _______,
                           method = "rejection")
```
```{r cross_validation-solution}
num_of_sims <- 10000
cross_validation <- cv4abc(param = ref_table_1[seq_len(num_of_sims),"theta"],
                           sumstat = ref_table_1[seq_len(num_of_sims),c("S","PI","NH","TD")],
                           nval = 20, tol = 0.1, method = "rejection")
plot(log10(cross_validation$true$P1),
     log10(cross_validation$estim$tol0.1),
     xlab = expression(log(theta)),
     ylab = expression(log(hat(theta))),
     xlim = c(-1,2), ylim = c(-1,2))
abline(a = 0, b = 1)
```
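One possible answer to the last question of the exercise, sketched under the assumption that the `cross_validation` object from the solution above has been computed: bias and root mean squared error give an objective, non-visual summary of estimator quality, and `summary()` on a `cv4abc` object reports a prediction error for each tolerance value.
```{r eval=FALSE, echo=TRUE}
# Sketch: objective comparison of true vs estimated values
# (assumes the `cross_validation` object from the solution chunk above)
theta_true <- cross_validation$true$P1
theta_hat  <- as.numeric(unlist(cross_validation$estim$tol0.1))
c(bias = mean(theta_hat - theta_true),
  RMSE = sqrt(mean((theta_hat - theta_true)^2)))
# prediction error per tolerance value, as reported by the abc package
summary(cross_validation)
```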
## Evaluation of the posterior: comparison with prior
#### (after applying ABC to the data)
The posterior probability combines the previous information (the prior probability) with the new information provided by the data. If the data are not informative about the parameters of the model (with respect to the information already present in the prior), there will be little or no difference between the prior and posterior distributions. It is therefore imperative to verify that the posterior distribution differs from the prior. Otherwise, we could draw conclusions from the resulting posterior distribution unaware that our experiment and data did not provide any new insight into the question we are addressing.
This is important because priors can be used to incorporate previous information on the parameters. In the case of our population genetics example, we have some information on mutation rates (from pedigrees) and population sizes (from censuses) in *octomanati* and *rinocaracol*. Since $\theta = 2N\mu$, we can draw parameters $N$ and $\mu$ from prior probability distributions and combine them to obtain the prior of $\theta$. Our prior knowledge on *octomanati* and *rinocaracol* points us towards mitochondrial mutation rates between $3\times 10^{-6}$ and $10^{-4}$ per bp per generation (sequence length of 100 bp) and effective population sizes between $1$ and $30$ million individuals. I have generated a reference table from such priors with the following code:
```{r eval=FALSE, echo=TRUE}
sample_size = 50
num_of_sims = 100000
mu_prime <- 10^runif(num_of_sims,min=-7.5,max=-6)
N_prime <- 10^runif(num_of_sims,min=6,max=7.5)
theta_prime <- 2*N_prime*mu_prime
msout <- ms(nsam = sample_size,
            opt = "-t tbs",
            tbs.matrix = cbind(theta_prime))
msout <- read.ms.output(txt=msout)
ref_table <- data.frame(theta=theta_prime,
                        S=S(msout),
                        PI=PI(msout),
                        NH=NH(msout),
                        TD=TajimaD(msout),
                        FLD=FuLiD(msout))
saveRDS(ref_table,file="data/ref_table_1b.RDS")
```
### Exercise 3
Using the new reference table, plot the posterior distribution from the ABC analysis [click `Run Code`]:
* Is the analysis informative about $\theta$?
* Which summary statistic is used? Is Tajima's $D$, on its own, informative about $\theta$? What is going on?
* Add to the plot the histogram with the prior probability distribution.
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1b <- readRDS(file="data/ref_table_1b.RDS")
```
```{r unexpected_prior, exercise=TRUE, exercise.lines=30}
num_of_sims <- 100000
SS <- target[1,"TD"]
SS_prime <- ref_table_1b[seq_len(num_of_sims),"TD"]
theta_prime <- ref_table_1b$theta[seq_len(num_of_sims)]
tolerance <- 0.05
abc_result <- abc(target = SS,
                  param = theta_prime,
                  sumstat = SS_prime,
                  tol = tolerance,
                  method = "rejection")
theta_prime_accept <- as.vector(abc_result$unadj.values)
hist(x = log10(theta_prime_accept),
     breaks = seq(-2,3,0.05),
     col = "blue",
     freq = FALSE,
     ylim = c(0,1), xlim = c(-1,2),
     xlab = expression(log(theta)), main = "",
     ylab = "probability density")
```
```{r unexpected_prior-hint}
num_of_sims <- 100000
SS <- target[1,"TD"]
SS_prime <- ref_table_1b[seq_len(num_of_sims),"TD"]
theta_prime <- ref_table_1b$theta[seq_len(num_of_sims)]
tolerance <- 0.05
abc_result <- abc(target = SS,
                  param = theta_prime,
                  sumstat = SS_prime,
                  tol = tolerance,
                  method = "rejection")
theta_prime_accept <- as.vector(abc_result$unadj.values)
hist(log10(theta_prime_accept),
     breaks = seq(-2,3,0.05),
     col = "blue",
     freq = FALSE,
     ylim = c(0,1), xlim = c(-1,2),
     xlab = expression(log(theta)), main = "",
     ylab = "probability density")
hist(_______,
     breaks = seq(-2,3,0.05),
     col = "grey", freq = FALSE, add = ______)
```
```{r unexpected_prior-solution}
num_of_sims <- 100000
SS <- target[1,"TD"]
SS_prime <- ref_table_1b[seq_len(num_of_sims),"TD"]
theta_prime <- ref_table_1b$theta[seq_len(num_of_sims)]
tolerance <- 0.05
abc_result <- abc(target = SS,
                  param = theta_prime,
                  sumstat = SS_prime,
                  tol = tolerance,
                  method = "rejection")
theta_prime_accept <- as.vector(abc_result$unadj.values)
hist(x = log10(theta_prime_accept),
     breaks = seq(-2,3,0.05),
     col = "blue",
     freq = FALSE,
     ylim = c(0,1), xlim = c(-1,2),
     xlab = expression(log(theta)), main = "",
     ylab = "probability density")
hist(x = log10(theta_prime),
     breaks = seq(-2,3,0.05),
     col = "grey", freq = FALSE, add = TRUE)
```
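Beyond the visual overlay in the solution above, the prior/posterior comparison can also be made quantitatively. The sketch below compares quantiles and applies a two-sample Kolmogorov-Smirnov test as one possible check, assuming the `theta_prime` (prior draws) and `theta_prime_accept` (posterior sample) objects from the solution chunk.
```{r eval=FALSE, echo=TRUE}
# Sketch: quantitative comparison of prior and posterior distributions
# (assumes `theta_prime` and `theta_prime_accept` from the solution above)
quantile(log10(theta_prime),        probs = c(0.025, 0.5, 0.975))  # prior
quantile(log10(theta_prime_accept), probs = c(0.025, 0.5, 0.975))  # posterior
# two-sample Kolmogorov-Smirnov test: does the posterior sample differ
# from the prior sample at all?
ks.test(theta_prime_accept, theta_prime)
```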
#
#### References