---
######################################
# Click on "Run Document" in RStudio #
######################################
title: "Good Practices in ABC"
author: "Miguel de Navascués"
output: learnr::tutorial
runtime: shiny_prerendered
bibliography: src/references.bib
---
These learning resources are available at [GitHub](https://github.com/mnavascues/ABCRFtutorial) & [Zenodo](http://doi.org/10.5281/zenodo.1435503).
## Setup (tl;dr)
This unit uses R packages `learnr` [@Schloerke2020], `phyclust` [@Chen2011] and `abc` [@Csillery2012].
```{r setup, include=TRUE}
library(learnr)
# function ms() (Hudson's coalescent simulator)
library(phyclust, quietly=T)
# functions for ABC
suppressMessages(library(abc, quietly=T))
# load data and reference table
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
ref_table_1b <- readRDS(file="data/ref_table_1b.RDS")
```
## Evaluation of the model and prior: Goodness of fit
#### (after generating the reference table & before running ABC)
As discussed in the previous unit, the threshold distance for the rejection algorithm is not set *a priori*. Instead, a proportion of the simulations closest to the observation (the tolerance) is decided *a priori*, and the furthest simulation among those kept defines the threshold distance *a posteriori*. This can be problematic: if the model and priors used for the simulations generate data very different from the observation, the ABC approach will still keep the chosen proportion of simulations that are "least different". It is therefore imperative not to apply the ABC method blindly, and to verify that the simulations produce data comparable to the observation.
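As a minimal sketch of this idea, assuming the `ref_table_1` and `target` objects loaded in the setup chunk, the threshold implied by a given tolerance can be made explicit. The sketch uses plain Euclidean distances; the `abc()` function standardises the statistics internally before computing distances, a step skipped here for brevity.
```{r eval=FALSE, echo=TRUE}
# Sketch: how the tolerance proportion defines the rejection threshold a posteriori.
# Assumes `ref_table_1` and `target` as loaded in the setup chunk.
sumstats <- c("S", "PI", "NH", "TD", "FLD")
# Euclidean distance of every simulation to the observed data (first target row)
dist_to_obs <- sqrt(rowSums(sweep(as.matrix(ref_table_1[, sumstats]), 2,
                                  unlist(target[1, sumstats]))^2))
tolerance <- 0.1
# the threshold is the distance of the furthest simulation among the kept proportion
threshold <- quantile(dist_to_obs, probs = tolerance)
accepted  <- dist_to_obs <= threshold
```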
### Exercise 1
Compare the simulations in the reference table with the data from *octomanati* and *rinocaracol*:
* Use Principal Component Analysis (`princomp()`) on the simulations and plot different pairs of PCs (e.g. change the code to `pc<-c(3,4)`).
* Project the target data onto the plot (with `predict()`). Do the simulations generate data similar to the target data?
* Compare the summary statistic values between the target and the simulations. Which summary statistics do not fit? (See also the marginal check sketched after the solution below.)
* Propose an alternative model and discuss why you think it would solve this problem.
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
```
```{r prior_checking, exercise=TRUE, exercise.lines=20}
num_of_sims <- 10000
# Principal Component Analysis
PCA_result <- princomp(ref_table_1[,c("S","PI","NH","TD","FLD")])
pc<-c(1,2)
plot(PCA_result$scores[seq_len(num_of_sims),pc[1]],
     PCA_result$scores[seq_len(num_of_sims),pc[2]],
     xlab=paste("PCA",pc[1]), ylab=paste("PCA",pc[2]))
# projection of target data
PCA_target <- predict(PCA_result, target[,c("S","PI","NH","TD","FLD")])
points(PCA_target[,pc[1]],PCA_target[,pc[2]],col=c("yellow","orange"))
```
```{r prior_checking-solution}
num_of_sims <- 10000
# Principal Component Analysis
PCA_result <- princomp(ref_table_1[,c("S","PI","NH","TD","FLD")])
pc<-c(3,4)
plot(PCA_result$scores[seq_len(num_of_sims),pc[1]],
     PCA_result$scores[seq_len(num_of_sims),pc[2]],
     xlab=paste("PCA",pc[1]), ylab=paste("PCA",pc[2]))
# projection of target data
PCA_target <- predict(PCA_result, target[,c("S","PI","NH","TD","FLD")])
points(PCA_target[,pc[1]], PCA_target[,pc[2]], col=c("yellow","orange"))
plot(ref_table_1$TD[seq_len(num_of_sims)],
     ref_table_1$FLD[seq_len(num_of_sims)],
     xlab="Tajima's D", ylab="Fu and Li's D",
     xlim=c(-3,4), ylim=c(-5,3))
points(target$TD, target$FLD, col=c("yellow","orange"))
```
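For the third bullet of the exercise, one simple way to compare each summary statistic marginally is sketched below, assuming the `ref_table_1` and `target` objects loaded in the setup chunk. A statistic whose observed values fall far in the tails of the simulated distribution is a candidate for explaining the lack of fit.
```{r eval=FALSE, echo=TRUE}
# Sketch: one histogram per summary statistic, with the two observed values
# (octomanati, rinocaracol) marked as vertical lines.
num_of_sims <- 10000
sumstats <- c("S", "PI", "NH", "TD", "FLD")
par(mfrow = c(2, 3))
for (stat in sumstats) {
  hist(ref_table_1[seq_len(num_of_sims), stat], main = stat, xlab = stat)
  abline(v = target[, stat], col = c("yellow", "orange"), lwd = 2)
}
par(mfrow = c(1, 1))
```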
## Evaluation of the method: Cross-validation
#### (after running ABC & before applying to the data)
The performance of the method can be evaluated by dividing the reference table into a *training set*, used to perform the ABC analysis, and a *testing set*, for which estimates are obtained and compared with the true parameter values. Alternatively, it is possible to do a *leave-one-out* cross-validation: one simulation is taken out of the reference table and the remaining simulations are used to obtain an ABC estimate for it. This is repeated a large number of times, and the estimates are compared with the true values.
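As a rough, hand-written sketch of the leave-one-out idea, assuming the `ref_table_1` object loaded in the setup chunk, the loop below does explicitly what the `cv4abc()` function used in the next exercise packages more conveniently.
```{r eval=FALSE, echo=TRUE}
# Sketch: leave-one-out cross-validation "by hand" with the rejection algorithm.
sumstats  <- c("S", "PI", "NH", "TD")
nval      <- 20     # number of left-out simulations (keep small: this is slow)
tolerance <- 0.1
left_out  <- sample(seq_len(nrow(ref_table_1)), nval)
theta_true <- ref_table_1$theta[left_out]
theta_hat  <- numeric(nval)
for (i in seq_len(nval)) {
  pseudo_target <- unlist(ref_table_1[left_out[i], sumstats])
  training      <- ref_table_1[-left_out[i], ]
  fit <- abc(target  = pseudo_target,
             param   = training$theta,
             sumstat = training[, sumstats],
             tol     = tolerance,
             method  = "rejection")
  # point estimate: median of the accepted parameter values
  theta_hat[i] <- median(fit$unadj.values)
}
plot(log10(theta_true), log10(theta_hat))
abline(a = 0, b = 1)
```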
### Exercise 2
Use the cross-validation function `cv4abc()` from the `abc` package to evaluate the performance of the ABC method. This is a time-consuming task; for the purpose of this exercise, use a low cross-validation sample size (e.g. `nval=20`; note that you would want a much higher number for real analyses):
* Are the estimates OK? Can they be improved?
* Try with a large tolerance value (0.8) or using only Tajima's $D$ as summary statistic.
* How would you provide an objective comparison (rather than a visual one) of the true and estimated values? ([hint](https://en.wikipedia.org/wiki/Estimator); one possibility is sketched after the solution below)
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
```
```{r cross_validation, exercise=TRUE, exercise.lines=25}
num_of_sims <- 10000
cross_validation <- cv4abc(param = ref_table_1[seq_len(num_of_sims),"theta"],
                           sumstat = _______,
                           nval = 20,
                           tol = _______,
                           method = "rejection")
```
```{r cross_validation-solution}
num_of_sims <- 10000
cross_validation <- cv4abc(param = ref_table_1[seq_len(num_of_sims),"theta"],
                           sumstat = ref_table_1[seq_len(num_of_sims),c("S","PI","NH","TD")],
                           nval = 20, tol = 0.1, method = "rejection")
plot(log10(cross_validation$true$P1),
     log10(cross_validation$estim$tol0.1),
     xlab = expression(log(theta)),
     ylab = expression(log(hat(theta))),
     xlim = c(-1,2), ylim = c(-1,2))
abline(a = 0, b = 1)
```
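One possible answer to the last question of the exercise, sketched under the assumption that the `cross_validation` object from the solution above has been computed: bias and root mean squared error give an objective, non-visual summary of estimator quality, and `summary()` on a `cv4abc` object reports a prediction error for each tolerance value.
```{r eval=FALSE, echo=TRUE}
# Sketch: objective comparison of true vs estimated values
# (assumes the `cross_validation` object from the solution chunk above)
theta_true <- cross_validation$true$P1
theta_hat  <- as.numeric(unlist(cross_validation$estim$tol0.1))
c(bias = mean(theta_hat - theta_true),
  RMSE = sqrt(mean((theta_hat - theta_true)^2)))
# prediction error per tolerance value, as reported by the abc package
summary(cross_validation)
```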
## Evaluation of the posterior: comparison with prior
#### (after applying ABC to the data)
The posterior probability combines the previous information (the prior probability) with the new information provided by the data. If the data are not informative about the parameters of the model (with respect to the information already present in the prior), there will be little or no difference between the prior and posterior distributions. It is therefore imperative to verify that the posterior distribution differs from the prior. Otherwise, we could draw conclusions from the resulting posterior distribution unaware that our experiment and data did not provide any new insight into the question we are addressing.
This is important because priors can be used to incorporate previous information on the parameters. In the case of our population genetics example, we have some information on mutation rates (from pedigrees) and population sizes (from censuses) in *octomanati* and *rinocaracol*. Since $\theta = 2N\mu$, we can draw parameters $N$ and $\mu$ from prior probability distributions and combine them to obtain the prior of $\theta$. Our prior knowledge on *octomanati* and *rinocaracol* points us towards mitochondrial mutation rates between $3\times 10^{-6}$ and $10^{-4}$ per bp per generation (sequence length of 100 bp) and effective population sizes between $1$ and $30$ million individuals. I have generated a reference table from such priors with the following code:
```{r eval=FALSE, echo=TRUE}
sample_size = 50
num_of_sims = 100000
mu_prime <- 10^runif(num_of_sims,min=-7.5,max=-6)
N_prime <- 10^runif(num_of_sims,min=6,max=7.5)
theta_prime <- 2*N_prime*mu_prime
msout <- ms(nsam = sample_size,
            opt = "-t tbs",
            tbs.matrix = cbind(theta_prime))
msout <- read.ms.output(txt=msout)
ref_table <- data.frame(theta=theta_prime,
                        S=S(msout),
                        PI=PI(msout),
                        NH=NH(msout),
                        TD=TajimaD(msout),
                        FLD=FuLiD(msout))
saveRDS(ref_table,file="data/ref_table_1b.RDS")
```
### Exercise 3
Using the new reference table, plot the posterior distribution from the ABC analysis [click `Run Code`]:
* Is the analysis informative about $\theta$?
* Which summary statistic is used? Is Tajima's $D$, on its own, informative about $\theta$? What is going on?
* Add to the plot the histogram with the prior probability distribution.
```{r eval=T, echo=T}
target <- readRDS(file="data/target.RDS")
ref_table_1b <- readRDS(file="data/ref_table_1b.RDS")
```
```{r unexpected_prior, exercise=TRUE, exercise.lines=30}
num_of_sims <- 100000
SS <- target[1,"TD"]
SS_prime <- ref_table_1b[seq_len(num_of_sims),"TD"]
theta_prime <- ref_table_1b$theta[seq_len(num_of_sims)]
tolerance <- 0.05
abc_result <- abc(target = SS,
                  param = theta_prime,
                  sumstat = SS_prime,
                  tol = tolerance,
                  method = "rejection")
theta_prime_accept <- as.vector(abc_result$unadj.values)
hist(x = log10(theta_prime_accept),
     breaks = seq(-2,3,0.05),
     col = "blue",
     freq = FALSE,
     ylim = c(0,1), xlim = c(-1,2),
     xlab = expression(log(theta)), main = "",
     ylab = "probability density")
```
```{r unexpected_prior-hint}
num_of_sims <- 100000
SS <- target[1,"TD"]
SS_prime <- ref_table_1b[seq_len(num_of_sims),"TD"]
theta_prime <- ref_table_1b$theta[seq_len(num_of_sims)]
tolerance <- 0.05
abc_result <- abc(target = SS,
                  param = theta_prime,
                  sumstat = SS_prime,
                  tol = tolerance,
                  method = "rejection")
theta_prime_accept <- as.vector(abc_result$unadj.values)
hist(log10(theta_prime_accept),
     breaks = seq(-2,3,0.05),
     col = "blue",
     freq = FALSE,
     ylim = c(0,1), xlim = c(-1,2),
     xlab = expression(log(theta)), main = "",
     ylab = "probability density")
hist(_______,
     breaks = seq(-2,3,0.05),
     col = "grey", freq = FALSE, add = ______)
```
```{r unexpected_prior-solution}
num_of_sims <- 100000
SS <- target[1,"TD"]
SS_prime <- ref_table_1b[seq_len(num_of_sims),"TD"]
theta_prime <- ref_table_1b$theta[seq_len(num_of_sims)]
tolerance <- 0.05
abc_result <- abc(target = SS,
                  param = theta_prime,
                  sumstat = SS_prime,
                  tol = tolerance,
                  method = "rejection")
theta_prime_accept <- as.vector(abc_result$unadj.values)
hist(x = log10(theta_prime_accept),
     breaks = seq(-2,3,0.05),
     col = "blue",
     freq = FALSE,
     ylim = c(0,1), xlim = c(-1,2),
     xlab = expression(log(theta)), main = "",
     ylab = "probability density")
hist(x = log10(theta_prime),
     breaks = seq(-2,3,0.05),
     col = "grey", freq = FALSE, add = TRUE)
```
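Beyond the visual overlay in the solution above, the prior/posterior comparison can also be made quantitatively. The sketch below compares quantiles and applies a two-sample Kolmogorov-Smirnov test as one possible check, assuming the `theta_prime` (prior draws) and `theta_prime_accept` (posterior sample) objects from the solution chunk.
```{r eval=FALSE, echo=TRUE}
# Sketch: quantitative comparison of prior and posterior distributions
# (assumes `theta_prime` and `theta_prime_accept` from the solution above)
quantile(log10(theta_prime),        probs = c(0.025, 0.5, 0.975))  # prior
quantile(log10(theta_prime_accept), probs = c(0.025, 0.5, 0.975))  # posterior
# two-sample Kolmogorov-Smirnov test: does the posterior sample differ
# from the prior sample at all?
ks.test(theta_prime_accept, theta_prime)
```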
#
#### References