3.ClassicalABC.Rmd

---
######################################
# Click on "Run Document" in RStudio #
######################################
title: "Classical ABC"
author: "Miguel de Navascués"
output: learnr::tutorial
runtime: shiny_prerendered
bibliography: src/references.bib
---

These learning resources are available at [GitHub](https://github.com/mnavascues/ABCRFtutorial) & [Zenodo](http://doi.org/10.5281/zenodo.1435503).

## Setup (tl;dr)


This unit uses R packages `learnr` [@Schloerke2020], `abc` [@Csillery2012] and `weights` [@Pasek2018].


```{r setup, include=TRUE}
library(learnr)
# functions for ABC
suppressMessages(library(abc, quietly=T))
# function for weighted histogram
suppressMessages(library(weights, quietly=T))

# load data and reference table
target      <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
ref_table_2 <- readRDS(file="data/ref_table_2.RDS")
```

## Regression ABC

As we have seen in unit 1, there is a trade-off from adding some tolerance to the rejection algorithm: with a big tolerance the results tend to the prior, with a small tolerance the is a loss of precision in the posterior distribution estimate. This problem is partially corrected by an additional local regression step [@Beaumont2002]. The local regression is between the parameter values and the summary statistics from the simulations accepted after the rejection step. Parameter values from those simulations are then *adjusted* by projection using the regression. The regression algorithm allows to increase the proportion of simulations accepted correcting to some degree the *widening* of the posterior distribution.

### Exercise 1

Explore the application of the regression algorithm implemented in `abc()`.

* Compare the posterior probability distribution estimated from adjusted and from unadjusted (original) values.
* Explore how this difference changes with the value of tolerance (for instance `tolerance<-0.8`).
* Try also with Tajima's $D$ as summary statistic (`TD`).
* Try using using $\pi$ and Tajima's $D$ (i.e. `SS <-target[1,c("PI","TD")]`) and skip the scatter plot.
* Try using $\pi$ and some completely uninformative summary statistic (i.e. generate a fake random summary statistic using `runif()` for both reference table and target data).
* Do you think a linear regression makes sense in this case? Open the help for `abc()` function and read about the option `method="neuralnet"`.

```{r eval=T, echo=T}
target      <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
```


```{r abc_regression, exercise=TRUE, exercise.lines=30}
num_of_sims <- 10000 # use a subset of simulation from a total of 1 000 000

SS          <- target[1,"PI"]
SS_prime    <- ref_table_1[seq_len(num_of_sims),"PI"]
theta_prime <- ref_table_1$theta[seq_len(num_of_sims)]

tolerance <- 0.2
abc_result <- abc(target = SS,
                  param = theta_prime,
                  sumstat = SS_prime,
                  tol = tolerance,
                  transf = "log", hcorr = F,
                  method = "loclinear")

theta_prime_accept     <- as.vector(abc_result$unadj.values)
theta_prime_accept_reg <- as.vector(abc_result$adj.values)
SS_prime_accept        <- abc_result$ss
local_regression <- lm(log10(theta_prime_accept)~SS_prime_accept,
                       weights=abc_result$weights)
plot(log10(theta_prime), SS_prime,
     xlab = expression(log[10](theta*"'")),
     ylab = "SS'",
     ylim = c(0,30))
abline(h = SS, col = "blue")
points(log10(theta_prime_accept), SS_prime_accept, col = "blue")
abline(a = -local_regression$coefficients[1]/local_regression$coefficients[2],
       b = 1/local_regression$coefficients[2],
       col = "yellow",
       lwd = 3)
points(x = log10(theta_prime_accept)[1:10],
       y = SS_prime_accept[1:10],col = "yellow")
arrows(x0 = log10(theta_prime_accept)[1:10],
       y0 = SS_prime_accept[1:10],
       x1 = log10(theta_prime_accept_reg)[1:10],
       y1 = SS,
       col = "yellow",
       length = 0.1)
hist(x = log10(theta_prime),
     breaks = seq(-1,2,0.05),
     col = "grey",
     freq = FALSE,
     ylim = c(0,4),
     main = "",
     xlab = expression(log[10](theta)),
     ylab = "probability density")
hist(x = log10(theta_prime_accept),
     breaks = seq(-1,2,0.05),
     col = "blue", freq=FALSE, add=TRUE )
wtd.hist(x = log10(theta_prime_accept_reg),
         breaks = seq(-1.5,2,0.05),
         col = "yellow", freq=FALSE, add=TRUE,
         weight = abc_result$weights)


```

```{r abc_regression-solution}
num_of_sims <- 10000 # use a subset of simulation from a total of 1 000 000
SS          <- SS <-target[1,c("PI","TD")]
SS_prime    <- ref_table_1[seq_len(num_of_sims),c("PI","TD")]
theta_prime <- ref_table_1$theta[seq_len(num_of_sims)]
tolerance <- 0.1
abc_result <- abc(target = SS,
                  param = theta_prime,
                  sumstat = SS_prime,
                  tol = tolerance,
                  transf = "log", hcorr = F,
                  method = "loclinear")
theta_prime_accept     <- as.vector(abc_result$unadj.values)
theta_prime_accept_reg <- as.vector(abc_result$adj.values)
hist(x = log10(theta_prime),
     breaks = seq(-1,2,0.05),
     col = "grey",
     freq = FALSE,
     ylim = c(0,4),
     main = "",
     xlab = expression(log[10](theta)),
     ylab = "probability density")
hist(x = log10(theta_prime_accept),
     breaks = seq(-1,2,0.05),
     col = "blue", freq=FALSE, add=TRUE )
wtd.hist(x = log10(theta_prime_accept_reg),
         breaks = seq(-1.5,2,0.05),
         col = "yellow", freq=FALSE, add=TRUE,
         weight = abc_result$weights)


```


## Practice classical ABC


### Exercise 2

Using ABC, estimate $\theta$ for *octomanati* and *rinocaracol*. Provide a measure of the uncertainty of the estimate:

```{r eval=T, echo=T}
target      <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
```


```{r practice_abc, exercise=TRUE, exercise.lines=30}
octomanati_result <- abc(_____)

______(octomanati_result$adj.values,
             weights = octomanati_result$weights,
             probs = c(0.5, 0.025, 0.975))        
        
```


```{r practice_abc-solution}
octomanati_result <- abc(target = target["octomanati",c("S","PI","NH","TD","FLD")],
                         param = ref_table_1[,"theta"],
                         sumstat = ref_table_1[,c("S","PI","NH","TD","FLD")],
                         tol = 0.01,transf = "log",method = "loclinear")
rinocaracol_result <- abc(target = target["rinocaracol",c("S","PI","NH","TD","FLD")],
                          param = ref_table_1[,"theta"],
                          sumstat = ref_table_1[,c("S","PI","NH","TD","FLD")],
                          tol = 0.01,transf = "log",method = "loclinear")

cat("theta estimate (point estimate and 95%CI) for octomanati")
wtd.quantile(octomanati_result$adj.values,
             weights = octomanati_result$weights,
             probs = c(0.5, 0.025, 0.975))
cat("theta estimate (point estimate and 95%CI) for rinocaracol")
wtd.quantile(rinocaracol_result$adj.values,
             weights = rinocaracol_result$weights,
             probs = c(0.5, 0.025, 0.975))
             

```

## Model choice

We are going to consider two models. A constant size model with parameter $\theta = 2N\mu$  and a population with one instantaneous change of size with parameters $\theta_0 = 2N_0\mu$ (present scaled mutation rate), $\tau = t\mu$ (mutation scaled time of population size change) and $\theta_1 = 2N_1\mu$ (past scaled mutation rate). We already have use the simulations for the first model (reference table in file `ref_table_1.RDS`), the reference table for the second model is available in file `ref_table_2.RDS` which has been generated with the following code:


```{r eval=FALSE, echo=TRUE}
sample_size <- 50
num_of_sims <- 100000
theta0 <- 10^runif(num_of_sims,min=-1,max=2)
theta1 <- 10^runif(num_of_sims,min=-1,max=2)
tau    <- runif(num_of_sims,min=0,max=2)
msout <- ms(nsam = sample_size,
            opt = "-t tbs -eN tbs tbs -T",
            tbs.matrix = cbind(theta0, tau/theta0, theta1/theta0))
tmrca <- theta0*sapply(read.tree(text = msout[grep("//",msout)+1]),
                       get.rooted.tree.height)
msout <- read.ms.output(txt=msout)
ref_table <- data.frame(theta0=theta0,
                        theta1=theta1,
                        tau=tau,
                        tmrca=tmrca,
                        S=S(msout),
                        PI=PI(msout),
                        NH=NH(msout),
                        TD=TajimaD(msout),
                        FLD=FuLiD(msout))
saveRDS(ref_table,file="data/ref_table_2.RDS")
```

### Exercise 3

Explore the goodness of fit for this new model:

```{r eval=T, echo=T}
target      <- readRDS(file="data/target.RDS")
ref_table_2 <- readRDS(file="data/ref_table_2.RDS")
```


```{r goodness_of_fit_model2, exercise=TRUE, exercise.lines=30}
num_of_sims <- 10000

# Principal Component Analysis
PCA_result  <- princomp(ref_table_2[,c("S","PI","NH","TD","FLD")])
pc<-c(1,2) 
plot(PCA_result$scores[seq_len(num_of_sims),pc[1]],
     PCA_result$scores[seq_len(num_of_sims),pc[2]],
     xlab=paste("PCA",pc[1]),ylab=paste("PCA",pc[2]))
PCA_target <- predict(____)

plot(____)
```

```{r goodness_of_fit_model2-solution}
num_of_sims <- 10000

# Principal Component Analysis
PCA_result  <- princomp(ref_table_2[,c("S","PI","NH","TD","FLD")])
pc<-c(1,2) 
plot(PCA_result$scores[seq_len(num_of_sims),pc[1]],
     PCA_result$scores[seq_len(num_of_sims),pc[2]],
     xlab=paste("PCA",pc[1]),ylab=paste("PCA",pc[2]))
# projection of target data
PCA_target <- predict(PCA_result, target[,c("S","PI","NH","TD","FLD")])
points(PCA_target[,pc[1]],PCA_target[,pc[2]],col=c("yellow","orange"))

plot(ref_table_2$TD[seq_len(num_of_sims)],
     ref_table_2$FLD[seq_len(num_of_sims)],
     xlab="Tajima's D", ylab="Fu and Li's D")
points(target$TD, target$FLD, col=c("yellow","orange"))
```

ABC model choice follows a similar process as parameter inference: references tables are pooled together and rejection step is applied to keep simulations closer to the observed data (based in summary statistics). The regression step is different, model is treated as a qualitative variable and a logistic regression is used to estimate posterior probabilities of the model. 

### Exercise 4

Use function `postpr()` from `abc` package to calculate posterior probabilities for each model for *octomanati* and *rinocaracol*:

```{r eval=T, echo=T}
target      <- readRDS(file="data/target.RDS")
ref_table_1 <- readRDS(file="data/ref_table_1.RDS")
ref_table_2 <- readRDS(file="data/ref_table_2.RDS")
```

```{r model_choice, exercise=TRUE, exercise.lines=30}
num_of_sims <- 10000
model <- c(rep("constant population size",num_of_sims),rep("population size change",num_of_sims))
sumstats <- rbind(ref_table_1[seq_len(num_of_sims),c("S","PI","NH","TD","FLD")],
                  ref_table_2[seq_len(num_of_sims),c("S","PI","NH","TD","FLD")])


```

```{r model_choice-solution}
num_of_sims <- 10000
model <- c(rep("constant population size",num_of_sims),rep("population size change",num_of_sims))
sumstats <- rbind(ref_table_1[seq_len(num_of_sims),c("S","PI","NH","TD","FLD")],
                  ref_table_2[seq_len(num_of_sims),c("S","PI","NH","TD","FLD")])

model_choice_octomanati <- postpr(target=target["octomanati",], 
                                  index=model,
                                  sumstat=sumstats,
                                  tol=0.1,
                                  method="mnlogistic")
model_choice_rinocaracol <- postpr(target=target["rinocaracol",], 
                                  index=model,
                                  sumstat=sumstats,
                                  tol=0.1,
                                  method="mnlogistic")
# Calculate Bayes Factor (with prior probability 0.5 for each model)
octomanati_K  <- model_choice_octomanati$pred[2]/model_choice_octomanati$pred[1]
rinocaracol_K <- model_choice_rinocaracol$pred[2]/model_choice_rinocaracol$pred[1]

cat("Model posterior probabilities for octomanati:"); model_choice_octomanati$pred
cat("Corresponding Bayes factor is:"); octomanati_K

cat("Model posterior probabilities for rinocaracol:"); model_choice_rinocaracol$pred
cat("Corresponding Bayes factor is:"); rinocaracol_K
```

In Bayesian statistics the support of one model over the alternative is quantified using the [Bayes factor (K)](https://en.wikipedia.org/wiki/Bayes_factor). The strength of the support is often evaluated following Jeffrey's scale of interpretation:

K    | Strength of evidence for Model 1
-----|--------------------------------
$<10^0$ | Support for the Model 2
$10^0$ to $10^{0.5}$ | Weak
$10^{0.5}$ to $10^{1}$ | Substantial
$10^1$ to $10^{1.5}$ | Strong
$10^{1.5}$ to $10^2$ | Very strong
$>10^2$ | Decisive


#   

#### References