05_sf2_thresholds.Rmd

---
title: "Search for selective sweeps with SweepFinder2"
output:
  github_document:
    pandoc_args: --webtex
bibliography: bibliography.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8,
                      echo=FALSE, warning=FALSE, message=FALSE)
library(tidyverse)
source("R/constants.R")
```

## SweepFinder2 analysis on real data

Input data for SweepFinder2 analyses was generated using the python script [vcf2dadi.py](bin/vcf2dadi.py) with the sweepfinder output option as follows;
```bash
vcf2dadi.py --sweepfinder -p clean.poplist.${pop}.txt ${contig}.vcf | awk '$2>0' > ${pop}/${contig}.af
```
where the file `clean.poplist.${pop}.txt` is an outlier free list of samples for each of the five individual reefs as well as a combined population (called `nomi`) representing all northern reefs.  

SweepFinder was then run in grid mode for each contig in each population with grid points spaced every 1000bp across the genome. (See scripts [02_empirical_freq_spectrum.sh](hpc/SF2/02_empirical_freq_spectrum.sh) and [03_find_sweeps.sh](hpc/SF2/03_find_sweeps.sh)).

Full sweepfinder results were then converted to discrete regions by amalgamating adjacent values greater than a threshold of 10 (See [04_sweeps_to_gff.sh](hpc/SF2/04_sweeps_to_gff.sh)).  For each such region we assign its CLR value to the maximum value in the region.

## Generation of neutral background CLR distribution

Sweepfinder2 calculates a composite likelihood ratio (CLR) statistic across a grid of positions distributed evenly across the entire genome.  Since the CLR incorporates information from non-independent sites it cannot be used to directly calculate probabilities of error.  An even more important issue is the fact that high CLR values can occur under neutral evolution due to demographic changes such as recent population bottlenecks or admixture between historically separate populations.

To account for these effects we ran simulations using the program `ms` under various demographic scenarios and ran SweepFinder2 analyses on the resulting outputs. These form a neutral model against which we were able to compare SweepFinder2 results obtained from observed alleles in _A. tenuis_ populations from Magnetic Island and Northern populations.

The full suite of demographic models used included all [dadi models](04_dadi.md) as well as population trajectories inferred by [MSMC](03_msmc.md). In the case of dadi models we generated `ms` commands using the included script [dadi_to_ms.py](hpc/ddadi/dadi_to_ms.py) and checked that these were consistent by running the script [check_ms.py](hpc/dadi/check_ms.py).  This latter script generates data using ms and plots residuals based on this data and the corresponding dadi model.  We visually inspected these residuals plots to ensure they were centered on zero, had no biases and low variance. (See for example [this plot](hpc/dadi/mscore_residuals/isolation_asym_mig.png))

For msmc models we use the population trajectories calculated for Magnetic Island and Fitzroy Island individuals as input to the script [msmc2ms.py](hpc/ms/msmc2ms.py) by Daniel Weissman.  Outputs from this were converted to Sweepfinder input using awk and full details are provided in this [bash script](hpc/ms/02_simulate_msmc.sh).

Once ms commands had been generated for all models they were used to generate a neutral background for the CLR statistic as follows;

- `ms` was run on 500 x 1Mb chunks for each demographic model and outputs processed with `awk` to convert to SweepFinder2 format (see scripts [02_simulate_msmc.sh](hpc/ms/02_simulate_msmc.sh) and [03_simulate_dadi.sh](hpc/ms/03_simulate_dadi.sh)). 
- SweepFinder2 was run on each chunk using an identical uniform grid across all simulations (see [04_combine.sh](hpc/ms/04_combine.sh) and [05_find_sweeps.sh](hpc/ms/05_find_sweeps.sh))
- SweepFinder2 outputs were then processed to find distinct regions of the genome where the CLR statistic was above a background noise threshold (set to 10).  This step is required in order to reduce multiple adjacent grid positions into a single independent value and was performed using the script [sf2gff.py](bin/sf2gff.py).  For each such region we assign its CLR value to the maximum value within the region.

## False discovery rate calculation

Assuming that neutral demographic models simulated with `ms` represent an accurate background distribution we calculated an empirical false discovery rate as follows; 

- Combine sweep loci from real data with those obtained under neutral simulations into a single list
- Sort the list by CLR value from lowest to highest
- For a given threshold value, $T$ count the number of loci in the neutral dataset with $CLR>T$ and call this $F$ (these represent false discoveries). Also count the number of real loci with $CLR>T$ and call this $R$.  The false discovery rate for a given $T$ is then;
$$FDR_{T} = \frac{F}{F+R}$$
- The total number of sweep loci will differ slightly between simulated and real data but our goal is to compare these datasets purely on the shape of their distribution.  We therefore apply a correction factor $s = \frac{N_{real}}{N_{neutral}}$ and use $FDR_{T corrected}=s FDR_{T}$. This will increase the FDR value if the real dataset has many more loci above the noise threshold than the neutral dataset and vice versa. 

```{r}
read_ms_twopop_sweeps <- function(model){
  mpath <- paste("hpc/ms/ms_",model,"_mi_10_sweeps.gff",sep="")
  mis <- read_tsv(mpath,col_names = gff_cols,col_types = cols()) %>% 
    add_column(model=model) %>% 
    add_column(population="mi")

  mpath <- paste("hpc/ms/ms_",model,"_nomi_10_sweeps.gff",sep="")  
  nomis <- read_tsv(mpath,col_names = gff_cols,col_types = cols()) %>% 
    add_column(model=model) %>% 
    add_column(population="nomi")
    
  rbind(mis,nomis)
}

calc_fdr_pop <- function(indata,pop,decoy_model){

  data <- indata %>% filter(model %in% c(decoy_model,'real')) %>% filter(population==pop)
  ndecoy <- data %>% filter(model==decoy_model) %>% nrow()
  nreal <- data %>% filter(model=='real') %>% nrow()
  
  decoy_scaling <- nreal/ndecoy
  
#  print(decoy_scaling)
  
  data %>% ungroup() %>% 
    filter(population==pop) %>% 
    filter(model %in% c(decoy_model,"real")) %>% 
    mutate(is_decoy=ifelse(model==decoy_model,1,0)) %>% 
    arrange(desc(LR)) %>% 
    mutate(n_decoy = cumsum(is_decoy)) %>% 
    mutate(decoy_ratio = (n_decoy*decoy_scaling)/(row_number())) %>% 
    mutate(rn =row_number())
}

calc_fdr <- function(decoy_model,data){
  mi_fdr_data <- data %>% calc_fdr_pop('mi',decoy_model)
  nomi_fdr_data <- data %>% calc_fdr_pop('nomi',decoy_model)
  fdr_d <- rbind(mi_fdr_data,nomi_fdr_data)
  fdr_d %>% filter(model=='real') %>% 
    mutate(model=decoy_model)#%>% add_column(model=model)
}
```


```{r}
gff_cols <- c('scaffold','source','feature','start','end','score','strand','phase','attributes')

models <- (list.files("hpc/ms/","*nomi_10_sweeps.gff") %>% str_match("ms_(.*)_nomi"))[,2]


mi_sweeps <- read_tsv("hpc/SF2/mi_10_sweeps.gff",col_names = gff_cols,col_types = cols()) %>% 
    add_column(model='real') %>% 
    add_column(population="mi")

nomi_sweeps <- read_tsv("hpc/SF2/nomi_10_sweeps.gff",col_names = gff_cols,col_types = cols()) %>% 
    add_column(model='real') %>% 
    add_column(population="nomi")

discrete_sweeps <- do.call(rbind,lapply(models,read_ms_twopop_sweeps))
discrete_sweeps <- rbind(mi_sweeps,nomi_sweeps,discrete_sweeps)
discrete_sweeps <- discrete_sweeps %>% mutate(LR=score)


discrete_sweeps_fdr <- calc_fdr("asym_mig",discrete_sweeps)


discrete_fdr_d_all <- do.call(rbind,lapply(c("msmc",models),calc_fdr,discrete_sweeps))
```

This leads to the following false discovery rates for various neutral models for the Magnetic Island and Northern populations respectively. Red vertical lines represent CLR threshold values for which the FDR reaches 10%.

```{r}
library(ggpubr)


two_epoch_models <- c("no_mig","sym_mig","asym_mig","priorsize_asym_mig","asym_mig_size","isolation_asym_mig","msmc")
named_model_order <- 1:7
names(named_model_order) <- two_epoch_models

discrete_fdr_d_all_2e <- discrete_fdr_d_all %>% 
  filter(model %in% two_epoch_models) %>% 
  mutate(model_order = named_model_order[model]) %>% 
  mutate(population = ifelse(population=="mi","Magnetic Island","North"))

discrete_fdr_d_all_2e$model <- factor(discrete_fdr_d_all_2e$model, levels = two_epoch_models)

ggplot(discrete_fdr_d_all_2e,aes(x=LR)) + 
  geom_line(aes(y=decoy_ratio)) + 
  xlim(0,250) + ylim(0,1) +
  facet_grid(model~population) +
  geom_hline(yintercept=0.1, color='red') + 
  ylab("False Discovery Rate") +
  xlab("Sweepfinder CL Threshold, T") + theme_pubclean() +
  theme(strip.text = element_text(size = 8))

```
```{r}
discrete_fdr_d_all_2e %>% 
  filter(model=='isolation_asym_mig') %>% 
  filter(population=="North") %>% 
ggplot(aes(x=LR)) + 
  geom_line(aes(y=decoy_ratio)) + 
  xlim(0,250) + ylim(0,1) +
  facet_grid(model~population) +
  geom_hline(yintercept=0.1, color='red') + 
  ylab("False Discovery Rate") +
  xlab("Sweepfinder CL Threshold, T") + theme_pubclean() +
  theme(strip.text = element_text(size = 8))

ggsave("figures/iso_asym_mig_fdr.png",width=6,height=4)
```

For the analysis including all northern reefs together we used a CLR threshold of 100 whereas when looking at reefs separately we used CLR threshold of 50.  What FDR do these correspond to

```{r}
t100 <- discrete_fdr_d_all_2e %>% group_by(model,population) %>%
  filter(LR>=100) %>% summarise(fdr = max(decoy_ratio),threshold=100) %>% 
  filter(model=='isolation_asym_mig', population=='North')
t50 <- discrete_fdr_d_all_2e %>% group_by(model,population) %>%
  filter(LR>=50) %>% summarise(fdr = max(decoy_ratio),threshold=50) %>% 
  filter(model=='isolation_asym_mig', population=='North')
knitr::kable(rbind(t100,t50))
```