-
Notifications
You must be signed in to change notification settings - Fork 3
/
05_sf2_thresholds.Rmd
179 lines (131 loc) · 9.63 KB
/
05_sf2_thresholds.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
---
title: "Search for selective sweeps with SweepFinder2"
output:
github_document:
pandoc_args: --webtex
bibliography: bibliography.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8,
echo=FALSE, warning=FALSE, message=FALSE)
library(tidyverse)
source("R/constants.R")
```
## SweepFinder2 analysis on real data
Input data for SweepFinder2 analyses was generated using the python script [vcf2dadi.py](bin/vcf2dadi.py) with the sweepfinder output option as follows;
```bash
vcf2dadi.py --sweepfinder -p clean.poplist.${pop}.txt ${contig}.vcf | awk '$2>0' > ${pop}/${contig}.af
```
where the file `clean.poplist.${pop}.txt` is an outlier free list of samples for each of the five individual reefs as well as a combined population (called `nomi`) representing all northern reefs.
SweepFinder was then run in grid mode for each contig in each population with grid points spaced every 1000bp across the genome. (See scripts [02_empirical_freq_spectrum.sh](hpc/SF2/02_empirical_freq_spectrum.sh) and [03_find_sweeps.sh](hpc/SF2/03_find_sweeps.sh)).
Full sweepfinder results were then converted to discrete regions by amalgamating adjacent values greater than a threshold of 10 (See [04_sweeps_to_gff.sh](hpc/SF2/04_sweeps_to_gff.sh)). For each such region we assign its CLR value to the maximum value in the region.
## Generation of neutral background CLR distribution
Sweepfinder2 calculates a composite likelihood ratio (CLR) statistic across a grid of positions distributed evenly across the entire genome. Since the CLR incorporates information from non-independent sites it cannot be used to directly calculate probabilities of error. An even more important issue is the fact that high CLR values can occur under neutral evolution due to demographic changes such as recent population bottlenecks or admixture between historically separate populations.
To account for these effects we ran simulations using the program `ms` under various demographic scenarios and ran SweepFinder2 analyses on the resulting outputs. These form a neutral model against which we were able to compare SweepFinder2 results obtained from observed alleles in _A. tenuis_ populations from Magnetic Island and Northern populations.
The full suite of demographic models used included all [dadi models](04_dadi.md) as well as population trajectories inferred by [MSMC](03_msmc.md). In the case of dadi models we generated `ms` commands using the included script [dadi_to_ms.py](hpc/ddadi/dadi_to_ms.py) and checked that these were consistent by running the script [check_ms.py](hpc/dadi/check_ms.py). This latter script generates data using ms and plots residuals based on this data and the corresponding dadi model. We visually inspected these residuals plots to ensure they were centered on zero, had no biases and low variance. (See for example [this plot](hpc/dadi/mscore_residuals/isolation_asym_mig.png))
For msmc models we use the population trajectories calculated for Magnetic Island and Fitzroy Island individuals as input to the script [msmc2ms.py](hpc/ms/msmc2ms.py) by Daniel Weissman. Outputs from this were converted to Sweepfinder input using awk and full details are provided in this [bash script](hpc/ms/02_simulate_msmc.sh).
Once ms commands had been generated for all models they were used to generate a neutral background for the CLR statistic as follows;
- `ms` was run on 500 x 1Mb chunks for each demographic model and outputs processed with `awk` to convert to SweepFinder2 format (see scripts [02_simulate_msmc.sh](hpc/ms/02_simulate_msmc.sh) and [03_simulate_dadi.sh](hpc/ms/03_simulate_dadi.sh)).
- SweepFinder2 was run on each chunk using an identical uniform grid across all simulations (see [04_combine.sh](hpc/ms/04_combine.sh) and [05_find_sweeps.sh](hpc/ms/05_find_sweeps.sh))
- SweepFinder2 outputs were then processed to find distinct regions of the genome where the CLR statistic was above a background noise threshold (set to 10). This step is required in order to reduce multiple adjacent grid positions into a single independent value and was performed using the script [sf2gff.py](bin/sf2gff.py). For each such region we assign its CLR value to the maximum value within the region.
## False discovery rate calculation
Assuming that neutral demographic models simulated with `ms` represent an accurate background distribution we calculated an empirical false discovery rate as follows;
- Combine sweep loci from real data with those obtained under neutral simulations into a single list
- Sort the list by CLR value from lowest to highest
- For a given threshold value, $T$ count the number of loci in the neutral dataset with $CLR>T$ and call this $F$ (these represent false discoveries). Also count the number of real loci with $CLR>T$ and call this $R$. The false discovery rate for a given $T$ is then;
$$FDR_{T} = \frac{F}{F+R}$$
- The total number of sweep loci will differ slightly between simulated and real data but our goal is to compare these datasets purely on the shape of their distribution. We therefore apply a correction factor $s = \frac{N_{real}}{N_{neutral}}$ and use $FDR_{T corrected}=s FDR_{T}$. This will increase the FDR value if the real dataset has many more loci above the noise threshold than the neutral dataset and vice versa.
```{r}
read_ms_twopop_sweeps <- function(model){
mpath <- paste("hpc/ms/ms_",model,"_mi_10_sweeps.gff",sep="")
mis <- read_tsv(mpath,col_names = gff_cols,col_types = cols()) %>%
add_column(model=model) %>%
add_column(population="mi")
mpath <- paste("hpc/ms/ms_",model,"_nomi_10_sweeps.gff",sep="")
nomis <- read_tsv(mpath,col_names = gff_cols,col_types = cols()) %>%
add_column(model=model) %>%
add_column(population="nomi")
rbind(mis,nomis)
}
calc_fdr_pop <- function(indata,pop,decoy_model){
data <- indata %>% filter(model %in% c(decoy_model,'real')) %>% filter(population==pop)
ndecoy <- data %>% filter(model==decoy_model) %>% nrow()
nreal <- data %>% filter(model=='real') %>% nrow()
decoy_scaling <- nreal/ndecoy
# print(decoy_scaling)
data %>% ungroup() %>%
filter(population==pop) %>%
filter(model %in% c(decoy_model,"real")) %>%
mutate(is_decoy=ifelse(model==decoy_model,1,0)) %>%
arrange(desc(LR)) %>%
mutate(n_decoy = cumsum(is_decoy)) %>%
mutate(decoy_ratio = (n_decoy*decoy_scaling)/(row_number())) %>%
mutate(rn =row_number())
}
calc_fdr <- function(decoy_model,data){
mi_fdr_data <- data %>% calc_fdr_pop('mi',decoy_model)
nomi_fdr_data <- data %>% calc_fdr_pop('nomi',decoy_model)
fdr_d <- rbind(mi_fdr_data,nomi_fdr_data)
fdr_d %>% filter(model=='real') %>%
mutate(model=decoy_model)#%>% add_column(model=model)
}
```
```{r}
gff_cols <- c('scaffold','source','feature','start','end','score','strand','phase','attributes')
models <- (list.files("hpc/ms/","*nomi_10_sweeps.gff") %>% str_match("ms_(.*)_nomi"))[,2]
mi_sweeps <- read_tsv("hpc/SF2/mi_10_sweeps.gff",col_names = gff_cols,col_types = cols()) %>%
add_column(model='real') %>%
add_column(population="mi")
nomi_sweeps <- read_tsv("hpc/SF2/nomi_10_sweeps.gff",col_names = gff_cols,col_types = cols()) %>%
add_column(model='real') %>%
add_column(population="nomi")
discrete_sweeps <- do.call(rbind,lapply(models,read_ms_twopop_sweeps))
discrete_sweeps <- rbind(mi_sweeps,nomi_sweeps,discrete_sweeps)
discrete_sweeps <- discrete_sweeps %>% mutate(LR=score)
discrete_sweeps_fdr <- calc_fdr("asym_mig",discrete_sweeps)
discrete_fdr_d_all <- do.call(rbind,lapply(c("msmc",models),calc_fdr,discrete_sweeps))
```
This leads to the following false discovery rates for various neutral models for the Magnetic Island and Northern populations respectively. Red vertical lines represent CLR threshold values for which the FDR reaches 10%.
```{r}
library(ggpubr)
two_epoch_models <- c("no_mig","sym_mig","asym_mig","priorsize_asym_mig","asym_mig_size","isolation_asym_mig","msmc")
named_model_order <- 1:7
names(named_model_order) <- two_epoch_models
discrete_fdr_d_all_2e <- discrete_fdr_d_all %>%
filter(model %in% two_epoch_models) %>%
mutate(model_order = named_model_order[model]) %>%
mutate(population = ifelse(population=="mi","Magnetic Island","North"))
discrete_fdr_d_all_2e$model <- factor(discrete_fdr_d_all_2e$model, levels = two_epoch_models)
ggplot(discrete_fdr_d_all_2e,aes(x=LR)) +
geom_line(aes(y=decoy_ratio)) +
xlim(0,250) + ylim(0,1) +
facet_grid(model~population) +
geom_hline(yintercept=0.1, color='red') +
ylab("False Discovery Rate") +
xlab("Sweepfinder CL Threshold, T") + theme_pubclean() +
theme(strip.text = element_text(size = 8))
```
```{r}
discrete_fdr_d_all_2e %>%
filter(model=='isolation_asym_mig') %>%
filter(population=="North") %>%
ggplot(aes(x=LR)) +
geom_line(aes(y=decoy_ratio)) +
xlim(0,250) + ylim(0,1) +
facet_grid(model~population) +
geom_hline(yintercept=0.1, color='red') +
ylab("False Discovery Rate") +
xlab("Sweepfinder CL Threshold, T") + theme_pubclean() +
theme(strip.text = element_text(size = 8))
ggsave("figures/iso_asym_mig_fdr.png",width=6,height=4)
```
For the analysis including all northern reefs together we used a CLR threshold of 100 whereas when looking at reefs separately we used CLR threshold of 50. What FDR do these correspond to
```{r}
t100 <- discrete_fdr_d_all_2e %>% group_by(model,population) %>%
filter(LR>=100) %>% summarise(fdr = max(decoy_ratio),threshold=100) %>%
filter(model=='isolation_asym_mig', population=='North')
t50 <- discrete_fdr_d_all_2e %>% group_by(model,population) %>%
filter(LR>=50) %>% summarise(fdr = max(decoy_ratio),threshold=50) %>%
filter(model=='isolation_asym_mig', population=='North')
knitr::kable(rbind(t100,t50))
```