-
Notifications
You must be signed in to change notification settings - Fork 3
/
00.Simulation.Rmd
executable file
·213 lines (162 loc) · 4.95 KB
/
00.Simulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
---
######################################
# Click on "Run Document" in RStudio #
######################################
title: "Quick Introduction to Simulation"
author: "Miguel de Navascués"
output: learnr::tutorial
runtime: shiny_prerendered
bibliography: src/references.bib
---
These learning resources are available at [GitHub](https://github.com/mnavascues/ABCRFtutorial) & [Zenodo](http://doi.org/10.5281/zenodo.1435503).
## Setup (tl;dr)
This unit uses R packages `learnr` [@Schloerke2020] and `phyclust` [@Chen2011], function to read *ms* output from `readms.output.R` script distributed with `ms` [@Hudson2002], and functions to calculate summary statistics from *ms* simulations in file `ms_sumstats.R` are modified from [@Stadler2009a; @Stadler2009b].
```{r setup, include=TRUE}
library(learnr)
# function ms() (Hudson's coalescent simulator)
library(phyclust, quietly=T)
# function read.ms.output()
source("src/readms.output.R", local =T)
# functions to calculate summary statistics from ms output
source("src/ms_sumstats.R", local =T)
```
## What is a simulation?
Discuss:
- What is a computer simulation?
- What are the sources of randomness in a simulation?
- How are random numbers produced in a computer?
## Simulation of a coin toss
### Exercise 1
Simulate a ten coin tosses ($n=10$) using R function `sample(_____)` and probability of getting head $p$. Use a seed for random number generator so your simulation is reproducible:
```{r simulation_coin_toss, exercise=TRUE, exercise.lines=5}
n <- 10
p <- 0.5
sample(x=c("H","T"),size=_____, replace=_____, prob=_____)
```
```{r simulation_coin_toss-hint}
n <- 10
p <- 0.5
sample(x=c("H","T"),size=n, replace=T, prob=c(p,1-p))
```
```{r simulation_coin_toss-solution}
set.seed(123456)
n <- 10
p <- 0.5
sample(x=c("H","T"),size=n, replace=T, prob=c(p,1-p))
```
### Exercise 2
Simulate directly the number of heads ($h'$) using `rbinom(_____)`:
```{r simulation_coin_toss_rbinom, exercise=TRUE}
n <- 10
p <- 0.5
num_of_sims = 1
rbinom(n=_____,size=_____,prob=_____)
```
```{r simulation_coin_toss_rbinom-solution}
n <- 10
p <- 0.5
num_of_sims = 1
rbinom(n=num_of_sims,size=n,prob=p)
```
## Simulation of population genetic data
Simulation via the coalescent. For convenience, we will use the R package `phyclust` [@Chen2011] that includes a version of coalescent simulator `ms` in this tutorial. In practice, use of other simulators (e.g. msprime [@Kelleher2016], SLiM [@Haller2017], fastsimcoal [@Excoffier2011] or the original implementation of ms [@Hudson2002]) is recommended.
### Exercise 3
Simulate with `ms()` a sample of $n=10$ individuals sequenced for a mitochondrial DNA locus of lenght $l=10000$bp in a population of size $N=10000$ with mutation rate (per bp) $\mu=10^{-8}$:
```{r coalescent_sim, exercise=TRUE}
n<-10
l<-10000
N<-10000
u<-1e-8
num_of_sims<-1
_____
ms(nsam=_____,nreps=_____,opt=_____)
```
```{r coalescent_sim-hint}
n<-10
l<-10000
N<-10000
u<-1e-8
num_of_sims<-1
theta <- _____
ms(nsam=_____,nreps=_____,opt=paste("-t",theta))
```
```{r coalescent_sim-solution}
n<-10
l<-10000
N<-10000
u<-1e-8
num_of_sims<-1
theta <- 2*N*u*l
ms(nsam=n,nreps=num_of_sims,opt=paste("-t",theta))
```
We can put this text into a more usable form (i.e. matrix of polymorphisms, etc) using a function distributed with the original software `ms` and that we have loaded in the setup with:
```{r eval=FALSE, echo=TRUE}
source("src/readms.output.R")
```
### Exercise 4
Use `read.ms.output()` function to read ms output and print only the genotype matrix (i.e. `gametes`).
```{r coalescent_sim_segsites, exercise=TRUE, exercise.lines=10}
n<-10
l<-10000
N<-10000
u<-1e-8
num_of_sims<-1
theta <- 2*N*u*l
msout <- ms(nsam=n,nreps=num_of_sims,opt=paste("-t",theta))
sim_data <- read.ms.output(____)
_____
```
```{r coalescent_sim_segsites-hint}
n<-10
l<-10000
N<-10000
u<-1e-8
num_of_sims<-1
theta <- 2*N*u*l
msout <- ms(nsam=n,nreps=num_of_sims,opt=paste("-t",theta))
sim_data <- read.ms.output(txt=msout)
sim_data$_____
```
```{r coalescent_sim_segsites-solution}
n<-10
l<-10000
N<-10000
u<-1e-8
num_of_sims<-1
theta <- 2*N*u*l
msout <- ms(nsam=n,nreps=num_of_sims,opt=paste("-t",theta))
sim_data <- read.ms.output(txt=msout)
sim_data$gametes
```
### Exercise 5
Calculate summary statistics (number of segregating sites, number of haplotypes, $\pi$ and Tajima's $D$) for the simulations. Functions `S()`, `NH()`, `PI()`, `TajimaD()` and `FuLiD()` have been loaded using:
```{r eval=FALSE, echo=TRUE}
source("src/ms_sumstats.R")
```
```{r summary_stats, exercise=TRUE, exercise.lines=15}
n<-10
l<-10000
N<-10000
u<-1e-8
num_of_sims <- 10
theta <- 2*N*u*l
msout <- ms(nsam=n,nreps=num_of_sims,opt=paste("-t",theta))
sim_data <- read.ms.output(txt=msout)
```
```{r summary_stats-solution}
n<-10
l<-10000
N<-10000
u<-1e-8
num_of_sims <- 10
theta <- 2*N*u*l
msout <- ms(nsam=n,nreps=num_of_sims,opt=paste("-t",theta))
sim_data <- read.ms.output(txt=msout)
S(sim_data)
NH(sim_data)
PI(sim_data)
TajimaD(sim_data)
FuLiD(sim_data)
```
#
#### References