-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.Rmd
260 lines (211 loc) · 16.8 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
---
title: "Vignette: Regression model of FDS"
output: rmdformats::downcute
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Background
In order to estimate frequency-dependent selection (FDS) using the framework of selection gradient analysis, we have proposed a regression model that takes genotype similarity into account as a covariate (Sato et al. 2023 *Evolution*). The purpose of this vignette is to show how to use the proposed method with toy data. Throughout this vignette, we will analyze toy data with a single-locus or GWAS-style structure under either a split or continuous subpopulation structure (see 'Usage' below). The Rmarkdown source and input data for this vignette are available at the following Github repository: <https://github.com/yassato/RegressionFDS>
Before analyzing the toy data, let us review the proposed model. The proposed method is given by either a linear regression model:
$$y_i = \beta_0 + \beta_1x_i + \frac{\beta_2}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + e_i~~~~~(1)$$
or a multiplicative regression model:
$$y_i = \beta_0 + \beta_1x_i + \frac{\beta_2}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + \frac{\beta_{12}x_i}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + e_i~~~~~(2)$$
In these equations $y_i$ is the fitness component of individual $i$; $x_i$ is the allelic status of individual $i$ at a given locus; $\beta_0$ is the intercept; and $\beta_1$ is the coefficient of directional selection on the locus. The term $(\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k$ represents the mean genotype similarity between the focal individual $i$ and its counterpart individuals $j$. As described in Sato et al. (2022), a positive effect of genotype similarity ($\beta_2$) on fitness component represents positive frequency-dependent selection (FDS) on relative fitness, while a negative effect represents negative FDS. The the interaction term coefficient $\beta_{12}$ represents asymmetric FDS on absolute fitness between two alleles.
# Usage
Here, we will first examine the structure of the toy data and then analyze these data using the proposed regression model (Eq. 2). Four types of toy data are prepared for: (1) a single-locus example with split subpopulations; (2) a single-locus example with a continuous space; (3) a GWAS example with split subpopulations; and (4) a GWAS example with continuous subpopulations.
## Single-locus examples
The data for the single-locus examples are as simple as a single data frame can store. They are saved in R Data style (.rds), but CSV and TSV formats are also possible. Using the toy data, we will estimate $\beta_1$, $\beta_2$, and $\beta_{12}$ and depict the fitness functions as follows.
## (1) split subpopulations
For each individual `ID`, data from split subpopulations should include the fitness component measure `y`, morph-type 'gi' (A or a), and its affiliated subpopulation `group`.
```{r td1}
d1 = readRDS("simulation/output/toy1.rds")
head(d1)
```
Before calculating genotype similarity, we convert genotypes into digits. Assuming complete dominance of the A morph over the a morph under Mendelian inheritance, we convert the A and a morphs into +1 and -1, respectively.
```{r replace1}
d1$gi[d1$gi=="A"] = 1
d1$gi[d1$gi=="a"] = -1
d1$gi = as.numeric(d1$gi) # make gi numeric
head(d1)
```
Then we calculate the mean allelic similarity at the focal locus. The following implements the calculation of the mean allelic similarity within a subpopulation based on $(\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k = x_i(\sum^{N_{k}}_{j=1}{x_j}) / N_k = x_i\bar{x_j}$.
```{r gsim1}
gigj = c()
for(i in 1:nrow(d1)) {
ds = d1[-i,]
ds = subset(ds,group==d1$group[i])
gigj = c(gigj,d1$gi[i]*mean(ds$gi))
}
d1 = data.frame(d1,gigj)
hist(d1$gigj,xlim=c(-1,1))
```
**Figure 1.** Histogram of the term $(\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k$ ranging from -1 to +1.
To estimate $\beta_1$, $\beta_2$, and $\beta_{12}$, we use a linear mixed model with the subpopulation ID considered a random effect, as follows.
```{r lmm1}
res = lme4::lmer(y~gi*gigj+(1|group),data=d1)
summary(res)
```
Based on $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_{12}$, we finally visualize fitness functions of FDS. According to Appendix S2 Sato et al. (2022), the fitness function of the A morph and the a morph is given by $y_\mathrm{A} = \beta_0 + \beta_1 + (\beta_{12}+\beta_2)(2f_\mathrm{A}-1)$ and $y_\mathrm{a} = \beta_0 - \beta_1 + (\beta_{12}-\beta_2)(2f_\mathrm{A}-1)$, respectively. The mean fitness is given by $\bar{y} = f_\mathrm{A}y_\mathrm{A} + (1-f_\mathrm{A})y_\mathrm{a}$, where $f_A$ is the frequency of the A morph.
```{r plot1}
freq = c() # calculate allele frequency within a subpopulation
for(i in 1:nrow(d1)) {
ds = subset(d1,group==d1$group[i])
freq = c(freq,mean(ds$gi))
}
freq = (freq/2) + 0.5
d1 = data.frame(d1,freq)
library(ggplot2)
plt = function(b0,b1,b2,b12) {
f_star = 0.5-(b1/(2*b2)) # f_star: equilibrium frequency
p = ggplot(d1, aes(x=freq,y=y)) + geom_jitter(pch=d1$gi+2,colour="grey",width=0.05) + # jitter plots for data points
theme_classic() + ylab("Fitness") + xlab("phenotype-level frequency of A") + xlim(0,1) +
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12+b2)*(2*x-1)+b0+b1 }, args=list(b0,b1,b2,b12)) + # fitness function for A morph
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12-b2)*(2*x-1)+b0-b1 }, args=list(b0,b1,b2,b12), colour=grey(0.0,0.33)) + # fitness function for a morph
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { 2*b2*x*(2*x-1)+(b12-b2)*(2*x-1)+b0-b1+2*b1*x }, args=list(b0,b1,b2,b12), lty=2) + # mean fitness
geom_point(aes(x=f_star,y=(b12+b2)*(2*f_star-1)+b0+b1),pch=16,size=3)
return(p)
}
p = plt(b0=coef(summary(res))[1,1],
b1=coef(summary(res))[2,1],
b2=coef(summary(res))[3,1],
b12=coef(summary(res))[4,1])
p
```
**Figure 2.** Negative frequency-dependent selection at a single locus in split subpopulations. A black line and plus marks indicate the fitness of the A morph. A grey line and open circles indicate the fitness of the a morph. A dashed curve indicates the population-level mean fitness.
## (2) continuous space
Instead of the subpopulation ID, data from continuous space should include the spatial positions of individuals in a two dimensional space along the x- and y-axes (`X` and `Y`). Both continuous values (e.g., GPS locality) and relative distance (e.g., positions in a lattice space) can be used to record the values of `X` and `Y`.
```{r td2}
d2 = readRDS("simulation/output/toy2.rds")
head(d2)
```
Similarly to the split subpopulation case, we convert genotypes into digit in the case of continuous space. The genotype similarity in a continuous space can be calculated using the `nei_coval` function in the `rNeighborGWAS` package. For detailed usage of the `nei_coval` function, see also "GWAS examples" below.
```{r gsim2}
d2$gi[d2$gi=="A"] = 1
d2$gi[d2$gi=="a"] = -1
d2$gi = as.numeric(d2$gi)
smap = cbind(d2$X,d2$Y)
geno = as.matrix(d2$gi)
gigj = rNeighborGWAS::nei_coval(geno,smap,scale=sqrt(2+0.01))
d2 = data.frame(d2,gigj)
```
where `scale=sqrt(2+0.01)` corresponds to the nearest neighbors (i.e.,$N_k$=8). Then, we can use a standard regression model since there are no grouping covariates in the case of a continuous space.
```{r lm2}
res = lm(y~gi*gigj,data=d2)
summary(res)
```
As with split subpopulations, we can depict the fitness functions in response to the local frequency of the A morph.
```{r plot2}
freq = (d2$gi*d2$gigj/2) + 0.5
d2 = data.frame(d2,freq)
plt = function(b0,b1,b2,b12) {
f_star = 0.5-(b1/(2*b2)) # f_star: equilibrium frequency
p = ggplot(d2, aes(x=freq,y=y)) + geom_jitter(pch=d2$gi+2,colour="grey",width=0.05) + # jitter plots for data points
theme_classic() + ylab("Fitness") + xlab("phenotype-level local frequency of A") + xlim(0,1) +
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12+b2)*(2*x-1)+b0+b1 }, args=list(b0,b1,b2,b12)) + # fitness function for A morph
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12-b2)*(2*x-1)+b0-b1 }, args=list(b0,b1,b2,b12), colour=grey(0.0,0.33)) + # fitness function for a morph
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { 2*b2*x*(2*x-1)+(b12-b2)*(2*x-1)+b0-b1+2*b1*x }, args=list(b0,b1,b2,b12), lty=2) + # mean fitness
geom_point(aes(x=f_star,y=(b12+b2)*(2*f_star-1)+b0+b1),pch=16,size=3)
return(p)
}
p = plt(b0=coef(summary(res))[1,1],
b1=coef(summary(res))[2,1],
b2=coef(summary(res))[3,1],
b12=coef(summary(res))[4,1])
p
```
**Figure 3.** Negative frequency-dependent selection at a single locus in a continuous space. A black line and plus marks indicate the fitness of the A morph. A grey line and open circles indicate the fitness of the a morph. A dashed curve indicates the population-level mean fitness.
## GWAS examples
Genotype data in a variant call format (.vcf) can be imported using the `gaston` package. Its `select.snps` function can cut off SNPs with minor allele frequency (MAF). The `gaston2neiGWAS` in the rNeighborGWAS package converts the genotypes in an additive manner (AA, Aa, and aa) into +1, 0, -1, respectively. The rows and columns in `g$geno` contain individuals and loci, respectively.
```{r gdata}
g = gaston::read.vcf("simulation/output/1_NFDS.vcf.gz")
g = gaston::select.snps(g,g@snps$maf>0.01)
g = rNeighborGWAS::gaston2neiGWAS(g)
print(g$geno[1:10,1:10])
```
To analyze FDS, we combine these genotypes with phenotype records as follows.
## (3) split subpopulations
As with the single-locus example, phenotype data from split subpopulations should include the subpopulation ID 'group'.
```{r td3}
d3 = readRDS("simulation/output/toy3.rds")
head(d3)
```
After applying the MAF cut-off to the genotype data, we extract the individuals and assume complete dominance of the A alleles over the a alleles.
```{r ind3}
geno = g$geno[d3$ID,] # select the same individuals as phenotype files
geno[geno==0] = 1 # skipping this line encodes genotypes in an additive way
```
We then calculate the genotype similarity using the `nei_coval` function in the `rNeighborGWAS` package. Note that the `scale=10^6` and `grouping` options allow us to assign individuals to subpopulations and calculate the genotype similarity within subpopulations.
```{r gsim3}
smap = cbind(runif(nrow(d3),0,1),runif(nrow(d3),0,1)) # random dummy map in a small space
g_nei = rNeighborGWAS::nei_coval(geno, # matrix of individual genotypes
smap, # spatial map including positions at x- and y-axis (dummy)
scale=10^6, # scale=LARGE VALUE calculates the genotype similarity all over a subpopulation
grouping=d3$group) # grouping splits data into subpopulations
```
Association tests are performed using the `nei_lmm` function in the `rNeighborGWAS` package. The `asym=TRUE` option is selected to test the asymmetric effects $\beta_{12}$ as well as $\beta_2$ and $\beta_1$.
```{r gwas3}
res = rNeighborGWAS::nei_lmm(geno, # matrix of individual genotypes
g_nei, # matrix of genotype similarities
d3$y, # phenotype vector
addcovar=model.matrix(~d3$group), # covariate can be given by a matrix
asym=TRUE) # asym=TRUE tests beta_12 as well
head(res)
```
Manhattan plots are depicted using the `gaston` package.
```{r man3}
x = data.frame(g$gmap,res$p_nei)
colnames(x) = c("chr","pos","p")
gaston::manhattan(x,las=1,thinning=FALSE)
abline(h=-log10(0.05/nrow(g_nei)),lty=2,col="grey") # Bonferroni correction
```
**Figure 4.** Manhattan plot showing the association score of -log~10~(*p*) against genomic positions in the case of split subpopulations. A horizontal dashed line indicate the genome-wide Bonferroni threshold at *p* < 0.05.
## (4) continuous space
As with the single-locus example, phenotype data from continuous space should include the spatial positions along the x- and y-axes (`X` and `Y`).
```{r td4}
d4 = readRDS("simulation/output/toy4.rds")
head(d4)
```
As in the case of split subpopulations, we extract the individuals and assume complete dominance of the A alleles over the a alleles.
```{r ind4}
geno = g$geno[d4$ID,] # select the same individuals as phenotype files
geno[geno==0] = 1 # skipping this line encodes genotypes in an additive way
```
For a continuous space, we calculate the genotype similarity using the `nei_coval` function in the `rNeighborGWAS` package. The option `scale=sqrt(2+0.01)` corresponds to the nearest neighbors (i.e.,$N_k$=8).
```{r gsim4}
smap = cbind(d4$X,d4$Y)
g_nei = rNeighborGWAS::nei_coval(geno, # matrix of individual genotypes
smap, # spatial map including positions at x- and y-axis
scale=sqrt(2)+0.01) # scale=sqrt(2)+0.01 refers to the nearest neighbors
```
As in the case of split subpopulations, association tests are performed using the `nei_lmm` function in the `rNeighborGWAS` package. The `asym=TRUE` option is selected to test the asymmetric effects $\beta_{12}$ as well as $\beta_2$ and $\beta_1$.
```{r gwas4}
res = rNeighborGWAS::nei_lmm(geno, # matrix of individual genotypes
g_nei, # matrix of genotype similarities
d4$y, # phenotype vector
asym=TRUE) # asym=TRUE tests beta_12 as well
head(res)
```
Manhattan plots are depicted using the `gaston` package.
```{r man4}
x = data.frame(g$gmap,res$p_nei)
colnames(x) = c("chr","pos","p")
gaston::manhattan(x,las=1,thinning=FALSE)
abline(h=-log10(0.05/nrow(g_nei)),lty=2,col="grey")
```
**Figure 5.** Manhattan plot showing the association score of -log~10~(*p*) against genomic positions in the case of continuous space. A horizontal dashed line indicate the genome-wide Bonferroni threshold at *p* < 0.05.
# Notes
### When the interaction term is significant:
If $\beta_{12}$ is significant, the estimates of $\beta_1$ and $\beta_2$ may not be correct in Equation (2). Practically, we should first test $\beta_{12}$ using the multiplicative model Equation (2). If $\beta_{12}$ is significant in Equation (2), we then estimate $\beta_2$ using the linear model Equation (1). Then $\hat{\beta}_{12}$ from Equation (2), $\hat{\beta}_1$ and $\hat{\beta}_2$ from Equation (1) may be used to depict the fitness functions.
### How to analyze multi-year/site data on continous space:
We can jointly use the `scale` and `grouping` option in the `nei_coval` function in the `rNeighborGWAS` package. If data on a continuous space are collected from multiple years or sites, individuals in different years or sites should be separated using the `grouping` option. The years or sites should be made numeric and assigned to the `grouping` option. Then, the neighbor space to be referred should be specified using the `scale` option. In addition to these options, the `addcover` option in the `nei_lmm` function allows us to consider fixed effects of years or sites when performing association tests. See the vignette of the rNeighborGWAS package for this usage at <https://cran.r-project.org/web/packages/rNeighborGWAS/vignettes/rNeighborGWAS.html>.
### How to determine the spatial scale in a continuous space:
This is not easy to determine but feasible by calculating the proportion of fitness variation explained (PVE) by the polygenic effects from the genotype similarity. If PVE peaks at a particular spatial scale, this scale would be optimal. See also Sato et al. (2021) *Heredity* for details. The vignette of the rNeighborGWAS package (<https://cran.r-project.org/web/packages/rNeighborGWAS/vignettes/rNeighborGWAS.html>) also provides more information on this topic.
# Citation
When you need to cite this instruction, please cite the following publication: Sato Y, Takahashi Y, Xu C, Shimizu KK. (2023) Detecting frequency-dependent selection through the effects of genotype similarity on fitness components. Evolution. https://doi.org/10.1093/evolut/qpad028. This instruction file was developed as a part of the supplementary materials of Sato et al., (2023) *Evolution* during its peer-review.
# References
1. Bates D, Maechler M, Bolker B, Walker S. (2015) Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01.
1. Perdry H, Dandine-Roulland C. (2020) gaston: Genetic Data Handling (QC, GRM, LD, PCA) & Linear Mixed Models. R package version 1.5.7. https://CRAN.R-project.org/package=gaston
1. Sato Y, Yamamoto E, Shimizu KK, Nagano AJ. (2021) Neighbor GWAS: incorporating neighbor genotypic identity into genome-wide association studies of field herbivory. Heredity 126(4):597-614. https://doi.org/10.1038/s41437-020-00401-w
1. Sato Y, Takahashi Y, Xu C, Shimizu KK. (2023) Detecting frequency-dependent selection through the effects of genotype similarity on fitness components. Evolution. https://doi.org/10.1093/evolut/qpad028
1. Wickham H. (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.