- Single-environment model
This model is obtained by regressing the phenotype vector containing the n records available in the kth environment (k=1,2,...,s environments), , where i=1,2,...,n indexes lines (individuals), on p markers using a linear model in the form
where is the intercept,
is a n-vector of ones,
is the
matrix of centered and standardized markers available in the kth environment,
is a p-vector of marker effects, and
is the n-vector of residuals.
This model can be represented as a G-BLUP model by setting , this is:
where is a random effect assumed
with the genomic relationship matrix estimated as
.
- Across-environments model
This model assumes that effects of markers are the same across environments, this is . The model above can be simultaneously fitted for all environments as (assume s=3 environments)
Similarly, it can be represented also as a G-BLUP model, by making , as:
where the random effect is assumed
with
being the marker-derived genomic relationship calculated as
- MxE model
It models GxE interaction using a marker-by-environment (MxE) approach in which the effect of the jth marker on the kth environment, , is descomposed into an effect that is common to all environments (bj0) and an effect that is specific to each environment (bjk), this is
. Thus, the multi-environmental model is
Likewise, by making and
, the G-BLUP representation is
where the random effects
are assumed and
, respectively. Here,
is described previously and
- Reaction Norm model
The G-BLUP model presented in the across-environment approach can be extended to incorporate GxE by introducing covariance structures as a funcion of the marker information. This model, assumes the environment (Ek) as a random effect as . These random effects are conected with individuals through the design matrix Ze. Thus, the main effect of environments, e = ZeE, is
where are the effects of the environment k for the ith individual (i=1,...,n). The reaction norm model, incorporates GxE by introducing the interaction terms
, as
where is the interaction term between the random effect of markers and environments. It can be proved that, aproximately,
where was described above and
denotes the element by element product (known as Hadamard product).
Note. The Hadamard product above will yield a matrix similar to above described but with
. This is, reaction norm model estimates environment-specific effects with constant variance
in contrast wth MxE model that accomodates environment-specific variance,
, k=1,...,s.
Performance of the Reaction Norm and MxE model will be compared with that of the across-environments model that ignores GxE modeling and the single-environment model which is fitted within each environment.
The prediction power of the model will be assessed using the training-testing (TRN-TST) random partitions approach. Data is randomly splitted into training and testing sets. Model parameters are estimated in training set and model is tested in TST set. Two main estimations problems are addressed using the multi-environments models.
-
Cross Validation 1 (CV1). Represent a scheme of prediction of lines that have not been evaluated in any field trials.
-
Cross Validation 2 (CV2). Represent a scheme of prediction of lines that have been evaluated in some but all target environments. Thus, prediction of non-evaluated lines benefits from borrowing of information from lines that were evaluated in other environments.
In our case, we will use 70% of the data for training set and the remaining 30% for the testing set. For CV1, we will create a scheme in which 30% of the lines are missing in all environments. CV2 scheme is created by having 30% of the entries missing in one environment but present in all the rest of environments.
This procedure of TRN-TST can be repeated many times to allow for estimation of standard errors (SE).
The following R code, prepareData_multi.R, can be used to prepare the data for analizes of the multi-environmental models.
setwd("/mnt/home/lopezcru/GS")
rm(list=ls())
library(BGLR)
# Load data
data(wheat)
X <- wheat.X
Y <- wheat.Y
# Select environments. For instance, environments 2,4, and 5
Y <- Y[,c(2,3,4)]
# Genomic relationship matrix
M <- scale(X)
G <- tcrossprod(M)/ncol(X)
# Design matrix for individuals. It connects individuals with environments
GID <- factor(rep(rownames(Y),ncol(Y)),levels=rownames(Y))
Zg <- model.matrix(~GID-1)
# Design matrix for environments. Used in the multi-environment R-Norm model
envID <- factor(rep(colnames(Y),each=nrow(Y)),levels=colnames(Y))
ZE <- model.matrix(~envID-1)
# Covariance structure for effects
ZgGZgt <- Zg%*%G%*%t(Zg) # Genetic effect
ZEZEt <- tcrossprod(ZE) # Environmental effect
GE <- ZgGZgt*ZEZEt # GxE interaction term (R-Norm model)
# Eigen decomposition (to speed computational time)
eigen_G <- eigen(G)
eigen_G0 <- eigen(ZgGZgt)
eigen_GE <- eigen(GE)
# Interaction terms (MxE model)
MxE_eigen <- vector("list",ncol(Y))
for(env in 1:ncol(Y)){
tmp <- rep(0,ncol(Y)) ; tmp[env] <- 1; G1 <- kronecker(diag(tmp),G)
MxE_eigen[[env]] <- eigen(G1)
}
# Save prepared data
dir.create("multiEnvironment")
save(Y,envID,eigen_G,eigen_G0,eigen_GE,MxE_eigen,file="multiEnvironment/prepData_multi.RData")
Code below, get_VarComps_multi.R script, can be used after 'data preparation' part to fit all the models and to extract variance components.
setwd("/mnt/home/lopezcru/GS")
rm(list=ls())
library(BGLR)
load("multiEnvironment/prepData_multi.RData")
n <- nrow(Y)
nEnv <- ncol(Y)
y <- as.vector(Y)
set.seed(123)
# Matrix to store results. It will save variance components for each model
outVAR <- matrix(NA,ncol=4,nrow=1+2*nEnv)
dimnames(outVAR) <- list(c("Main",rep(paste0("Env ",colnames(Y)),2)),c("Single","Across","MxE","R-Norm"))
# Number of iterations and burn-in for Bayesian models
nIter <- 30000; burnIn <- 2000
#--------------------------------------------------------
# 1. Single environment (within-environment) model
#--------------------------------------------------------
ETA <- list(G=list(V=eigen_G$vectors,d=eigen_G$values,model='RKHS'))
for(env in 1:nEnv){
fm <-BGLR(y=Y[,env],ETA=ETA,nIter=nIter,burnIn=burnIn)
outVAR[env+1,1] <- fm$ETA[[1]]$varU
outVAR[env+4,1] <- fm$varE
}
#--------------------------------------------------------
# 2. Across-environments model
#--------------------------------------------------------
ETA <- list(list(~envID-1,model="FIXED"))
ETA[[2]] <- list(V=eigen_G0$vectors,d=eigen_G0$values,model='RKHS')
# Model Fitting
fm <- BGLR(y=y,ETA=ETA,nIter=nIter,burnIn=burnIn)
outVAR[1,2] <- fm$ETA[[2]]$varU
outVAR[(1:nEnv)+4,2] <- fm$varE
#--------------------------------------------------------
# 3. MxE interaction model
#--------------------------------------------------------
ETA <- list(list(~envID-1,model="FIXED"))
ETA[[2]] <- list(V=eigen_G0$vectors,d=eigen_G0$values,model='RKHS')
# Adding interaction terms
for(env in 1:nEnv){
eigen_G1 <- MxE_eigen[[env]]
ETA[[(env+2)]] <- list(V=eigen_G1$vectors,d=eigen_G1$values,model='RKHS')
}
# Model Fitting
fm <- BGLR(y=y,ETA=ETA,nIter=nIter,burnIn=burnIn)
outVAR[1,3] <- fm$ETA[[2]]$varU
for(env in 1:nEnv) outVAR[env+1,3] <- fm$ETA[[env+2]]$varU
outVAR[(1:nEnv)+4,3] <- fm$varE
#--------------------------------------------------------
# 4. Reaction-Norm model
#--------------------------------------------------------
ETA <- list(list(~envID-1,model="FIXED"))
ETA[[2]] <- list(V=eigen_G0$vectors,d=eigen_G0$values,model='RKHS')
ETA[[3]] <- list(V=eigen_GE$vectors,d=eigen_GE$values,model="RKHS")
# Model Fitting
fm <- BGLR(y=y,ETA=ETA,nIter=nIter,burnIn=burnIn)
outVAR[1,4] <- fm$ETA[[2]]$varU
outVAR[(1:nEnv)+1,4] <- fm$ETA[[3]]$varU
outVAR[(1:nEnv)+4,4] <- fm$varE
outVAR
# Save results
write.table(outVAR,file="multiEnvironment/varComps.csv",sep=",",row.names=F)
The following table is the output of the code above for nIter=30000
and burnIn=2000
.
Using a GBLUP approach, the prediction power of the multi-environment models (MxE and Reaction Norm) will be compared with that that ignores GxE (across-environment) and with the GBLUP model fitted within environment.
After running the 'data preparation' part, it can be chosen either to perform CV1 or CV2 aproaches
- Cross Validation 1 (CV1)
Code below will generate a matrix YNA containing "NA" values for the entries corresponding to the TST set mimicing the CV1 prediction problem. It generates a 'list' with 'm' matrices containing the TRN-TST partitions
setwd("/mnt/home/lopezcru/GS")
rm(list=ls())
#=========================================================
# User specifications
#=========================================================
# Number of replicates
m <- 100
# Percentage of the data assigned to Testing set
percTST <- 0.3
#=========================================================
# Load data
load("multiEnvironment/prepData_multi.RData")
n <- nrow(Y)
# Creation of seed for repeated randomizations
set.seed(123)
seeds <- round(seq(1E3,1E6,length=m))
nTST <- round(percTST*n)
YNA <- vector("list",m)
for(k in 1:m)
{
set.seed(seeds[k])
indexTST <- sample(1:n,size=nTST,replace=FALSE)
YNA0 <- Y
YNA0[indexTST,] <- NA
YNA[[k]] <- YNA0
}
# Save YNA matrix
save(YNA,file="multiEnvironment/YNA_CV1_multiEnv.RData")
- Cross Validation 2 (CV2)
Code below will generate a matrix YNA containing "NA" values for the entries corresponding to the TST set mimicing the CV2 prediction problem. It generates a 'list' with 'm' matrices containing the TRN-TST partitions
setwd("/mnt/home/lopezcru/GS")
rm(list=ls())
#=========================================================
# User specifications
#=========================================================
# Number of replicates
m <- 100
# Percentage of the data assigned to Testing set
percTST <- 0.3
#=========================================================
# Load data
load("multiEnvironment/prepData_multi.RData")
n <- nrow(Y)
nEnv <- ncol(Y)
# Creation of seed for repeated randomizations
set.seed(123)
seeds <- round(seq(1E3,1E6,length=m))
nTST <- round(percTST*n)
YNA <- vector("list",m)
nNA <- nEnv*nTST
for(k in 1:m)
{
set.seed(seeds[k])
YNA0 <- Y
if(nNA<n){ indexNA <- sample(1:n,nNA,replace=FALSE) }
if(nNA>=n){
nRep <- floor(nNA/n)
remain <- sample(1:n,nNA%%n,replace=FALSE)
a0 <- sample(1:n,n,replace=FALSE)
indexNA <- rep(a0,nRep)
if(length(remain)>0){
a1 <- floor(length(indexNA)/nTST)*nTST
a2 <- nNA - a1 - length(remain)
bb <- sample(a0[!a0%in%remain],a2,replace=FALSE)
noInIndexNA <- c(rep(a0,nRep-1),a0[!a0%in%bb])
indexNA <- c(noInIndexNA,bb,remain)
}
}
indexEnv <- rep(1:nEnv,each=nTST)
for(j in 1:nEnv) YNA0[indexNA[indexEnv==j],j] <- NA
YNA[[k]] <- YNA0
}
# Save YNA matrix
save(YNA,file="multiEnvironment/YNA_CV2_multiEnv.RData")
After running the code to generate partitions for either CV1 or CV2 scenarios, the following script (fitModels_multi.R) can be run to fit the models repeatealy for all partitions. In all multi-environment models, main effect of 'environment' will be regarded as fixed effect.
The code runs a single partition for each model either for CV1 or CV2. These specifications need to be passed in variables mod
, CV
, and part
.
setwd("/mnt/home/lopezcru/GS")
rm(list=ls())
library(BGLR)
#=========================================================
# User specifications
#=========================================================
# Choose one model. 1: single; 2:across; 3:MxE; 4:R-Norm
mod <- 4
# Type of CV. 1:CV1; 2:CV2
CV <- 1
# Partition number
part <- 1
#=========================================================
# Read arguments passed from command line
args=(commandArgs(TRUE))
if(length(args)==0){
cat('No args provided',"\n")
}else{
for(i in 1:length(args)){
eval(parse(text=args[[i]]))
}
}
# Load data
load("multiEnvironment/prepData_multi.RData")
load(paste0("multiEnvironment/YNA_CV",CV,"_multiEnv.RData"))
n <- nrow(Y); nEnv <- ncol(Y)
# Models
models <- c("Single","Across","MxE","R-Norm")
model <- models[mod]
# Number of iterations and burn-in for Bayesian models
nIter <- 30000; burnIn <- 2000
YNA0 <- YNA[[part]]
yNA <- as.vector(YNA0)
#--------------------------------------------------------
# 1. Single environment (within-environment) model
#--------------------------------------------------------
if(model=="Single")
{
YHat <- matrix(NA,nrow=nrow(Y),ncol=ncol(Y))
ETA <- list(G=list(V=eigen_G$vectors,d=eigen_G$values,model='RKHS'))
for(env in 1:nEnv){
fm <-BGLR(y=YNA0[,env],ETA=ETA,nIter=nIter,burnIn=burnIn)
YHat[,env] <- fm$yHat
}
}
#--------------------------------------------------------
# 2. Across-environments model
#--------------------------------------------------------
if(model=="Across")
{
ETA <- list(list(~envID-1,model="FIXED"))
ETA[[2]] <- list(V=eigen_G0$vectors,d=eigen_G0$values,model='RKHS')
# Model Fitting
fm <- BGLR(y=yNA,ETA=ETA,nIter=nIter,burnIn=burnIn)
YHat <- matrix(fm$yHat,ncol=nEnv)
}
#--------------------------------------------------------
# 3. MxE interaction model
#--------------------------------------------------------
if(model=="MxE")
{
ETA <- list(list(~envID-1,model="FIXED"))
ETA[[2]] <- list(V=eigen_G0$vectors,d=eigen_G0$values,model='RKHS')
# Adding interaction terms
for(env in 1:nEnv){
eigen_G1 <- MxE_eigen[[env]]
ETA[[(env+2)]] <- list(V=eigen_G1$vectors,d=eigen_G1$values,model='RKHS')
}
# Model Fitting
fm <- BGLR(y=yNA,ETA=ETA,nIter=nIter,burnIn=burnIn)
YHat <- matrix(fm$yHat,ncol=nEnv)
}
#--------------------------------------------------------
# 4. Reaction-Norm model
#--------------------------------------------------------
if(model=="R-Norm")
{
ETA <- list(list(~envID-1,model="FIXED"))
ETA[[2]] <- list(V=eigen_G0$vectors,d=eigen_G0$values,model='RKHS')
ETA[[3]] <- list(V=eigen_GE$vectors,d=eigen_GE$values,model="RKHS")
# Model Fitting
fm <- BGLR(y=yNA,ETA=ETA,nIter=nIter,burnIn=burnIn)
YHat <- matrix(fm$yHat,ncol=nEnv)
}
# Save results
outfolder <- paste0("multiEnvironment/CV",CV,"/",model)
if(!file.exists(outfolder)) dir.create(outfolder,recursive=T)
save(YHat,file=paste0(outfolder,"/outPRED_multiEnv_partition_",part,".RData"))
Code above will run a single combination of partition-model-CV, thus when running, for instance, several models for both CV1 and CV2, some parallelzation of jobs is needed for speeding of computation. The bash code called run_jobs_multi.sh will submit many jobs depending of the core capacity of the computer. Jobs will be sent by chunks whose size is specified in variable nb
(for instance, nb=10
will run 10 jobs at the time). After jobs in the chunk are done, another chunk will be submited to be run. Variables seq1=2
, seq2=4
, and seq3=100
specify the 2 CV types, 4 models, and 100 partitions, respectiely.
The bash shell script needs to be saved in the same directory as the R script 'fitModels_multi.R' and it can be run from command line as
sh run_jobs_multi.sh &
The code below will retrieve results for all models fitted previously showing the within-environment correlation for all fitted models
setwd("/mnt/home/lopezcru/GS")
rm(list=ls())
library(ggplot2)
library(reshape)
#=========================================================
# User specifications
#=========================================================
# Type of CV. 1:CV1; 2:CV2
CV <- 2
#=========================================================
models <- c("Single","Across","MxE","R-Norm")
# Load data
load("multiEnvironment/prepData_multi.RData")
load(paste0("multiEnvironment/YNA_CV",CV,"_multiEnv.RData"))
# Calculate within-environment correlation
outCOR <- vector("list",length(models))
names(outCOR) <- models
for(mod in seq_along(models))
{
outcor <- c()
for(part in 1:length(YNA)){
filename <- paste0("multiEnvironment/CV",CV,"/",models[mod],"/outPRED_multiEnv_partition_",part,".RData")
if(file.exists(filename))
{
load(filename)
YNA0 <- YNA[[part]]
tmp <- rep(NA,ncol(YNA0))
for(env in 1:ncol(YNA0)){
indexTST <- which(is.na(YNA0[,env]))
tmp[env] <- cor(Y[indexTST,env],YHat[indexTST,env])
}
outcor <- rbind(outcor,tmp)
}
}
colnames(outcor) <- paste0("Env ",colnames(YNA0))
rownames(outcor) <- NULL
outcor <- data.frame(model=models[mod],outcor)
outCOR[[mod]] <- outcor
}
outCOR <- outCOR[!sapply(outCOR,is.null)]
# Calculate means and SD's
(means <- t(do.call("rbind",lapply(outCOR,function(x)apply(x[,-1],2,mean)))))
(sds <- t(do.call("rbind",lapply(outCOR,function(x)apply(x[,-1],2,sd)))))
write.csv(rbind(means,colnames(sds),sds),file=paste0("multiEnvironment/Accuracy_avg_CV",CV,"_multiEnv.csv"))
toplot <- do.call("rbind",lapply(outCOR,function(x)melt(x,id="model")))
png(paste0("multiEnvironment/Accuracy_distn_CV",CV,"_multiEnv.png"),height=350)
ggplot(toplot,aes(x=model,y=value,fill=variable)) + geom_boxplot()+
labs(fill="Env",y="Accuracy",title=paste0("Correlation between observed and predicted values. CV",CV))
dev.off()
Tables below are the results of running 100 partitions with nIter=30000
and burnIn=2000
. The mean and standard deviation (in parenthesis) across partitions are presented.
Cross Validation 1. CV1
Single-Env | Across-Env | MxE | RNorm | |
---|---|---|---|---|
Env 2 | 0.485(0.049) | 0.441(0.052) | 0.460(0.050) | 0.461(0.049) |
Env 4 | 0.377(0.055) | 0.395(0.053) | 0.382(0.055) | 0.382(0.055) |
Env 5 | 0.441(0.056) | 0.382(0.057) | 0.412(0.054) | 0.409(0.055) |
Cross Validation 2. CV2
Single-Env | Across-Env | MxE | RNorm | |
---|---|---|---|---|
Env 2 | 0.485(0.049) | 0.629(0.032) | 0.647(0.033) | 0.642(0.033) |
Env 4 | 0.375(0.058) | 0.602(0.042) | 0.591(0.043) | 0.585(0.044) |
Env 5 | 0.442(0.048) | 0.493(0.045) | 0.529(0.042) | 0.528(0.043) |
- Jarquín, D., Crossa, J., Lacaze, X., Du Cheyron, P., Daucourt, J., Lorgeou, J., … de los Campos, G. (2014). A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theoretical and Applied Genetics, 127(3), 595–607.
- Lopez-Cruz, M., Crossa, J., Bonnett, D., Dreisigacker, S., Poland, J., Jannink, J.-L., … de los Campos, G. (2015). Increased Prediction Accuracy in Wheat Breeding Trials Using a Marker × Environment Interaction Genomic Selection Model. G3: Genes, Genomes, Genetics, 5(4), 569–582.