title: Variance component estimation author: Yutaka Masuda date: September 2019 subject: "Introduction to BLUPF90 suite programs" tags: [introduction,tutorial] ...
There is a situation where the residual variances vary over conditions. This is known as heterogeneous residual variances. There are 2 types of modeling for the heterogeneous residual variances in AIREMLF90.
- The residual variance is related to a covariate. For example, the residual variance for body weight increase by age of a month. The residual variance can be described as a function of a covariate (age).
- The residual variance differs by class. For example, in a test-day random-regression model, we usually assume different residual variance in each lactation stage. The residual variances are independent of each other.
Although AIREMLF90 is designed to handle the first case, the program can also handle the second case with a trick. We will see both cases with a numerical example in REML. GIBBS3F90 (and THRGIBBS3F90) supports the second case.
In this example, we asuume the following model:
$$
y_{ijkl} = F_{i} + S_{j} +Y_{k} + u_{l} + e_{ijkl}
$$
where
In this section, we will use simulated files similar to the previous section. You can download the files from Github.
simdata2.txt
: data filesimped2.txt
: pedigree file
The pedigree file contains 3 columns: animal, sire, and dam. The data file has 12 columns as described below.
Column Item type description
1 Animal ID integer Same as in pedigree (4641 animals) 2 Sire ID integer Same as in pedigree 3 Dam ID integer Same as in pedigree 4 Weight real Not used here 5 Mu integer All 1: not used here 6 Farm integer Class effect (155 levels) 7 Sex integer Class effect (2 levels) 8 Year integer Class effect (11 levels) 9 Obs. 1 real Phenotype for trait 1 10 Obs. 2 real Phenotype for trait 2 11 Obs. 3 real Phenotype for trait 3 12 Obs. 4 real Phenotype for trait 4 13 Covariate real Related to residual variance 14 Class integer (Used in the next subsection)
Column 13 contains a real value as a covariate. In this artificial data, the residual variance is expected to be larger as the covariate becomes larger. The 14th column contains the heterogeneous-residual-variance class (3 levels), and it will be used in the next subsection.
The following parameter file is used.
DATAFILE
simdata2.txt
NUMBER_OF_TRAITS
1
NUMBER_OF_EFFECTS
4
OBSERVATION(S)
9
WEIGHT(S)
4
EFFECTS:
6 155 cross
7 2 cross
8 11 cross
1 4641 cross
RANDOM_RESIDUAL VALUES
100
RANDOM_GROUP
4
RANDOM_TYPE
add_animal
FILE
simped2.txt
(CO)VARIANCES
100
OPTION hetres_pos 13
OPTION hetres_pol 4.5 0.5
The options define the model for heterogeneous residual variances.
-
OPTION hetres_pos
= the position of covariate$x$ in the data file. Intercept$b_0$ is implicitly considered in the program. -
OPTION hetres_pol
= the starting values for$b_{0}$ and$b_{1}$ .- You should supply the values for all coefficients including intercept.
- In this example, the initial value will be
$\exp(4.5 + 0.5) = 148.4$ for$x = 1$ .
In the above case, we define only 1 regression coefficient in hetres_pos
but the program adds the
intercept. If you put 2 numbers in hetres_pos
and also put 2 initial values in hetres_pol
, the
program doesn't add the intercept. In such a case, the model doesn't contain any intercept but does
contain user-supplied covariates only.
Note that these options do not work with OPTION EM-REML
.
The current version of AIREMLF90 will tell you this limitation when the program meets this condition.
You can run AIREMLF90 with the above parameter file. The following output will be shown.
Final Estimates
Genetic variance(s) for effect 4
39.384
new R
1 -th trait: 1 -th coefficient = 4.25520168650538
1 -th trait: 2 -th coefficient = 5.886409914144559E-002
inverse of AI matrix (Sampling Variance)
16.773 -0.12514 0.77591E-02
-0.12514 0.41370E-02 -0.16778E-02
0.77591E-02 -0.16778E-02 0.10722E-02
Correlations from inverse of AI matrix
1.0000 -0.47507 0.57858E-01
-0.47507 1.0000 -0.79664
0.57858E-01 -0.79664 1.0000
SE for G
4.0955
SE for R
0.64319E-01
The output shows
One drawback of this analysis is that the calculations of the standard error for the residual variance
are not easy. The AI matrix contains information for
The heterogeneous residual variances are accounted for by class. We assume 3 levels in the class. The 14th column in the above data contains the level for each observation. AIREMLF90 doesn't directly accept the level number, and we should convert it to a different format.
Having 3 levels is equivalent to defining 3 regression coefficients without intercept. We now assume the following parameterization without intercept. $$ \sigma_e^2 = \exp(b_1 x_1 + b_2 x_2 + b_3 x_3 ) $$
If the observation has the residual variance level 1, we have
- tutorials:
simdata2a.txt
: extended data file
The first 10 rows are shown below.
1 0 0 1.00 1 67 1 1 85.0 0.0 92.1 91.0 0.46 1 1 0 0
2 0 0 0.98 1 144 1 1 115.2 103.9 94.8 90.6 1.52 2 0 1 0
3 0 0 1.04 1 92 1 1 93.0 107.4 114.9 107.0 0.00 1 1 0 0
4 0 0 0.97 1 26 1 1 84.0 91.8 93.7 107.1 0.61 1 1 0 0
5 0 0 0.96 1 83 1 1 100.9 87.9 88.9 91.5 0.55 1 1 0 0
6 0 0 0.99 1 62 1 1 69.8 71.5 69.3 0.0 0.88 1 1 0 0
7 0 0 0.97 1 40 1 1 101.2 0.0 92.1 88.6 0.27 1 1 0 0
8 0 0 1.07 1 82 1 1 77.4 72.7 84.1 105.2 1.69 2 0 1 0
9 0 0 1.01 1 71 1 1 107.8 87.0 97.8 76.0 2.89 3 0 0 1
10 0 0 0.95 1 114 1 1 94.4 100.5 103.8 89.8 1.26 2 0 1 0
The column 15, 16 and 17 contains 1 or 0. In column 15, it is 1 if the level is 1, otherwise 0. In column 16, it is 1 if the level is 2, otherwise 0. In column 17, it is 1 if the level is 3, otherwise 0. With this data file, the parameter file is as follows.
DATAFILE
simdata2a.txt
NUMBER_OF_TRAITS
1
NUMBER_OF_EFFECTS
4
OBSERVATION(S)
9
WEIGHT(S)
4
EFFECTS:
6 155 cross
7 2 cross
8 11 cross
1 4641 cross
RANDOM_RESIDUAL VALUES
100
RANDOM_GROUP
4
RANDOM_TYPE
add_animal
FILE
simped2.txt
(CO)VARIANCES
100
OPTION hetres_pos 15 16 17
OPTION hetres_pol 4.5 4.5 4.5
The results from the above parameter file are shown here.
Final Estimates
Genetic variance(s) for effect 4
39.296
new R
1 -th trait: 1 -th coefficient = 4.28231169726656
1 -th trait: 2 -th coefficient = 4.36065608258819
1 -th trait: 3 -th coefficient = 4.38963108448880
inverse of AI matrix (Sampling Variance)
16.735 -0.11982 -0.11288 -0.10685
-0.11982 0.32743E-02 0.75391E-03 0.69459E-03
-0.11288 0.75391E-03 0.31352E-02 0.67693E-03
-0.10685 0.69459E-03 0.67693E-03 0.29536E-02
Correlations from inverse of AI matrix
1.0000 -0.51184 -0.49281 -0.48059
-0.51184 1.0000 0.23530 0.22335
-0.49281 0.23530 1.0000 0.22245
-0.48059 0.22335 0.22245 1.0000
SE for G
4.0909
SE for R
0.57221E-01
The estimate for the level 1 is
The methods explained here will not work well for multiple-trait models or complicated models with many variance components.
REML methods allow statistical testing of variance components. For instance, imagine that we have to choose one of two competing models:
- model 1 not including the maternal permanent environmental effect
- model 2 including maternal permanent effects.
The Likelihood Ratio Test (LRT) checks if the extra random effect, with associated variance
component gives a better fit of the model, against not fitting it. The model without the extra random effect
is the null (
Output of AIREMLF90 under
pchisq(x-y,1,lower.tail=FALSE)/2
Classical applications of LRT in quantitative genetics include testing
I tested if genetic effects (
Under the complete model H1 (
-2logL = 825227.48
Then we run the reduced model H0 with a parameter file that will not include the genetic effect (
-2logL = 828095.614456413
then the test in R is:
pchisq(828095.614456413-825227.48,1,lower.tail=FALSE)/2
[1] 0
the p-value is so small that we cannot see the difference from 0. We can take the log, then transform to a scale of
pval=pchisq(828095.614456413-825227.48,1,lower.tail=FALSE,log.p=TRUE)/2
> pval
[1] -719.137
-log10(exp(pval))
[1] 312.3172
So, the p-value is