-
Issue about UKBB imputation:
- check whether "raw genotypes" were imputed in "imputed genotypes"
- see how imputation can be done via the same UKBB procedure
- make sure all the wholes are filled in the "imputed genotypes"
- for holes in "raw genotypes", can treat them as empty genotypes, impute and then retain the holes
- check whether "raw genotypes" were imputed in "imputed genotypes"
-
true SNP effect size correlation—we coined the term "genetic effect correlation" to refer to it— is our target of estimation
-
Empirical correlation from GWAS effect size estimate should be a most direct estimate of "genetic correlation", but LD will hinge the estimation. One way is to disentangle "GWAS effect size correlation estimate" from LD, the other way is to use external biological knowledge or information to boost the power of estimation
- How do we decompose the "genetic effect correlation"? Let's say two SNP has strong effect correlation, which means that they are functionally similar across all the traits: e.g. affecting the same gene, having similar function on cell type. Further understanding is that two SNPs not only show co-causality behavior, but also have similar pattern for directionality of effect——either beneficial or harmful simultaneously. It looks like a two step procedure.
-
Genetic effect correlation is SNP-wise, meaning that the effect size of two SNPs is assumed to be correlated for the same trait, instead of being treated IID. Its relationship to genetic correlation is that one is column wise correlation , the other is row-wise correlation of the effect size matrix
$B_{q\times p}$ (q=trait number, p=SNP number) - There are two interpretations for co-causality: we need to construct a binary true causality matrix
$\pi_{q\times p}$ for q traits and p SNPs, and correspondingly a co-causality matrix$C_{p\times p}$ . The first is that two columns, respectively referring to SNP 1 and SNP 2, has correlation, so that$C_{1,2}=\text{P(SNP 1 is causal, SNP B is causal)-P(SNP 1 is causal)P(SNP 2 is causal)}=cov(\pi_{.1},\pi_{.2})\neq 0$ . The second is that two rows, respectively as trait A and trait B, has more overlapping level of causal SNPs than that is expected by random. Co-causality only means that two SNPs is functionally related, so that they are either causal simultaneously for this trait or non-causal for the other trait. The directionality of effect size is not yet under consideration. We can say that if we denote any non-zero effect size as one, then the "genetic effect correlation" is exactly the co-causality between two SNPs.
Find a way to estimate
Assuming joint normality of effect size, i.e.
Substituting the MVN assumption to some multivariate point-normal assumption, by introducing a co-causality matrix
Then find the analytical solution for the conditional expectation. If unable, find a Bayesian solution and a MCMC implementation.
Suppose we have used some penalized regression method like LASSO/Lassosum or regularized prior Bayesian method like FINEMAP to estimated the joint effect size. But two issues with these method are that: first, they are biased, second, they are underpowered, because their penalty/regularization/sparsity assumption tends to shrink small non-zero effects to zeros. Therefore, we may believe that the non-zero effects given by the methods are highly credible, but zeros are not credible enough. This scenario is similar to that "drop-out" problem in single-cell transcriptomics.
Graph Convolution Network is a learning algorithm designed for this kind of semi-supervised problem with some "missing labels". GCN is a branch under Graph Neural Network. GNN, based on graph embedding that maps nodes into d-dimensional Euclidean Space
- single cell sequencing "imputation":
- features=binary single-cell transcription status matrix for all genes across all cell samples
- known labels=a gene's 1s in the feature, unknown = a gene's 0s in the feature
- Or, known labels=a cell's 1s in the feature, unknown = a cell's 0s in the feature
- adjacent matrix: (from database) GTEx tissue specific co-expression, PPI
- gene expression (high or low) binary prediction:
- features=quantitative tissue expression level for all genes across all patients
$N\times P$ - known labels = a gene's binary expression level for this tissue (e.g. this gene is over/under-expressed in liver than other tissues) across part of the patients
$N_1\times1, N_1\leq N$ ; unknown labels =that are unmeasured for other patients - Or, known labels = a patient's binary expression level for this tissue across part of the genes
$P_1\times1, P_1<P$ ; unknown labels =that are unmeasured for other genes - adjacent matrix: (from database) GTEx tissue specific co-expression, PPI
$P \times P$
- features=quantitative tissue expression level for all genes across all patients
- gene priorization / disease association prediction:
- features = quantitative expression level for all genes
$P\times N$ - known labels = disease association profile for a subset of genes from database
$P_1\times q, P_1<P$ - unknown labels = disease association profile for other genes compliment to the subset
$(P-P_1)\times q$ - adjacent matrix: (from database) PPI, gene co-expression; (from data) correlation profile of the feature
- features = quantitative expression level for all genes
**GCN may be a promising option in our problem, with
Functional similarity=Co-causality+Posterior effect size correlation
Anyway, we will need to find a way to estimate
- First, only by analytical solution using only
$\hat B$ and$LD$ with some statistical assumption. - Second, only by external information, such as functional annotation for all the SNPs, and then do correlation analysis, or logistic regression. The purpose is to estimating causality matrix
$\Pi$ . - Third, some weighted average or regression integration of method of the first and the second approach's estimate.
- Fourth, as there are many gene-level interaction/functional similarity information available in database, find a way to estimate through the mediation of gene. Biologically, two SNPs located in the same gene must be co-causal (functional related), and if some genes are functionally related, we may say the SNPs within different genes are also like so. The problem will be on the SNPs that are non-genic, or can not be mapped to any gene.
Two more points are noteworthy. First, biologically, co-causality (functional relation) is easier to discover and estimate than functional similarity (genetic effect correlation). Second, a fundamental assumption is that co-causality and genetic effect correlation is evenly distributed across all traits; or rather, two SNPs shares a fixed co-causality/genetic correlation regardless of the trait category/domain we are investigating in. This is not a very strong (?) assumption as we are not saying two SNPs have the same magnitude of effect size across the same trait. Even with a fixed genetic effect correlation, two SNPs can still be simultaneously more related to one trait domain (larger
sample size = 10000
SNP number = 451 of chromosome one block one
using genotyped SNPs, some remaining holes imputed by median, then generate phenotype
causal SNP proportion = 20%
causal SNP number = 93
missing genotyped first imputed, then generate phenotype
For one trait,
with
Replicate, and generate 100 traits, by setting the causal SNPs to perfectly (100%) overlap
-
Do lasso fine mapping, using CV+1lamda.se from cv.glmnet
-
Obtain a
$\hat B_{100\times 451}$
N=N |
Null | Non-Null | ||
---|---|---|---|---|
Exclude (Not Reject) | TN=32783 | FN=4836 | 37619 | |
Include (Reject H |
FP=3017 | TP=4464 | 7481 | FD rate =0.40 |
35800 | 9300 | |||
FP rate (type1 error)=0.084 | FN rate (type 2) =0.52, POWER=0.48 |
**Conclusion: **
- the power is not bad in this scenario. FDR is not controlled.
- Knock-down method may be used to control FDR.
-
Training input:
$\hat B_{73\times 450}$ as feature, SNP1's non-zero Lasso effect estimate' as label$\hat\beta_{73\times 1}$ , a graph defined based on$\Sigma$ -
Supervised learning, fitting model weights
-
model: the paper Analysis of Gene Interaction Graphs as Prior Knowledge for Machine Learning Models : gene-graph-analysis
-
Prediction input:
$\hat B_{27\times 450}$ -
Prediction output: an updated label for SNP1's zero Lasso effect estimate
$\hat\beta_{27\times 1}$
- alter
$h^2$ and$\pi$ , therefore changing per-SNP heritability$h^2/(m\times \pi)$ , m = 451 here. - do univariate Lasso for each trait
Assume data
\ is\ estimator\ for\
\ is\ the\ ridge\ matrix for $
Here we actually assume that we only have perfectly-overlapping 2 causal SNPs for 2 traits, and we know their positions.
Assume a simple structure of 2 traits ✖ 2 SNPs for n individuals
$(\vec y^{(1)},\vec y^{(2)}){n\times 2}=X{n\times 2}B_{2\times 2}+(\vec\epsilon^{(1)},\vec \epsilon^{(2)})_{n \times 2}$
For individual i,
Assuming perfect co-causality, i.e. SNP 1 and 2 are causal/non-causal SNPs simultaneously.
Assuming perfect genetic overlapping, i.e. trait 1 and trait 2 have completely overlapping causal SNPs, i.e. SNP 1 and SNP2.
Assuming random effect model, i.e. $B=\begin{pmatrix} \beta_1^{(1)} & \beta_1^{(2)} \ \beta_2^{(1)}& \beta_2^{(2)} \end{pmatrix}=(\vec \beta^{(1)}, \vec\beta^{(2)}){trait}=(\vec \beta_1,\vec \beta_2)^T{SNP}$ is a random matrix: each column is a random vector for each trait; each row is also a random vector for a SNP.
Marginally, for SNP-wise there is genetic effect correlation
trait-wise: assuming two traits are equally heritable, i.e. identical
$(\beta_j^{(1)},\beta_j^{(2)})^T \sim N(\vec0,\begin{pmatrix} \sigma^2_g & \rho_g \ \rho_g& \sigma^2_g \end{pmatrix}),j=1,2$ (
SNP-wise
$(\beta_1^{(k)},\beta_2^{(k)})^T \sim N(\vec0,\begin{pmatrix} \sigma^2_g & \rho_s \ \rho_s& \sigma^2_g \end{pmatrix}),j=1,2$ (
The complete but complicated genetic architecture, i.e. the joint distribution for random matrix B can be modelled by a copula function
To further simplify, assuming joint normal: different trait and different SNP, then no correlation in effect size
unfinished
$(\beta_i^{(k)},\beta_j^{(l)})^T \sim N{\vec0,\begin{pmatrix} \frac{h^2_k}{m_k} & \rho_{g_{kl}}\rho_{s_{ij}} (\frac{h^2_k}{m_k} \frac{h^2_l}{m_l})^{\frac{1}{2}} \ \rho_{g_{kl}}\rho_{s_{ij}} (\frac{h^2_k}{m_k} \frac{h^2_l}{m_l})^{\frac{1}{2}} & \frac{h^2_l}{m_l} \end{pmatrix}}$
$vec(\bold{B_{11:22}})=\begin{pmatrix} \beta_1^{(1)} \ \beta_1^{(2)} \ \beta_2^{(1)}\ \beta_2^{(2)} \end{pmatrix}\sim N(\vec 0,\Sigma), \Sigma=\begin{pmatrix} \frac{h^2_1}{m_1} & \rho_{g_{12}}(\frac{h^2_1}{m_1}\frac{h^2_2}{m_2})^{\frac{1}{2}}&\rho_{s_{12} } & 0 \ \rho_{g_{12} }(\frac{h^2_1}{m_1}\frac{h^2_2}{m_2})^{\frac{1}{2}}& \frac{h^2_2}{m_2} & 0 &\rho_{s_{12} }\ \rho_{s_{12} } & 0 &\frac{h^2_1}{m_1} & \rho_{g_{12} }\ 0 &\rho_{s_{12} } & \rho_{g_{12} }&\frac{h^2_2}{m_2}\end{pmatrix}$
i.e. for two effect size,
Further assuming no G✖E interaction, and no cross trait/cross individual environmental correlation (like in GREML/LDSC model)
denote
For individual 1 and 2,
(here j and k are both SNP index)
here, for our two SNP scenario,
$A=\begin{bmatrix} \sigma^2_g(x_{11}^2+x_{12}^2)+\rho_s(2x_{11}x_{12})&\sigma^2_g(x_{11}x_{21}+x_{12}x_{22})+\rho_s(x_{11}x_{22}+x_{12}x_{21})\\sigma^2_g(x_{11}x_{21}+x_{12}x_{22})+\rho_s(x_{11}x_{22}+x_{12}x_{21}) & \sigma^2_g(x_{21}^2+x_{22}^2)+\rho_s(2x_{21}x_{22})\end{bmatrix}$
under conventional GRM model,
and genotypes have been SNP-wise normalized i.e.
$A=\sigma^2_g\begin{bmatrix} x_{11}^2+x_{12}^2&x_{11}x_{21}+x_{12}x_{22}\x_{11}x_{21}+x_{12}x_{22}& x_{21}^2+x_{22}^2\end{bmatrix}$
$=m\sigma^2_g\begin{bmatrix} \overline x_{1.}^2&\overline{ x_{1.}x_{2.}}\\overline{ x_{1.}x_{2.}}& \overline x_{2.}^2\end{bmatrix}=m\sigma^2_g G$
$ \overline x_{i.}^2\approx1$ if m is large enough.
This approximation is considered to be accurate when the trait of polygenic enough, i.e. when the number of causal SNPs
$=XE\begin{pmatrix} \beta_1^{(1)}\beta_1^{(2)} & \beta_1^{(1)}\beta_2^{(2)} \ \beta_2^{(1)}\beta_1^{(2)}& \beta_2^{(1)}\beta_2^{(2)} \end{pmatrix}X^T+\sigma^2_eI$
denote
$S=X\begin{pmatrix}\rho_g & 0\ 0& \rho_g \end{pmatrix}X^T=m\rho_gG$
Suppose we know $\rho_s$ , it will provide a more accurate estimate of A and then a more accurate estimate of $h^2$ and $\rho_g$ , under REML
We have
denote
For individual 1 and 2,
But now m is much larger than 2.
In GCTA, suppose the we have genome-wide SNP genotype data for n individual as
Estimate GRM
where
Genotypes have been SNP-wise normalized i.e. given j,
Assuming polygenicity, then
The key is to estimate $\overline{\rho_s^{(jk)}x_{1j}x_{2k}}{j\neq k}$ and $\overline{\rho_s^{(jk)}x{1j}x_{1k}}_{j\neq k}$.
There are in total m(m-1)/2
Assuming a prior $\rho_s^{(jk)}\sim N(\mu_s,\sigma^2_s) $ for different j and k
Further assuming this distribution is independent from that of the LD of genotypes (key assumption),
$\overline{\rho_s^{(jk)}x_{1j}x_{1k}}{j\neq k}\approx E(\rho_s^{(jk)})E(x{1j}x_{1k})\approx\mu_s\mu_r$, where
$\overline{\rho_s^{(jk)}x_{1j}x_{2k}}{j\neq k}\approx E(\rho_s^{(jk)})E(x{1j}x_{2k})\approx\mu_s\mu_rs_{12}$,
Thus the new covariance matrix
Is it possible to estimate
It is only possible when we have multiple trait information, like GWAS for multiple correlated traits.
Denote the non-zero entries of
We define:
- SNP-wise co-causality
$\rho_c$ as the row-wise correlation of$\Pi_{p\times r}$
we have
empirically,
It is very easy to calculate. Supposing there are 100 traits, SNP1 and SNP2 are both causal SNPs for the first 5 traits, and non-causal for the rest. Then
$\rho_c=(0.05-0.050.05)/(0.05^20.95^2)=1$, which is perfect co-causality.
If two SNPs are functionally independent (in terms of causality),
then
- trait-wise genetic overlapping as the column-wise correlation of
$\Pi_{p\times r}$
Similarly, we have
for a unspecified trait, we model SNP1 and SNP2's joint effect size as (
and their causality status
Assuming marginal effect size distribution for one SNP is of point-normal shape regardless of the trait,
For trait k, per causal SNP heritability
heritability divided by the number of causal SNPs (=No. of SNPs*polygenicity), which measure the degree of micro-effect, i.e."infinitesmality"
denote
then
Denote
Suppose we know from biology functional information that these two SNPs are functionally closely related, like within the same gene region, then we may determine
If we also know the pleiotropy level i.e.
$(\beta_1, \beta_2)^T\sim {\vec0,\begin{pmatrix}p_1\overline {h^2_m} & p_{12}\rho_s \ p_{12}\rho_s & p_2\overline{h^2_m}\end{pmatrix}}$
Problems
- ignoring other responses (potentially useful information)
- high LD (co-linearity)
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
Efron et al. (2004) proposed least angle regression selection (LARS)
-
penality:
for trait l,
do it trait by trait
- realization: R package glmnet or lars
- penalty form
let
-
realization
R package lsgl : set
$\alpha=1$
Yuan & Lin (2007) http://www.columbia.edu/~my2550/papers/glasso.final.pdf
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
-
general penalty: J groups of features
(1).assets\image-20200416230339788.png)
X
jis an n×pjmatrix corresponding to the jth factor and βjis a coefficient vector of size pj, j=1,...,J. -
This is intermediate between the l1-penalty that is used in the lasso and the l2-penalty that is used in ridge regression
-
most common case
$ p_1=...=p_J =1$
it is reduced to normal Lasso
- most used norm (that is used in this paper):
$K_j=p_jI_{p_j}$ , a weighted L-2 norm for each predictor group
This procedure acts like the lasso at the group level: depending on λ, an entire group of predictors may drop out of the model.
- realization: R package gglasso or grplasso
https://statweb.stanford.edu/~tibs/ftp/sparse-grlasso.pdf
- penalty: L groups of features
β = (β1,β2,...βl) is the entire parameter vector. For notational simplicity we omit the weights
- realization: R package SGL
K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. arXiv preprint arXiv:0903.1468, 2009.
-
- penalty: M features, T tasks
- Assuming the set of relevant predictor variables is the same across the different equations (structured sparsity assumption)
- data matrix for a task can be different
- don't define groups:
$p_1=...=p_M=1$ - general mixed L-2,p norm
- used in the paper: mixed L-2,1 norm
(1).assets\image-20200416135306505-1587048586087.png)
-
extension: grouping features
there are M groups,
$p_1+...+p_M=p$ we still used L-2,1 norm, but
$\beta_j$ will be a T by pjmatrix rather than a vector -
**It results in the selection of the same predictors across all responses. If a feature or a group of features is selected for one response, then it is selected for all responses. **
-
realization: R package lsgl
with penalties of the form
let
By default,
$H=\alpha | \boldsymbol{B} |{1,1}+(1-\alpha) \sum{j = 1}^p \sqrt r | \boldsymbol{\beta}_j |_2$
let
Y. Li, B. Nan and J. Zhu (2015) Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics. DOI: 10.1111/biom.12292
Y column groups, X row groups, intersection block groups
- penalty
the union of all groups inG does not need to contain all the elements of B,
For notational simplicity, special case is λjk = λ for all j, k.
-
note about matrix norm:
- mixed norm/ L-p,q norm
(1).assets\image-20200417011929795.png)
the most used one is L-2,1 norm
(1).assets\image-20200417012048855.png)
also the L-2,2 norm, i.e. Frobenius norm
(1).assets\image-20200417012231523.png)
- induced norm
(1).assets\image-20200417012342684.png)
the most used is spectrum norm, p-2 induced norm,
(1).assets\image-20200417012513642.png)
where
normally we have (1).assets\image-20200417012614818.png)
-
"Entrywise" norms
(1).assets\image-20200417012832227.png)
here the author used
$|B_g|_2$ as the p-2 "entrywise" norm, i.e. L-2,2 norm for the block$B_g$ ; but in general it is still kind of L-2,1 norm for$B$ , because of the sum across all groups. ("L2 plus sum" in the vector scenario)
- realization: R package MSGLasso
Suppose that the available features can be organized into overlapping subsets according to a notion of similarity, and that the features useful in one task are similar, but not necessarily identical, to those best suited for other tasks. In other words, a feature that is useful for one task suggests that the subset it belongs to may contain the features useful in other tasks
-
need to define M groups of features according to similarity
with a data matrix Φ
t∈ R^n×p^ for each task t ∈ {1,2,...,T}. Can be the same oneX here is the effect size matrix
-
Penality

- realization: Matlab https://github.com/UWKCL/soslasso
- defining groups/cluster of responses and permuting, to find best cluster
- within a group of traits, effect sizes are shared
Q, the total number of clusters
Let D = (D1,...,DQ) be a partition of the set{1,...,r}. (r traits)
Breiman and Friedman (1997)
- Do not define group structure, but use data-driven ones (like in MCEN)
- Define trait+SNP group structure, by clustering traits based on genetic correlation, and clustering SNPs based on similarity of functional annotations
- Do not even use group structure
-
$\rho_o$ =1,$|\rho_c|=1$ : all traits share the same set of causal SNPsActually here the |$\rho$| can not be calculated within groups, as there is no variance. There is just perfect covariance.
Fig.a
- 3 by 3
-
4 by 4
can be arbitrary G
sby GTwith "overlapping"A group is a row-column block-shaped intersection
sample size n = 1000
SNP number p = 451 of chromosome one block one
using genotyped SNPs, some remaining holes imputed by median, then generate phenotype
Assuming perfect genetic overlapping like fig.a
causal SNP proportion
causal SNP number
heritability interval
**In each simulation, select a interval for heritability→uniform random number→heritability for different trait **
We follow the I-IV scenarios to generate joint effect size matrix
trait number r=100
Note: For each k,
Plot power spectrum against per causal SNP heritability for different J kinds of penalized regression methods
changing
For a fixed interval of heritability, change
Using mean(h^2^)/$\pi_C$ as x-axis.
Plot J methods on one plot, with J ✖ 10 points and J curves
Make four plots, annotated by average h^2^
- Standard Lasso per trait
- Multivariate Lasso for all traits
- Multivariate Group Lasso: traits as one group, each SNP as a group--grouping SNPs, i.e. removing or selecting a SNP for all traits at the same time
- Sparse Multivariate Group Lasso: traits as one group, each SNP as a group--grouping SNPs, but adding some element-wise sparsity
generating each element of one by one, following normal distribution N(0,h^(k)2^/m)
- Standard Lasso
(1).assets\image-20200417153638735.png)
(1).assets\image-20200417153726813.png)
- MSGLasso
(1).assets\image-20200417154002746.png)
(1).assets\image-20200417153822759.png)
- Standard Lasso
(1).assets\image-20200417154918796.png)
(1).assets\image-20200417154834836.png)
- MLasso
(1).assets\image-20200417160847017.png)
(1).assets\image-20200417160909337.png)
- Going ahead to UKBB real data: need to analyze all individuals
- Chromosome by chromosome is better than LD block by block
- Possible big data processing methods:
- big data module in R
- C++ lang
- recursive incremental Bayesian learning
- recursive Lasso
- parallel running on server
- Multivariate penalized regression should be better than the univariate one, so the hyperparameter tuning is critical and challenging: CV, BIC, AIC, quite many criteria. After fine tuning, it should have better performance. Our goal is not to compare different methods under different simulation scenarios, but to determine whether a method works for certain scenario, and most importantly, works for the real data.
- L2,0 norm may be used in replace of L2,1 norm in sparse group lasso.
- It is very necessary to review the multivariate analysis in genetics (like PHEN, M-SKAT, ACAT) and the application of penalized regression in genetics.
- We can specify SNP group by functional annotations clustering and traits groups by genetic correlation clustering. Or most extreme: each/all as a group
- Using different
$\lambda$ for different traits indicates that they have different degrees of sparsity, i.e. polygenicity. - Computational intensity is a concern for Bayesian methods. So is it possible to formulate some Bayesian methods (like point-mixture normal) and then develop multivariate penalized regression optimization methods for it?
- Cross-validation may cause problem in lasso solution, which may have good prediction result but bad variable selection result, deviated from the purpose of fine mapping. BIC etc. may be good alternatives.
- The idea of a 2 stage procedure where stage 1 uses multivariate ridge regression to combine information across GWAS on multiple correlated phenotypes, and stage 2 used thresholding to introduce sparsity.
- Maybe do multivariate Lasso in a pairwise/bivariate manner (iteratively) for a whole chromosome, rather than in a Block-wise manner?
- Comparison should be made based on the same baseline, like using the same FWER/FDR control criterion. Then it is meaningful to compare the power and precision of different methods.
- A mimicking of Knockoff method to detect significant results: simulaye
- 4cpu+100gb running memory
- lsgl R package, no parallel
- hyperparameter per fitting
0<$\gamma$ < 1 ---- sparse group lasso
# SNP | # individuals | # traits | data loading time (min) | approximate running time per fitting (min) | |
---|---|---|---|---|---|
0.5,0.1 | 450 (block1) | 200,000 | 10 | <1 | ~15 |
0.5,0.1 | 450 | 200,000 | 50 | <1 | NA |
0.5,0.1 | 450 | 200,000 | 100 | <1 | NA |
0,0.1 | 12113 (chr22) | 200,000 | 1 | ~10 | ~5 |
0.5,0.1 | 12113 (chr22) | 200,000 | 10 | ~10 | NA |
From the preliminary analysis, it seems univariate Lasso can only handle ~10 traits simultaneously for several hundred of SNPs. Or, fewer traits, but more SNPs. In terms of CV, it cannot even handle 200k individual + one chromosome using CV-tuned
The computation difficulty of CV was also indicated in a published paper (Cule et al 2012).
Genotypes: Chromosome 22 (12113 SNPs) for 200k individuals
10 Phenotypes:
-
Quantitative: height, BMI, adjusted systolic blood pressure, adjusted diastolic blood pressure,
-
Binary: diabetes, heart attack, stroke, high BP, angina,
-
Ordinal heart problems
Some of them are highly correlated
Case ratio for the 5 binary traits:
[1] 0.048
[2] 0.032
[3] 0.023
[4] 0.015
[5] 0.274
We cannot use CV directly. We use CV on a smaller sample size like 20k, and then we get the optimal lambda. Here we use lambda.1se (largest value of lambda such that error is within 1 standard error of the minimum MSE), or we can use lambda.min. And then we fixed a Lasso on this single lambda, on a large sample size like 200k.
observation: lambda.min is always the largest one among all lambdas in CV
The number of variables selected by Lasso depends on
Treating binary variable as quantitative ones (using Gaussian instead of Poisson/binomial model in glmnet) renders computation plausible.
Also do CV on a smaller sample size of 20k
ridge--- 0<$\alpha$<1 ---lasso
less penalty, more variables selected
We cannot even do CV on 20k individuals for ridge. At this stage, we just arbitrarily give one.
How about effect size thresholding?
a) thresholding on |$\beta$|>0.01 --- variance explained by this SNP at least 10^-4^=0.01^2^ ~~ 1% of 10M SNPs = 100k causal SNPs ---- ~ 1/100k = 10^-5^ each SNP's variance explained
b) thresholding on |$\beta$|>0.0033--- variance explained by this SNP ~ 10^-5^=0.0033^2^
c) thresholding on |$\beta$|>0.001--- variance explained by this SNP ~ 10^-6^=0.001^2^
glmnet did not work for ridge in terms of CV on 20k samples
Traits | per SNP GWAS (p value<5*10^-8^) without LD clumping | per SNP GWAS with LD clumping (r^2^ = 0.2) | per SNP GWAS (p value<10^-7^) without LD clumping | Univariate Lasso (two-step) | Univariate Elastic Net ( |
Univariate Ridge (with thresholding |
---|---|---|---|---|---|---|
height | 2 | 2 | 4 | 13 | 57 | 211 250 308 |
BMI | 0 | 0 | 2 | 15 | 45 | 260 325 410 |
adjusted systolic blood pressure | 0 | 0 | 0 | 12 | 58 | 186 125 260 |
adjusted diastolic blood pressure | 2 | 1 | 3 | 13 | 47 | 220 246 276 |
diabetes | 0 | 0 | 0 | 3 | 170 | 0 17 256 |
heart attack | 2 | 2 | 3 | 2 | 176 | 0 20 333 |
stroke | 1 | 1 | 1 | NA | NA | NA |
high BP | 5 | 5 | 5 | NA | NA | NA |
angina | 0 | 0 | 0 | NA | NA | NA |
heart problems | 0 | 0 | 0 | 2 | 211 | 149 383 587 |
Note: NA due to some coding error?
Training=160k, Testing = 40k, other settings are the same.
Traits | Univariate Lasso (two-step) | Univariate Elastic Net ( |
Univariate Ridge (with thresholding) |
---|---|---|---|
height | |||
BMI | |||
adjusted systolic blood pressure |
Let's look at diabetes and heart attack:
1. Are the smaller subset of selected SNPs included in larger subset of selected SNPs?
The 3 and 2 SNPs selected for diabetes and heart attack by Lasso (CV) are also included in the 256 SNPs selected by Ridge (
Using selected = 0/3/253 (vs 0/17/256), now the 3 SNPs selected by Lasso are not overlapping with the "3", but are included in the "253".
【0/3/253 corresponds to
2. However, the 3 Lasso SNPs are not even included in those selected by
**That means the Lasso SNPs only have a moderate level of effect size magnitude. **
**3. Are the Lasso 3 and Ridge 3 have very small p-value in marginal testing? i.e. **Are the GWAS significant SNPs also selected by penalized regressions?
No! Their p-values are quite large. That means the GWAS hits are not selected by Lasso nor Ridge.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-372
Distribution of it under homogenous normal error (GLM assumption):
Correspondingly, a T statistic
Estimate of variance of residual error:
under
when
Main challenge: algebra of large matrix
- Background setting
-
R package lmRidge for fitting
-
n=10k, p=1000 (from chr22) , 965 remained after MAF filtering 0.01% (exclude SNPs that are with too small MAF), phenotype=height
-
lambda=0.1 (arbitrarily given)
-
LD is troublesome for this package. Need to do some kind of LD clumping~~
After that for all remaining SNPs LD<=threshold=0.9, and 884 SNPs pass it.
-
Write an R function to calculate p value, and as v is large, we approximately have T
N(0,1) under H0~. -
Results: p-values for all SNPs that are fitted simultaneously by a Ridge model
Computation only takes several minutes.
min(pval) [1] 0.0001160818
summary(pval)
Min. 1st Qu. Median Mean 3rd Qu. Max 0.0001161 0.2383228 0.4787446 0.4910669 0.7479664 0.9977548 A little bit smaller than uniform distribution?
Comment:
The results show consistency as it is not far from expected null hypothesis. This means that the statistical modelling procedure is not wrong! Also QQ plot shows that it has some power. The next step is to ensure the testing is powerful enough. And I should further investigate the p-value distribution given different lambda and power/FPR analysis using different p-value threshold in a simulation study.
- simple thresholding on p value by different cutoffs. This can be studied in simulations.
- p-value+stability selection: in each of the bootstrap-like sub-sample, use p-value thresholding to select variables. Find the most stable subset.
-
various pre-screening approaches
- simulate null phenotypes capturing LD information
- obtain null p-value and
$\hat \beta$ distribution for all SNPs. Due to LD, some of them are correlated (similar) - thresholding on the difference between p value from real data and p value from null data
- Hoerl, Kennard & Baldwin 1975 paper
m = p = the number of predictors
This is not plausible due to ill-conditioning (high LD)---we cannot even get OLS.
rationale of this approach:
(this can be shown by differentiation)
2. Heritability-based extension of this approach:
Under certain conditions (orthonormal X and standardized Y):
For genotype matrix X, this means standardization plus no LD!
- If we further assume (genetic architecture, i.e. true model as) complete polygenicity so that all these p SNPs included follow a effect size distribution
$N(0,\sigma^2_g)$ , where$\sigma^2_g=h^2/m$ by CLT when m is large enough. (under complete polygenicity,No. of causal SNPs used=p, total No. of causal SNPs=total No. of SNPs=m)
-
Or if we assume a point-normal effect size distribution as true model:
$\beta_j \sim (1-\pi)\delta+\pi N(0,\sigma^2_g)$ , where$h^2=\sigma^2_gm\pi$
The above two are the same because of assuming that the heritability are distributed evenly across genome.
If we assume that the proportion of causal SNPs in p SNPs are not global
The ratio
Does heritability-based choice of
If we use
Compare to $\lambda_{BLUP}=p*(1-h^2_{SNP})/h^2_{SNP}=p*(1-p/mh^2)/(p/mh^2)=m/h^2-p$
Exactly the same one! BLUP is the natural extension of HKB!
- Cross-validation (on a smaller sample size):
It could be computationally implausible even on a sample size of 20000 when p~ 10000.
- Bayesian perspective: Lawless & Wang 1976 paper
-
Cule et al 2011, 2012 paper2012 paper2011in Ridge R Package that is based on PCA of X, similar to HKB method but in an adaptive way so that it works for p>n.
We found in order to obtain the best predictive performance a relaxed penalty was necessary. CV to choose the parameters for the elastic net showed that that more than half of the time the alpha parameter was set to zero, which makes EN equivalent to RR [Friedman et al., 2010]. The CV choice of the parameter that controls the amount of shrinkage of the regression coefficients resulted in a large parameter being chosen, implying strong shrinkage. Together, the CV choice of EN parameters results in the equivalent of an RR with the regression coefficients shrunk close to zero. This reflects the structure of the data with many predictors of small effect. RR‐CV and RR‐k offer approximately equivalent predictive performance, but RR-k is computationally more straightforward.
-
AIC, BIC
http://www.stat.yale.edu/~mjk56/temp/bigmemory-vignette.pdf
What is the relationship between c and
In the orthonormal case, X′X=I
Therefore Lagrangian multiplier:
KKT condition for it:
c = s = constraint = sum of ridge beta estimate
where the
shinkage:
MTBLUP LDpred
R package gives threshold on
Main challenge: algebra of large matrix
- Background setting
-
R package lmRidge for fitting
-
n=10k, p=1000 (from chr22) , 965 remained after MAF filtering 0.01% (exclude SNPs that are with too small MAF), phenotype=height
-
lambda=0.1 (arbitrarily given)
-
LD is troublesome for this package. Need to do some kind of LD clumping~~
After that for all remaining SNPs LD<=threshold=0.9, and 884 SNPs pass it.
-
Write an R function to calculate p value, and as v is large, we approximately have T
N(0,1) under H0~. -
Results: p-values for all SNPs that are fitted simultaneously by a Ridge model
Computation only takes several minutes.
min(pval) [1] 0.0001160818
summary(pval)
Min. 1st Qu. Median Mean 3rd Qu. Max 0.0001161 0.2383228 0.4787446 0.4910669 0.7479664 0.9977548 A little bit smaller than uniform distribution?
Comment:
The results show consistency as it is not far from expected null hypothesis. This means that the statistical modelling procedure is not wrong! Also QQ plot shows that it has some power. The next step is to ensure the testing is powerful enough. And I should further investigate the p-value distribution given different lambda and power/FPR analysis using different p-value threshold in a simulation study.
sample size n = 10k
SNP number p = 1000 , the first 1000 SNPs of chromosome 22
using genotyped SNPs, some remaining holes imputed by median
-
964 SNPs remained after filtering those SNPs without any variation for these 10k people (exclude SNPs that are with too small MAF--too rare)
-
Strong LD could be troublesome for ridge computation. Need to do some kind of LD clumping~~
After that for all remaining SNPs LD<=threshold=0.9, and 883 SNPs pass it.
a linear model for this very small genome
trait number r=1
Note: For each k,
causal SNP proportion
causal SNP number
heritability
In total 25 simulations for a method
**In each simulation, select a
For a fixed heritability, changing
Using
penalty lambda=0.1 (arbitrarily given)
Parameters P=(h^2^,
QQ plot for E and A
QQ plot for E and A: ridge regression
QQ plot for E and A: per SNP GWAS for comparison---less powered
Tentative conclusion
- In general, higher heritability and higher per SNP heritability, better performance by ridge regression. Compared to GWAS that finds nothing under Bonferroni correction, ridge regression is much better.
- In terms of FPR, p-value thresholding after ridge regression seems to control it very well even at a non-stringent cutoff.
- When per SNP heritability > 0.01 (1%), ridge regression has considerable power (>50%) , except when heritability is as low as 0.1. And there seems to be certain "good" cutoff that can control FDR pretty well whilst retain adequate power, which lies between 10^-3^
10^-4^. (this should be related to NSNP~?) This is close to Bonferroni correction. - When per SNP heritability goes below 0.005 (0.5%), true signals may be messed up with noise, so we see that power drops quite rapidly, suggesting that for ridge regression, there might be some obvious "detectability bound" on the signal intensity (beta effect size). However, we see in some cases that when h^2^ is as high as 0.9, SNPs with per SNP h^2^ as small as 0.001~0.002 is still detectable.
- If we consider whole genome, 0.5% per SNP heritability for a h^2^ =60% trait corresponds to 120 SNPs, then true
$\pi$ =120/10^6^ ~10^-4^ ~ 0.01%, very small! That suggests polygenic disease would be challenging, from the aspect of whole-genome/chromosome wide variable selection. However, if we know that on a small region there are enriched heritability (causal SNPs), then the 1% per SNP heritability may be satisfied. A special advantage of ridge compared to per SNP GWAS is that simultaneous fitting takes LD into consideration.
BH procedure: in simulation, compare empirical BH FDR vs true FDR
empirical ones vs theoretical one
replications
recheck GWAS results
illustrate point 5
large p, at least similar n/p ratio to real data's one
calculate a single index to make the comparison easier? For example, the number of false positives when the significance p-value cutoff is such that half the true associations are significant.
scale up the simulation to real data scale (whole chromosome, n>100k)
compare lambda_formula given vs lambda CV
- similar n/p ratio to real data's one
in UKBB we have n=500k,p~630k, maybe let's assume p/n→1
Here we will use the first 1000 SNPs of chromosome 22 , so now we first use n~1k
- using genotyped SNPs
- some remaining holes imputed by median
- filtering those SNPs without any variation for these 10k people(exclude SNPs that are with too small MAF--too rare) ---> p=964
- Strong LD could be troublesome for ridge computation. Need to do some kind of LD clumping~~After that for all remaining SNPs LD<=threshold=0.9 ----> p=883
- take n=883, so p/n=1
Bingxin Zhao and Hongtu Zhu. Cross-trait prediction accuracy of high-dimensional ridge-type estimators in genome-wide association studies, 2019.
[Dobriban and Wager, 2018].
Applying the result from this paper, we will have
vs.
optimal
optimal
compare lambda: from 0.01, to formula given lambda=1/3, to CV lambda:=12, to BLUP lambda:=1000*1/3=333
black line: formula given,
red line: CV
If we look at panel B, cutoff=0.01 for p_value, giving highest power when controlling FWER<5% (panel A red line FWER>5% ) (edited)
Formula-given lambda (black line,0.25/0.75=1/3) are approximately the smallest lambda that gives similar performance as CV, given the best cutoff.
BLUP lambda will be ~ 1000*1/3=333.3
Review: the predictors we used include all the causal SNPs. So