ldscsim
is a module written with Hail to simulate phenotypes. It is designed to test LD score regression, but is extensible to general use.
- Model descriptions
- Examples
yi = ∑j Xijβj + εi
- yi : Phenotype of individual i
- Xij : Genotype of individual i at SNP j
- βj : Effect size of SNP j
- εi : Environmental noise for individual i
The phenotype of each individual is calculating the dot product of that individual's genotypes with the SNP effects and then adding random environmental noise.
Note: Genotypes as described above are normalized such that genotypes at a given SNP have mean=0, var=1.
Relevant functions:
simulate_phenotypes()
calculate_phenotypes()
β ~ N(0, h2/M)
- h2 : SNP-based heritability of phenotype
- M : Number of SNPs in simulation
All SNP effects are drawn from a normal distribution with mean=0, variance= h2/M, where h2 is the desired heritability of the simulated trait and M is the number of SNPs in the genotype matrix given by the user.
Relevant functions:
make_betas()
multitrait_inf()
β = N(0, h2/(πM)), w/ probability π
β = 0, otherwise
- h2 : SNP-based heritability of phenotype
- M : Number of SNPs in simulation
- π : Probability of a SNP being causal
SNPs are assigned to be causal with probability π. If a SNP is causal, its effect size is drawn from the normal distribution with mean=0 and variance=h2/(πM). If a SNP is not causal it has an effect size of 0.
Relevant functions:
make_betas()
multitrait_ss()
βj = N(0, ajh2/(∑jaj))
aj = ∑C τC aCj
- h2 : SNP-based heritability of phenotype
- aj : Annotation for SNP j summed across categories
- τC : Per-SNP heritability for annotation category C
- ajC : Annotation for SNP j in category C
The effect for SNP j are drawn from a normal distribution with mean=0, variance=a[j]·h2/(Var(a)), where a[j] is the relative heritability contributed by SNP j and a is a vector of the relative heritability contributed by each SNP. a[j] is calculated by taking the linear combination across all annotation categories for SNP j, scaling each annotation by coefficient τC1/2, where τC is the per-SNP heritability for annotation category C.
Relevant functions:
make_betas()
multitrait_inf()
agg_fields()
get_coef_dict()
yi = ∑j Xijβj + ϵi + si
si = ∑k σksik
- yi : Phenotype of individual i
- Xij : Genotype of individual i at SNP j
- βj : Effect size of SNP j
- ϵi : Environmental noise for individual i
- si : Population stratification term for individual i
- σk : Square root of variance contributed by covariate k
- sik : Covariate k for individual i
After calculating the matrix product of genotypes and SNP effects, it is possible to add population stratification. Population stratification is a term added to the phenotype, which is the linear combination of covariates scaled by coefficients provided by the user.
Relevant functions:
calculate_phenotypes()
agg_fields()
get_coef_dict()
Β = (β1,...,βk)T
Β ~ Nk(0, Ω)
- βk : SNP effect sizes for trait k
- k : Number of correlated traits
- Ω : Variance-covarance matrix, k x k
SNP effects are drawn from a multivariate normal distribution with mean=0 and variance-covariance matrix defined by heritability and genetic correlation values. If the specified heritability and genetic correlation values result in a covariance matrix that is not positive semi-definite the framework will adjust the genetic correlation values to make the covariance matrix positive semi-definite.
Relevant functions:
make_betas()
multitrait_inf()
get_cov_matrix()
_nearpsd()
(βjA0, βjB0) ~ N(0, ΩSS)
ΩSS = [[1/(pTT+pTF), rg/pTT],
[rg/pTT, 1/(pTT+pFT)]]
βjA = βjA0 and βjB = βjB0, w/ probability pTT
βjA = βjA0 and βjB = 0, w/ probability pTF
βjA = 0 and βjB = βjB0, w/ probability pFT
- βjA0 : Initial effect size of SNP j for trait A
- βjB0 : Initial effect size of SNP j for trait B
- ΩSS : Spike & slab variance-covariance matrix
- pTT : Probability of a SNP being causal for both traits
- pTF : Probability of a SNP being causal for trait A but not trait B
- pFT : Probability of a SNP being causal for trait B but not trait A
- rg : Genetic correlation between traits A and B
- βjA : Final effect size of SNP j for trait A
- βjB : Final effect size of SNP j for trait B
SNP effects are drawn from bivariate normal distribution and then set to be causal or non-causal based on parameters defining probability of being causal.
Note: Correlated two-trait spike & slab model is from page 30 of the supplement of Turley et al. 2018
Relevant functions:
make_betas()
multitrait_ss()
[see docstrings for more information]
Relevant functions:
binarize()
ascertainment_bias()
simulate_phenotypes()
is the main method wrapping other methods in the package. However, all methods are self-contained and thus can be run independently.
Assume for the examples of the infinitesimal and spike & slab models that we have the following MatrixTable mt
:
>>> mt.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'rsid': str
----------------------------------------
Entry fields:
'gt': int32
----------------------------------------
Column key: ['s']
Row key: ['rsid']
----------------------------------------
mt.s
: Sample IDsmt.rsid
: SNP IDsmt.gt
: Genotypes of individuals
Simulation specifications:
- Single trait
- Infinitesimal model
- Heritability = 0.1
>>> sim = simulate_phenotypes(mt=mt,genotype=mt.gt,h2=0.1)
>>> sim.describe()
----------------------------------------
Global fields:
'ldscsim': struct {
h2: float64
}
----------------------------------------
Column fields:
's': str
'y_no_noise': float64
'y': float64
----------------------------------------
Row fields:
'rsid': str
'beta': float64
----------------------------------------
Entry fields:
'gt': int32
'norm_gt': float64
----------------------------------------
Column key: ['s']
Row key: ['rsid']
----------------------------------------
sim.beta
are the SNP effects for the simulated trait and have the approximate distribution N(0,h2
/M), where M is the number of SNPs.sim.y_no_noise
is equivalent to the dot product of the genotype matrix by the vector of SNP effects and is modeled to be distributed as N(0,h2
).sim.y
is the phenotype after adding random environmental noise. Random environmental noise is scaled to be distributed as N(0,1-h2
). They_no_noise
and the random environmental noise are independent, therefore their sum will have variance equal to the sum of their variances:h2
+ (1-h2
) = 1. Both phenotype and the random noise have mean equal to zero, therefore the mean ofy
is also 0. Thus,y
is modeled to have distribution N(0,1), just like any real phenotype that has been standardized.sim.norm_gt
is the normalized genotype used to calculate the phenotype. Genotypes are normalized such that at a given SNP, the distribution of genotypes across individuals is approximately N(0,1).sim.ldscsim.h2
is theh2
parameter passed to the simulation.
To check heritability of simulated trait:
>>> hl.eval(sim.ldscsim.h2) # the expected h2 passed as a parameter
0.1
>>> sim.aggregate_cols(hl.agg.stats(sim.y_no_noise)).stdev**2 # calculating observed h2
0.09413353719961155
Simulation specifications:
- Multi-trait
- Infinitesimal model
- Trait 1 heritability = 0.3
- Trait 2 heritability = 0.4
- Genetic correlation = 0.6
>>> sim = simulate_phenotypes(mt=mt, genotype=mt.gt, h2=[0.3, 0.4], rg=0.6)
>>> sim.describe()
----------------------------------------
Global fields:
'ldscsim': struct {
h2: array<float64>,
rg: float64
}
----------------------------------------
Column fields:
's': str
'y_no_noise': array<float64>
'y': array<float64>
----------------------------------------
Row fields:
'rsid': str
'beta': array<float64>
----------------------------------------
Entry fields:
'gt': int32
'norm_gt': float64
----------------------------------------
Column key: ['s']
Row key: ['rsid']
----------------------------------------
Compared to the previous simulation, the main changes are that beta
,y_no_noise
, and y
are all arrays because they hold values for multiple traits.
>>> sim.cols().show()
+-----------+-----------------------+-----------------------+
| s | y_no_noise | y |
+-----------+-----------------------+-----------------------+
| str | array<float64> | array<float64> |
+-----------+-----------------------+-----------------------+
| "1002230" | [3.74e-02,9.02e-01] | [1.99e+00,1.44e-01] |
| "1006422" | [-1.52e+00,-1.07e+00] | [-1.06e+00,-9.58e-01] |
| "1011645" | [-4.12e-01,-7.80e-01] | [-2.21e-01,-2.42e-01] |
| "1014723" | [7.33e-01,6.17e-01] | [1.53e+00,-7.43e-01] |
| "1021993" | [2.58e-01,-2.40e-01] | [-6.44e-02,5.31e-01] |
| "1022263" | [7.90e-01,8.05e-01] | [2.34e+00,3.04e-01] |
| "1029991" | [8.99e-02,1.87e-01] | [-7.68e-01,5.34e-01] |
| "1034880" | [-8.72e-02,-1.08e-01] | [-5.53e-01,-1.39e+00] |
| "1040108" | [1.03e-02,-8.61e-01] | [2.39e-01,-7.03e-01] |
| "1046474" | [-3.16e-01,1.34e-01] | [-1.54e+00,9.61e-01] |
+-----------+-----------------------+-----------------------+
Each field of arrays is indexed by trait. For instance, mt.y_no_noise[0]
is the y_no_noise
field for the first trait and mt.y[0]
is the y
field corresponding to the first trait, and mt.y_no_noise[1]
is the y_no_noise
field for the second trait and mt.y[1]
is the y
field corresponding to the second trait. The same rule applied for mt.beta[0]
and mt.beta[1]
.
To check the heritabilities and genetic correlations between traits:
>>> hl.eval(sim.ldscsim.h2) # the expected h2 values passed as parameters
[0.3, 0.4]
>>> [sim.aggregate_cols(hl.agg.stats(sim.y_no_noise[x])).stdev**2 for x in range(2)] # calculating observed h2
[0.30562739116200704, 0.43268586014716076]
Simulation specifications:
- Multi-trait
- Infinitesimal model
- Trait 1 heritability = 0.1
- Trait 2 heritability = 0.2
- Trait 3 heritability = 0.7
- Genetic correlation trait 1 & trait 2 = 0.8
- Genetic correlation trait 1 & trait 3 = 0.5
- Genetic correlation trait 2 & trait 3 = 0.4
>>> sim = simulate_phenotypes(mt=mt, genotype=mt.gt, h2=[0.1, 0.2, 0.7], rg=[0.8, 0.5, 0.4])
>>> sim.describe()
----------------------------------------
Global fields:
'ldscsim': struct {
h2: array<float64>,
rg: array<float64>
}
----------------------------------------
Column fields:
's': str
'y_no_noise': array<float64>
'y': array<float64>
----------------------------------------
Row fields:
'rsid': str
'beta': array<float64>
----------------------------------------
Entry fields:
'gt': int32
'norm_gt': float64
----------------------------------------
Column key: ['s']
Row key: ['rsid']
----------------------------------------
This produces a similar MatrixTable to the previous simulation. However, all array fields have three elements rather than two.
>>> sim.cols().show()
+-----------+---------------------------------+---------------------------------+
| s | y_no_noise | y |
+-----------+---------------------------------+---------------------------------+
| str | array<float64> | array<float64> |
+-----------+---------------------------------+---------------------------------+
| "1002230" | [-3.47e-01,-4.53e-01,1.09e-01] | [-1.50e+00,2.08e-01,7.32e-01] |
| "1006422" | [-3.44e-01,-4.23e-01,3.10e-01] | [-8.73e-01,7.69e-01,5.73e-01] |
| "1011645" | [-2.09e-01,-4.78e-01,-5.55e-01] | [-6.01e-01,-5.94e-01,-6.69e-01] |
| "1014723" | [5.52e-01,1.72e-01,-3.88e-01] | [1.44e+00,3.07e-01,-1.52e-01] |
| "1021993" | [2.89e-02,-1.43e-01,-1.12e-01] | [-4.30e-01,-1.93e+00,-6.49e-03] |
| "1022263" | [6.97e-03,2.57e-01,2.13e-01] | [-1.25e+00,1.12e+00,1.99e-01] |
| "1029991" | [-2.84e-01,-3.34e-01,5.57e-01] | [6.09e-01,-1.91e+00,2.34e-01] |
| "1034880" | [-1.87e-01,3.98e-02,-5.90e-01] | [-1.28e+00,-2.08e+00,-3.99e-01] |
| "1040108" | [4.09e-01,3.11e-01,-5.32e-01] | [1.70e+00,8.05e-01,-1.07e+00] |
| "1046474" | [1.39e-01,-1.63e-01,2.44e-01] | [1.26e-01,2.37e-01,1.22e-01] |
+-----------+---------------------------------+---------------------------------+
Simulation specifications:
- Single trait
- Spike & slab
- Heritability = 0.1
- Probability of a SNP being causal (
pi
) = 0.01
>>> sim = simulate_phenotypes(mt=mt, genotype=mt.gt, h2=0.1, pi=0.01)
>>> sim.describe()
----------------------------------------
Global fields:
'ldscsim': struct {
h2: float64,
pi: float64
}
----------------------------------------
Column fields:
's': str
'y_no_noise': float64
'y': float64
----------------------------------------
Row fields:
'rsid': str
'beta': float64
----------------------------------------
Entry fields:
'gt': int32
'norm_gt': float64
----------------------------------------
Column key: ['s']
Row key: ['rsid']
----------------------------------------
Simulation specifications:
- Two traits
- Spike & slab
- Trait 1 heritability = 0.8
- Trait 2 heritability = 0.9
- Genetic correlation = 0.5
- Probability a SNP is causal for both traits = 0.3
- Probability SNP is causal for trait 1 but not trait 2 = 0.1
- Probability SNP is causal for trait 2 but not trait 1 = 0.2
>>> sim = simulate_phenotypes(mt=mt, genotype=mt.gt, h2=[0.8, 0.9], pi=[0.3, 0.1, 0.2], rg =0.5)
Assume for this example we begin with the original MatrixTable shown above but we add new fields that represent annotations:
>>> mt1 = mt.annotate_rows(a1 = hl.rand_bool(p=0.01),
a2 = hl.rand_bool(p=0.05),
a3 = hl.rand_norm())
>>> mt1.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'rsid': str
'a1': bool
'a2': bool
'a3': float64
----------------------------------------
Entry fields:
'gt': int32
----------------------------------------
Column key: ['s']
Row key: ['rsid']
----------------------------------------
a1
,a2
,a3
: Annotations we want to use for an annotation-informed model
We can aggregate those annotations as a linear combination using default coefficients to scale each field before summing across fields.
>>> mt2 = agg_fields(tb=mt1, str_expr='a', axis='rows') # aggregate across row fields using default coefficients (all equal to 1)
Assuming coef = 1 for all annotations
Fields and associated coefficients used in annot aggregation: {'a1': 1, 'a2': 1, 'a3': 1}
We can also use a dictionary to specify coefficients to use for each row field when aggregating as a linear combination.
>>> coef_dict = {'a1':0.3, 'a2':0.1, 'a3':0.4} # coefficients to multiply each field by before summing
>>> mt2 = agg_fields(tb=mt1, coef_dict=coef_dict, axis='rows') # aggregate across row fields, specifying coefficients
Fields and associated coefficients used in annot aggregation: {'a1': 0.3, 'a2': 0.1, 'a3': 0.4}
In either case, the output MatrixTable will be:
>>> mt2.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'rsid': str
'a1': bool
'a2': bool
'a3': float64
'agg_annot': float64
----------------------------------------
Entry fields:
'gt': int32
----------------------------------------
Column key: ['s']
Row key: ['rsid']
----------------------------------------
Note that when using agg_fields()
, specifying str_expr
but not coef_dict
can unintentionally include fields in the aggregation with names that match str_expr
. For instance, if we try using mt2
as our input MatrixTable:
>>> mt3 = agg_fields(tb=mt2, str_expr='a', axis='rows')
Assuming coef = 1 for all annotations
Fields and associated coefficients used in annot aggregation: {'a1': 1, 'a2': 1, 'a3': 1, 'agg_annot': 1}
Because our str_expr
matches the field agg_annot
it also includes it in the fields to be aggregated. To avoid this, either rename the desired target fields to have a more unique pattern to search for, or drop agg_annot
.
See docstring of make_betas()
for more information
[In development]
mt.PC1
,mt.PC2
,mt.PC3
: Covariates we want to use for population stratification