Read the final paper on Google Drive
Human height is a complex polygenic trait that has been difficult to predict because it is influenced by potentially thousands of interacting genetic loci. Genetics accounts for approximately 80% of a person’s height; the remaining 20% is attributed to environmental factors such as nutrition. Despite the overwhelming influence of hereditary factors, there are currently no genetically based models used to predict a person’s height.
Here we present a predictive model that combines a Genome-Wide Association Study (GWAS) with a Support Vector Machine (SVM) to address this problem. The results of this study are relevant to the practice of pediatric endocrinology and embryonic screening. This approach serves as a proof of concept for the classification of other polygenic traits and diseases.
GWAS is a biostatistical approach for interrogating single-nucleotide polymorphisms (SNPs) that commonly arise in populations and determining whether they are associated with a disease or phenotypic trait.
In this project, the GWAS was implemented as a linear regression on reported height, with principal components included to correct for population stratification (see Method Overview for more details).
Geneticists and epidemiologists have used GWAS to study genetic variation in diseases such as asthma, cancer, diabetes, and heart disease, among others.
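
To make the idea of a per-SNP association test concrete, here is a minimal sketch in Python (the project itself used R). It is an illustration under stated assumptions, not the project's code: it regresses height on a single SNP's numeric coding, leaves out the principal-component covariates, and replaces the exact t distribution with a normal approximation for the p-value. The function name `snp_association` and the toy data are invented for this example.

```python
import math
from statistics import mean


def snp_association(genotypes, heights):
    """Regress height on one SNP's numeric genotype coding.

    Returns (slope, t_statistic, approximate two-sided p-value).
    Assumption: no covariates (the project also included principal components).
    """
    n = len(genotypes)
    if n != len(heights) or n < 3:
        raise ValueError("need matching inputs with at least 3 individuals")
    g_bar, h_bar = mean(genotypes), mean(heights)
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("SNP is monomorphic; no slope can be estimated")
    sxy = sum((g - g_bar) * (h - h_bar) for g, h in zip(genotypes, heights))
    slope = sxy / sxx
    intercept = h_bar - slope * g_bar
    # Residual sum of squares of the fitted line.
    rss = sum((h - (intercept + slope * g)) ** 2 for g, h in zip(genotypes, heights))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0:
        # Perfect fit: the statistic is unbounded unless the slope is zero.
        return (slope, math.inf, 0.0) if slope != 0 else (0.0, 0.0, 1.0)
    t_stat = slope / std_err
    # Normal approximation to the t distribution (assumption: large sample).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return slope, t_stat, p_value


# Toy data (invented): minor-allele counts and heights in cm.
toy_genotypes = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
toy_heights = [160, 162, 168, 170, 176, 178, 161, 169, 177, 167]
slope, t_stat, p_value = snp_association(toy_genotypes, toy_heights)
print(f"slope={slope:.2f} cm per allele, t={t_stat:.2f}, p~{p_value:.3g}")
```

In a real GWAS this test is repeated for every SNP, which is why a multiple hypothesis test correction is needed.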
The GWAS Workflow is Outlined in the Figure Below
Read the final paper for more details. Although not explicitly stated here, a multiple hypothesis test correction was applied in key steps below.
- Downloaded SNP array data for 313 individuals from OpenSNP
- Perl was used to parse, clean, and wrangle the raw SNP array data into a genotype matrix
- Genotype and phenotype matrices were imported into R
- Checked Linear Regression Model Assumptions
- Performed a computationally intensive filtering protocol (Hardy-Weinberg equilibrium chi-square test) to ensure SNP genomic quality
- Alleles were encoded into numeric values based on the major and minor allele frequency
- Created additive genotype (Xa) and dominant genotype (Xd) matrices as required by EIGENSTRAT
- EIGENSTRAT was used to perform a Principal Component Analysis, which corrects for population stratification
- The first 10 principal components were used to fit a linear regression model to the reported heights of the individuals
- An ANOVA likelihood test was applied to the fitted linear regression model to obtain p-values for each SNP
- The p-values were used to generate a QQ plot and a Manhattan plot to evaluate the results
- Obtained a list of the top 500 SNPs in order of statistical significance (p-value)
- The alleles associated with the significant SNPs were mapped to Euclidean space and assigned a score (see the paper for an explanation of this step). These values were scaled and normalized
- k-means clustering (k=2 and k=3) was used to assign individuals to 2 (binary) and 3 (multi-class) height classes
- The goal here was to stratify the height classes into relatively equal clusters based on variance
- Split data into training and validation set
- Trained a Support Vector Machine for Classification
- Used the list of p-value-prioritized SNPs and allele-encoded values to make predictions
- Used 10-fold cross-validation to identify optimal parameters
- Evaluated model performance on the test set
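
Two of the data-preparation steps above, the Hardy-Weinberg filter and the allele encoding, can be sketched in a few lines of Python (the project used Perl and R). This is a hedged illustration, not the project's code. The function names are hypothetical, and the additive/dominant coding shown (Xa in {-1, 0, 1}, Xd equal to 1 for heterozygotes and -1 otherwise) is one common convention assumed here; the paper describes the encoding actually used.

```python
import math
from collections import Counter


def hwe_chi_square(n_aa, n_ab, n_bb):
    """1-df chi-square goodness-of-fit test against Hardy-Weinberg proportions.

    Takes observed genotype counts and returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def minor_and_major(genotypes):
    """Return (minor, major) alleles by frequency for a biallelic SNP."""
    counts = Counter("".join(genotypes)).most_common()
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles")
    return counts[1][0], counts[0][0]


def encode(genotype, minor):
    """Map a genotype such as 'AG' to (Xa, Xd) under the assumed coding."""
    minor_count = genotype.count(minor)
    xa = minor_count - 1                 # -1, 0 or 1
    xd = 1 if minor_count == 1 else -1   # 1 only for heterozygotes
    return xa, xd


# Toy genotype calls for one SNP (invented for illustration).
calls = ["AA", "AG", "GG", "AA", "AG", "AA"]
minor, major = minor_and_major(calls)
counts = Counter("".join(sorted(c)) for c in calls)
chi2, p_value = hwe_chi_square(counts[major * 2], counts["".join(sorted(minor + major))], counts[minor * 2])
print(minor, major, [encode(c, minor) for c in calls], round(chi2, 3), round(p_value, 3))
```

A SNP would typically be dropped when its Hardy-Weinberg p-value falls below a chosen threshold; the threshold used in this project is not stated here.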
The results below reflect the best-performing model after 10-fold cross-validation
- The binary classification (tall vs short) model achieved a predictive accuracy of 86%
- The multi-class classification achieved a predictive accuracy of 72%
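
For readers unfamiliar with how a cross-validated accuracy is obtained, the sketch below shows the mechanics in plain Python. It is an illustration only: to keep it dependency-free, a simple nearest-centroid classifier stands in for the Support Vector Machine, and the data are invented, so the numbers it prints say nothing about the results reported above. All function names are hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Stand-in classifier (NOT an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(row, centroids[label])),
    )


def cross_val_accuracy(X, y, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, and pool the results."""
    correct = 0
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)


# Invented, well-separated toy data: two features per individual.
X = [[i * 0.01, 0.0] for i in range(20)] + [[5 + i * 0.01, 1.0] for i in range(20)]
y = ["short"] * 20 + ["tall"] * 20
print(f"10-fold accuracy on toy data: {cross_val_accuracy(X, y):.2f}")
```

Each individual is scored exactly once, by a model that never saw that individual during training, which is what makes the pooled accuracy a fair estimate.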
- only 313 individuals, low statistical power
- unable to account for common covariates; individuals were missing AGE and SEX
- Height is a sexually dimorphic trait; on average men are taller than women in all human populations
- having this information would contribute to predictive accuracy
- obtain SNP array data from more individuals
- examine data for Linkage Disequilibrium (LD)
- Regional Gene Plot to Visualize LD
- Haplotype Analysis
- Examine for Epistasis
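
As a starting point for the LD item above, pairwise LD between two SNPs is often summarized as r², which can be approximated by the squared Pearson correlation of the two SNPs' allele counts. The Python sketch below assumes genotypes are already coded as 0, 1, or 2 copies of an allele; the function name `ld_r2` and the example vectors are invented, and this has not been run on the project's data.

```python
from statistics import mean


def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between two SNPs' allele counts (0/1/2)."""
    if len(dosage_a) != len(dosage_b) or len(dosage_a) < 2:
        raise ValueError("need two equal-length vectors with at least 2 individuals")
    mean_a, mean_b = mean(dosage_a), mean(dosage_b)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosage_a, dosage_b))
    var_a = sum((a - mean_a) ** 2 for a in dosage_a)
    var_b = sum((b - mean_b) ** 2 for b in dosage_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return cov * cov / (var_a * var_b)


# Invented examples: identical SNPs versus unrelated SNPs.
print(ld_r2([0, 1, 2, 1, 0], [0, 1, 2, 1, 0]))
print(ld_r2([0, 0, 2, 2], [0, 2, 0, 2]))
```

SNP pairs with high r² carry largely redundant information, so pruning them could reduce the number of correlated features passed to the classifier.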