GWAS and Height Classification

Boston University • ENG BE 562: Computational Biology • Samuel Moijueh, Demarcus Briers • 12/2014

Read the final paper on Google Drive

Motivation

Human height is a complex polygenic trait that has been difficult to predict because it is influenced by potentially thousands of interacting genetic loci. Genetics accounts for approximately 80% of the variation in a person’s height; the remaining 20% is largely attributed to environmental factors such as nutrition. Despite the overwhelming influence of hereditary factors, there are currently no genetically based models used to predict a person’s height.

Here we present a predictive model, built from a Genome-Wide Association Study (GWAS) and a Support Vector Machine (SVM), to unravel this genetic enigma. The results of this study are relevant to the practice of pediatric endocrinology and embryonic screening. This novel approach serves as a proof of concept for the classification of other polygenic traits and diseases.

What is a Genome Wide Association Study (GWAS)?

GWAS is a biostatistical approach for interrogating SNPs that commonly arise in populations and determining whether they are associated with a disease or phenotypic trait.

In this project, the GWAS was implemented as a linear regression on the principal components (see Method Overview for more details).

Geneticists and epidemiologists have used GWAS to study genetic variation in diseases such as asthma, cancer, diabetes, and heart disease, among others.

The GWAS Workflow is Outlined in the Figure Below

Method Overview

Read the final paper for more details. Although not explicitly stated here, a correction for multiple hypothesis testing was applied in key steps below.
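
As an illustration of what a multiple-testing correction does, here is a minimal sketch of the Bonferroni adjustment. The choice of Bonferroni is an assumption made for the example; this README does not say which correction the project used (see the final paper). The function names and the p-values are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values (p * number of tests, capped at 1.0)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant(p_values, alpha=0.05):
    """Flag tests whose adjusted p-value is at or below alpha."""
    return [p_adj <= alpha for p_adj in bonferroni_adjust(p_values)]


# Hypothetical p-values for four SNPs (illustrative only, not project data).
example_p = [0.001, 0.02, 0.04, 0.6]
print(bonferroni_adjust(example_p))
print(significant(example_p))
```

With hundreds of thousands of SNPs tested, an uncorrected threshold of 0.05 would flag many SNPs by chance alone, which is why a correction of this kind is needed.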

Genomic Association

Downloaded SNP Array Data of 313 individuals from OpenSNP.
Perl was used to parse, clean, and wrangle the raw SNP array data into a genotype matrix
Genotype and phenotype matrices were imported into R
Checked Linear Regression Model Assumptions
Performed a computationally intensive filtering protocol (Hardy-Weinberg Equilibrium chi-square test) to ensure SNP genomic quality
Alleles were encoded into numeric values based on the major and minor allele frequency
Created additive genotype (Xa) and dominant genotype (Xd) matrices as required by EIGENSTRAT
EIGENSTRAT was used to perform a Principal Component Analysis which corrects for population-stratification
The first 10 principal components were used to fit a linear regression model to the reported heights of the individuals
An ANOVA likelihood test was applied to the fitted linear regression model to obtain p-values for each SNP
The p-values were used to generate a QQ plot and a Manhattan plot, which were used to evaluate the results
Obtained a list of the top 500 SNPs ranked by statistical significance (p-value)
The alleles associated with the significant SNPs were mapped to Euclidean space and assigned a score (see the paper for an explanation of this step). These values were scaled and normalized
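
Two of the steps above, the Hardy-Weinberg filter and the numeric genotype encoding, can be sketched in a few lines. The project itself did this in R and Perl; the Python below is only an illustration, and every function name is hypothetical. The encoding shown (Xa = -1, 0, 1 by minor-allele count; Xd = 1 for heterozygotes and -1 otherwise) is one common convention and is assumed here, not taken from the project code.

```python
import math
from collections import Counter


def minor_allele(genotypes):
    """Return the least frequent allele across genotype strings such as 'AG'."""
    counts = Counter("".join(genotypes))
    # Ties are broken alphabetically so the result is deterministic.
    return min(counts, key=lambda allele: (counts[allele], allele))


def encode(genotype, minor):
    """Encode one genotype as (Xa, Xd) under the assumed convention."""
    n_minor = genotype.count(minor)
    xa = n_minor - 1                  # -1, 0, 1 for 0, 1, 2 minor alleles
    xd = 1 if n_minor == 1 else -1    # heterozygote vs. homozygote
    return xa, xd


def hwe_chi_square(n_hom_major, n_het, n_hom_minor):
    """Chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium."""
    n = n_hom_major + n_het + n_hom_minor
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_hom_major + n_het) / (2 * n)   # major allele frequency
    q = 1.0 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_hom_major, n_het, n_hom_minor]
    # Skip cells with zero expected count (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotypes for one SNP (illustrative only, not project data).
calls = ["AA", "AG", "AA", "GG"]
minor = minor_allele(calls)
print([encode(g, minor) for g in calls])
print(hwe_chi_square(25, 50, 25))
```

A SNP whose HWE p-value falls below a chosen cutoff would be dropped before the association test; the cutoff itself is a design choice not stated in this README.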

Height Class Stratification

k-means clustering (k=2 and k=3) was used to assign individuals to 2 (binary) and 3 (multi-class) height classes.
- The goal here was to stratify the height classes into relatively equal clusters based on variance
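
The clustering step can be sketched as a one-dimensional k-means (Lloyd's algorithm). This is a minimal illustration that assumes the clustering was run on the reported height values alone; the function name, the quantile-based initialisation, and the heights are all hypothetical and are not taken from the project code.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster scalar values into k groups; returns (labels, centers)."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: centers at evenly spaced quantiles of the data.
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Recompute each center as its cluster mean; keep it if the cluster is empty.
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
    return labels, centers


# Hypothetical heights in cm (illustrative only, not project data).
heights = [150, 152, 155, 180, 182, 185]
labels, centers = kmeans_1d(heights, k=2)
print(labels)
print(centers)
```

Because the centers start in ascending order, label 0 corresponds to the shortest class and the highest label to the tallest.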

Support Vector Machine

Split data into training and validation sets
Trained a Support Vector Machine for classification
- Used the list of p-value-prioritized SNPs and allele-encoded values to make predictions
Used 10-fold cross-validation to identify optimal parameters
Evaluated model performance on the test set
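
The 10-fold cross-validation step can be sketched as follows. This shows only how the samples are partitioned and how accuracy is scored; the SVM training itself is omitted, and the function names, the seed, and the use of Python are assumptions for illustration (the project code is in R).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# 313 matches the number of individuals in this study.
for training, validation in k_fold_indices(313, k=10):
    print(len(training), len(validation))
```

For each candidate parameter setting, a model would be trained on every training split and scored on the matching validation split; the setting with the best mean accuracy is then evaluated once on the held-out test set.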

Results

The results below reflect the best performing model after a 10-fold cross validation

The binary classification (tall vs short) model achieved a predictive accuracy of 86%
The multi-class classification achieved a predictive accuracy of 72%

Limitations

Only 313 individuals, resulting in low statistical power
Unable to account for common covariates; individuals were missing AGE and SEX
- Height is a sexually dimorphic trait; on average men are taller than women in all human populations
- Knowing this information would likely contribute to the predictive accuracy

Future Direction

Obtain SNP array data from more individuals
Examine data for Linkage Disequilibrium (LD)
- Regional Gene Plot to Visualize LD
- Haplotype Analysis
- Examine for Epistasis
