HeriVar

Quantifying the combined heritability of a trait based on a multi-ethnic LD panel with equal distribution of samples among each ancestry group.

Background

Heritability of a trait is usually estimated and reported separately for each ancestry group, which limits the ability to estimate and report a combined heritability for a multi-ethnic population. Several recent methods estimate heritability robustly, with or without individual-level data, but they are limited to ancestry-specific groups. In this project, we propose a way to calculate combined heritability using a multi-ethnic reference linkage disequilibrium (LD) panel that contains an equal number of samples from each ancestry group. We use existing tools to simulate and calculate heritability, and report the workflow as a framework that can be implemented and explored further. The aim is a novel approach to estimating the heritability of particular traits in multi-ethnic populations. As a part of Team HeriVar, you will contribute to demonstrating the methodology, calculating heritability, and working as a team to promote the method.

With the increasing availability of multi-ethnic whole genome sequence datasets, there is a gaping absence of an approach to estimate the heritability of a particular phenotypic trait that accounts for multi-ethnic genetic architecture. Calculating the combined multi-ethnic heritability in this way has not been pursued previously. This project helps us understand the obstacles to such an estimate and generates a framework, built on existing tools, to calculate and assess the heritability of a trait in multi-ethnic populations.

Data

Tools

Process

Dependencies

  • LDSC requires Anaconda3 or Python 2.7, plus packages such as bitarray, nose, pybedtools, scipy, numpy, pandas, and bioconda (these are installed when the conda environment is created).
  • SumHer depends on the Intel MKL libraries ( module load imkl/2020.1.217-iimpi-2020a ).

Installation

  • LDSC ( everyone must install it in their own home directory to use it )

    • Clone the ldsc GitHub repository ( git clone https://github.com/bulik/ldsc.git ) and cd into the folder
    • Load Anaconda3 ( module load Anaconda3 )
    • Install the dependencies with conda, as suggested in the GitHub repository ( conda env create --file environment.yml )
    • Activate the ldsc environment ( source activate ldsc )
    • Test the installation by running the Python scripts shipped as part of the repo ( ./ldsc.py -h )
  • SumHer

    • Download the LDAK Linux executable by submitting your name and email ( first-time users receive an email from the developer with the download links )
    • Unzip the executable and use it ( /data/project/ubrite/hackathon2022/staging_area_teams/HeriVar/Tools/ldak5.2.linux, which is accessible to everyone )
    • An executable for Mac users is also available. Note: Please check Dependencies before installing the tools.
  • LiftOver

Results

  • Datasets
    • We downloaded the 1000 Genomes high-coverage reference dataset from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/.

    • We then extracted the individuals' files and randomly chose 489 unrelated individuals from each ancestry group.

    • The rationale for sampling an equal number of individuals from each ancestry group is to give each group's LD patterns equal weight in the panel.

    • Admixed populations and related individuals were excluded from the analysis, leaving 1956 individuals.

    • We removed variants with a minor allele frequency below 1% and variants with more than 5% missing data.
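
The balanced sampling described above can be sketched as follows. This is a minimal illustration, not the team's script: the `sample_balanced` helper, the group labels, and the toy panel sizes are all hypothetical, and the use of four groups is only inferred from 1956 / 489.

```python
import random
from collections import defaultdict


def sample_balanced(sample_to_group, n_per_group, seed=0):
    """Randomly pick n_per_group sample IDs from every ancestry group."""
    by_group = defaultdict(list)
    for sample_id, group in sample_to_group.items():
        by_group[group].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    chosen = {}
    for group in sorted(by_group):
        members = sorted(by_group[group])
        if len(members) < n_per_group:
            raise ValueError(f"group {group} has only {len(members)} samples")
        chosen[group] = rng.sample(members, n_per_group)  # without replacement
    return chosen


# Hypothetical toy panel: four groups of unequal size (labels and sizes are made up).
toy_panel = {
    f"{group}_{i}": group
    for group, size in [("G1", 600), ("G2", 520), ("G3", 500), ("G4", 489)]
    for i in range(size)
}
balanced = sample_balanced(toy_panel, n_per_group=489)
total_selected = sum(len(ids) for ids in balanced.values())
print(total_selected)  # 4 groups x 489 samples
```

Capping every group at the size of the smallest one is what keeps the panel balanced; related and admixed individuals would need to be dropped from `toy_panel` before sampling.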

                               Allele frequency distribution within each ancestry group and overall.
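
To make the two variant filters concrete (minor allele frequency below 1%, missingness above 5%), here is a small numpy sketch on a toy genotype matrix. It is an illustration under assumptions: the `qc_variant_mask` function and the toy data are hypothetical, and the project itself applied the filters with its own tooling rather than with this code.

```python
import numpy as np


def qc_variant_mask(genotypes, min_maf=0.01, max_missing=0.05):
    """Return (keep, maf, missing_rate) for a samples x variants matrix.

    Genotypes are alternate-allele counts (0/1/2); missing calls are np.nan.
    """
    g = np.asarray(genotypes, dtype=float)
    missing_rate = np.isnan(g).mean(axis=0)
    n_called = (~np.isnan(g)).sum(axis=0)
    alt_count = np.nansum(g, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        alt_freq = alt_count / (2 * n_called)  # two alleles per called sample
    maf = np.minimum(alt_freq, 1 - alt_freq)
    keep = (maf >= min_maf) & (missing_rate <= max_missing)
    return keep, maf, missing_rate


# Toy data: 100 samples, 4 variants (made up for illustration).
toy = np.zeros((100, 4))
toy[:50, 0] = 1          # variant 0: common, alt freq 0.25
toy[0, 1] = 1            # variant 1: rare, alt freq 0.005
toy[:10, 2] = np.nan     # variant 2: 10% missing calls
toy[10:60, 2] = 1
toy[:80, 3] = 2          # variant 3: alt freq 0.9, so MAF is 0.1
toy[80:, 3] = 1

keep, maf, missing_rate = qc_variant_mask(toy)
print(keep.tolist())
```

Note that the minor allele frequency is the smaller of the two allele frequencies, so a variant with an alternate allele frequency of 0.9 passes the 1% cutoff.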

  • PCA Analysis
    • We used PLINK to run a principal component analysis and check whether the samples are distributed equally across ancestry groups.

                                         PC distributions stratified by ancestry
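
The idea behind the PCA check can be sketched with numpy on simulated genotypes. PLINK did the actual computation in this project; the `genotype_pca` function, the two simulated groups, and their allele frequencies below are assumptions made only for illustration.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Project samples onto the top principal components of a genotype matrix."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)
    scale = centered.std(axis=0)
    polymorphic = scale > 0  # drop variants with no variation
    standardized = centered[:, polymorphic] / scale[polymorphic]
    u, s, _ = np.linalg.svd(standardized, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


rng = np.random.default_rng(0)
n_variants = 200
# Two hypothetical groups whose allele frequencies differ (simulated, not real data).
freq_a = rng.uniform(0.1, 0.9, size=n_variants)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.3, size=n_variants), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(30, n_variants)),
    rng.binomial(2, freq_b, size=(30, n_variants)),
])
labels = np.array(["A"] * 30 + ["B"] * 30)

pcs = genotype_pca(geno)
for group in ("A", "B"):
    print(group, int((labels == group).sum()), float(pcs[labels == group, 0].mean()))
```

With equal group sizes, the groups should appear as clusters of similar size along the leading components, which is the kind of balance the PC plots are meant to show.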

  • Pruning & Thresholding
    • After subsetting to the samples of interest, we performed pruning and thresholding (P + T) with different cutoffs.
    • PLINK was used to generate the required files.
    • We varied the R-squared cutoff and the window size.
      • R-squared cutoffs of 0.2, 0.4, 0.6, and 0.8.

      • Window sizes of 250 kb, 500 kb, 1 Mb, and 10 Mb.

                                        Distribution of variants after P + T
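
The parameter grid above (four R-squared cutoffs by four window sizes) can be enumerated as below. This is a sketch under assumptions: the source does not say which PLINK command was used, so the `--indep-pairwise` argument lists are only one plausible choice, and the `pruning_jobs` helper, file prefix, and output naming are hypothetical.

```python
from itertools import product

R2_CUTOFFS = [0.2, 0.4, 0.6, 0.8]
WINDOWS_KB = {"250kb": 250, "500kb": 500, "1Mb": 1000, "10Mb": 10000}


def pruning_jobs(bfile_prefix, out_dir):
    """Build one PLINK argument list per (r2 cutoff, window size) combination."""
    jobs = []
    for r2, (label, kb) in product(R2_CUTOFFS, WINDOWS_KB.items()):
        out_prefix = f"{out_dir}/prune_r2_{r2}_win_{label}"
        # Assumption: LD pruning via --indep-pairwise <window>kb <step> <r2>.
        args = [
            "plink", "--bfile", bfile_prefix,
            "--indep-pairwise", f"{kb}kb", "1", str(r2),
            "--out", out_prefix,
        ]
        jobs.append(args)
    return jobs


# Hypothetical input prefix and output directory.
jobs = pruning_jobs("panel_chr22", "pt_out")
print(len(jobs))
for args in jobs[:2]:
    print(" ".join(args))
```

Running one such job per chromosome and per dataset category multiplies the 16 combinations quickly, which is why the work is usually submitted as cluster jobs.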

  • We ran nearly 1000 jobs on Cheaha to generate these datasets.

  • We decided to exclude high-LD regions, as recommended by the tools.

  • We split the datasets into two categories.

    • Pre high-LD region removal.
    • Post high-LD region removal.
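
The split into pre- and post-removal variant sets can be sketched as a simple interval filter. The region coordinates below are placeholders chosen for illustration, not the list the team used, and the helper names and toy variants are hypothetical.

```python
def in_high_ld(chrom, pos, regions):
    """True if (chrom, pos) falls inside any (chrom, start, end) region, inclusive."""
    return any(chrom == c and start <= pos <= end for c, start, end in regions)


def split_by_high_ld(variants, regions):
    """Return (pre, post): all variants, and those outside the high-LD regions."""
    pre = list(variants)
    post = [v for v in pre if not in_high_ld(v[0], v[1], regions)]
    return pre, post


# Placeholder regions (illustrative coordinates only).
PLACEHOLDER_REGIONS = [
    ("6", 25_000_000, 35_000_000),
    ("8", 8_000_000, 12_000_000),
]

# Toy variants as (chromosome, position, id).
toy_variants = [
    ("6", 24_999_999, "rsA"),
    ("6", 30_000_000, "rsB"),
    ("8", 10_000_000, "rsC"),
    ("22", 30_000_000, "rsD"),
]

pre_removal, post_removal = split_by_high_ld(toy_variants, PLACEHOLDER_REGIONS)
print([v[2] for v in post_removal])
```

Keeping both sets makes it possible to compare downstream LD panels with and without the long-range LD regions.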
  • Reference panel generation

    • We used the two categories mentioned above and two tools to calculate reference LD panels.

    • We used LDSC to generate LD scores for all of the categories.

                                          LD_scores Distribution for Chromosome 22
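
For readers new to LD scores: the LD score of a variant is the sum of its squared correlations (r²) with the variants in a window around it, itself included. The sketch below computes that quantity on a toy matrix. It is a simplified illustration, not LDSC's implementation: LDSC applies a bias adjustment to r² and offers several window units, and the `ld_scores` function and toy data here are hypothetical.

```python
import numpy as np


def ld_scores(genotypes, positions, window_bp):
    """Sum of r^2 between each variant and all variants within window_bp of it."""
    g = np.asarray(genotypes, dtype=float)
    pos = np.asarray(positions)
    r2 = np.corrcoef(g, rowvar=False) ** 2  # variants are columns
    within = np.abs(pos[:, None] - pos[None, :]) <= window_bp
    return (r2 * within).sum(axis=1)


# Toy data: 4 samples, 4 variants. Variants 0, 1 and 3 are identical;
# variant 2 is uncorrelated with them. Variant 3 sits far away.
a = [0, 0, 2, 2]
b = [0, 2, 0, 2]
toy_geno = np.array([a, a, b, a]).T
toy_pos = [100, 200, 500, 5_000_000]

narrow = ld_scores(toy_geno, toy_pos, window_bp=1_000)
wide = ld_scores(toy_geno, toy_pos, window_bp=10_000_000)
print(narrow.round(3).tolist())
print(wide.round(3).tolist())
```

The two calls show how the window size changes the score: the distant copy of a variant only contributes once the window is wide enough to reach it.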
      

  • For the LDAK annotations, we used LiftOver to convert the blk annotations from GRCh37 to GRCh38 and are working on generating the tagging files.

    • We had an issue generating the LDAK annotation files and decided to continue this analysis after the hackathon.
  • Phenotypes Processing

    • We also processed the phenotypes as suggested by the tools.
  • Heritability

    • We tried to generate h2 values using LDAK and LDSC but were unable to finish because of last-minute issues.
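
As background for this unfinished step, the core relationship behind LD score regression is that the expected chi-square statistic of variant j grows linearly with its LD score: E[chi2_j] = (N * h2 / M) * l_j + intercept, where N is the GWAS sample size and M the number of variants. The sketch below recovers h2 from simulated, noise-free data with ordinary least squares. It is only an illustration of the formula: LDSC itself uses a weighted regression with jackknife standard errors, and the `ldsc_h2_ols` function and all numbers here are made up.

```python
import numpy as np


def ldsc_h2_ols(chi2, ld, n_samples, n_variants):
    """Estimate h2 and the intercept from chi2 ~ intercept + slope * LD score."""
    chi2 = np.asarray(chi2, dtype=float)
    ld = np.asarray(ld, dtype=float)
    design = np.column_stack([np.ones_like(ld), ld])
    (intercept, slope), *_ = np.linalg.lstsq(design, chi2, rcond=None)
    h2 = slope * n_variants / n_samples  # slope equals N * h2 / M
    return h2, intercept


# Simulated, noise-free example (all values are hypothetical).
true_h2 = 0.4
n_samples = 50_000
n_variants = 1_000
toy_ld = np.linspace(1.0, 200.0, n_variants)
toy_chi2 = 1.0 + (n_samples * true_h2 / n_variants) * toy_ld

h2_hat, intercept_hat = ldsc_h2_ols(toy_chi2, toy_ld, n_samples, n_variants)
print(round(h2_hat, 6), round(intercept_hat, 6))
```

With real summary statistics the chi-square values are noisy, so the estimate depends heavily on how well the reference LD scores match the study population, which is the motivation for the multi-ethnic panel built here.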

Team Members