Kymberleigh Pagel, Johns Hopkins University, Baltimore MD 21218, kpagel1@jhu.edu
Anna Chang, Johns Hopkins University, Baltimore MD 21205, achang44@jhmi.edu
Zhi Liu, NIH, Bethesda MD 20892, zhi.liu@nih.gov
Summer Rankin, Booz Allen Hamilton, Rockville MD 20852, rankin_summer@bah.com, summer.rankin@fda.hhs.gov
Danielle Rubin, NIH, Bethesda MD 20892, danielle.rubin@nih.gov
Chris Shin, NIH/NIAID, Bethesda MD 20892, chris.shin@nih.gov
Open-CRAVAT is a Python package that performs genomic variant interpretation. Its modular, locally installed command-line and GUI interfaces provide annotations of gene- and variant-level impact, interactions, conservation, and scoring. In this work, we extend the platform with new annotation sources to assist in the interrogation of genetic variation.
Example of the Open-CRAVAT interface
As of May 2019, the Open-CRAVAT Store has 65 annotators, converters, and visualization widgets available for use. Annotators range from genomic feature insights to population-level frequencies and clinical interpretations. Converters take input in forms other than VCF (Variant Call Format), such as dbSNP rsID and 23andMe formats, and make them compatible with Open-CRAVAT. There are currently three visualization widgets that can be downloaded: a summary of the top genes ranked by total variants, a haplotype information widget, and an embedded interactive genome visualization component based on the Integrative Genomics Viewer (IGV). Items in the store range from 2 KB to 29 GB, and they can be downloaded individually according to the user's research interests.
Home Page for the Open-CRAVAT store
Data Sources Accessible through Open-CRAVAT | Type of Data |
---|---|
1000 Genomes Project | allele frequencies |
The Cancer Genome Atlas | genome-wide chromatin accessibility profiles of tumor samples |
BioGRID | gene interactions |
BRCA1 Multiplex Assay | functional scores for SNVs |
Cancer Gene Census (CGC) | gene level data |
Cancer Gene Landscape | oncogenes and suppressor genes |
CIViC | clinical interpretation of variants in cancers |
ClinVar | relationships among human variations and phenotypes |
COSMIC | somatic mutations in cancer |
dbSNP | single nucleotide substitutions and short deletion and insertion polymorphisms |
denovo-db | germline de novo variants |
NHLBI GO Exome Sequencing Project (ESP) | exome variants |
Essential Genes | genetic variation and mutational burden in human orthologs |
ExAC Gene and CNV | probability of loss-of-function intolerance and CNV intolerance |
Flanking Sequence | reference and alternate sequences for flanking bases |
Gene Ontology | functions of genes and gene products |
GHIS | haploinsufficiency scores |
gnomAD | exome and genome sequencing data |
GRASP | genome-wide associations between SNPs and phenotypes |
GTEx | correlations between genotype and tissue-specific gene expression levels |
IntAct | molecular interaction data |
Mutation Assessor | prediction of functional impact of amino-acid substitutions in proteins |
NCBI Gene | gene descriptions |
ncRNA | non-coding RNA at variant location |
NDEx | biologic pathways |
p(HI) | gene-based haploinsufficiency predictions |
P(rec) | rare and likely deleterious loss-of-function alleles |
phastCons | phastCons scores for multiple alignments |
phyloP | conservation scoring for multiple alignments |
Promoter IR | interacting regions of promoters |
Pseudogene | annotations generated by GENCODE project |
PubMed | number of PubMed articles for a given gene |
Repeat Sequences | annotation of repeat regions |
RVIS | variation intolerance scoring |
SiPhy | conservation scores based on mammal genomes |
TARGET | genes directly linked to clinical action |
UK10K Cohorts | genetic information from 2 twin studies |
UniProt | protein sequence and annotation data |
VISTA Enhancer Browser | experimentally validated enhancers |
Analysis Tools Available through Open-CRAVAT | Function |
---|---|
CHASMplus | classification of missense mutations as drivers or passengers in human cancers |
FATHMM | prediction of functional effects of protein missense mutations |
GERP++ | quantification of substitution deficits in multiple alignments |
IGV | interactive genome visualization |
InterPro | functional analysis of proteins |
LINSIGHT | model generator for estimation of negative selection on noncoding sequences in the human genome |
LoFtool | gene intolerance ranking system |
MuPIT | mapping genomic coordinates of SNVs on 3D protein structures |
MutPred | classification of amino acid substitution |
PhD-SNPg | predictor for pathogenic variants in coding and non-coding regions |
REVEL | predictor for pathogenicity of missense variants |
VEST | predictor for functional significance of missense mutations based on the probability that they are pathogenic |
For local installation on Mac and Windows, see the Open-CRAVAT Quickstart guide.
Link to Hackathon Plan and Workflow Slides
(1) Add sources of single cell RNA-seq expression data
The Allen Brain Atlas includes a gene expression survey in multiple adult control brains that integrates anatomic and genomic information. The dataset includes more than 62,000 gene probes per profile, with around 500 samples per hemisphere across the cerebrum, cerebellum, and brainstem. In this work, we seek to create an Open-CRAVAT annotator that displays whether a given gene is expressed within different regions of the brain. A potential application for this annotator is supporting the analysis of variants putatively related to autism spectrum disorder (ASD) and other neurodevelopmental diseases, by identifying variants within genes that are expressed in the relevant brain regions.
(2) Incorporate additional representation for under-studied populations
The Human Genome Diversity Project (HGDP), carried out by scientists across several labs at Stanford University, analyzed DNA from 1,043 individuals in 51 populations from Africa, Europe, the Middle East, South and Central Asia, East Asia, Oceania, and the Americas. Details on the individuals included in this collection are described in H. Cann et al. Science 296: 261-262 (2002) and its Supplemental Data; Rosenberg et al. Science 298: 2381-2385 (2002); and Rosenberg et al. PLoS Genetics 1: 660-671 (2005).
In particular, Native American and Middle Eastern populations are not well represented in Open-CRAVAT. In addition, several subpopulations evaluated in this work are not represented in either the 1000 Genomes Project or gnomAD, two sources currently available in the Open-CRAVAT store. Because several subpopulations have small sample sizes (n<10), we report allele frequencies aggregated at the population level, which gives more meaningful values.
The Online Archive of Brazilian Mutations (ABraOM) is a variant repository containing genomic variants of Brazilians, with the goal of providing the community with the genetic variability found in Brazil. The initial deposited cohort comprises exomic variants of 609 elderly individuals from a census-based sample from the city of São Paulo. A total of 2,382,573 variants were called before filtering and are available in the ABraOM browser. Of that total, 1,264,224 are high confidence (GATK PASS flags and excluding CEGH-USP FDP/FAB flags), which we retain for use in Open-CRAVAT.
(3) Stretch Goal: Identify sources of curated gene lists for gene set enrichment analysis (GSEA)
Gene lists are groups of genes known to be influential in the development and/or maintenance of molecular pathways or diseases. We hope to use these gene lists in the following ways:
- developing a module that allows users to see if the variants in their uploaded file correspond to a significant proportion of genes in a list
- developing functionality to flag genes that are in already available lists or user-curated lists
To accomplish these tasks, we will need to carefully format the data and generate several accessory files necessary for incorporation into Open-CRAVAT.
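One common way to test whether a gene list is over-represented among the genes hit by a user's variants is a one-sided hypergeometric test. The sketch below is an illustration of that general technique, not the module planned here; the function name, the gene symbols used as input, and the background size are all made up for the example.

```python
from math import comb


def enrichment_p_value(n_background, n_in_list, n_hit_genes, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_background: number of genes that could have been hit
    n_in_list:    number of background genes in the curated list
    n_hit_genes:  number of genes carrying variants in the user's file
    n_overlap:    number of hit genes that are also in the curated list
    """
    max_overlap = min(n_in_list, n_hit_genes)
    total = comb(n_background, n_hit_genes)
    tail = sum(
        comb(n_in_list, k) * comb(n_background - n_in_list, n_hit_genes - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical inputs: a curated list and the genes hit in an uploaded file.
gene_list = {"G1", "G2", "G3", "G4", "G5"}
hit_genes = {"G1", "G2", "G3", "G9"}

flagged = hit_genes & gene_list  # genes to flag as members of the list
p_value = enrichment_p_value(
    n_background=20,
    n_in_list=len(gene_list),
    n_hit_genes=len(hit_genes),
    n_overlap=len(flagged),
)
print(sorted(flagged), p_value)
```

The set intersection covers the flagging use case, and the p-value covers the "significant proportion" use case. A real module would also need to choose the background gene set carefully and correct for testing many lists.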
Components necessary to create an annotator (from Open-CRAVAT wiki)
An Open-CRAVAT annotator consists of a Python file, a YAML file, a data directory, and a markdown file. The file structure is:
annotator/
|───annotator.md
|───annotator.yml
|───annotator.py
└───data/
annotator.md: The markdown file describes the module to prospective users.
annotator.yml: The YAML file defines the input and output interfaces between an annotator and the rest of Open-CRAVAT. The YAML file specifies what data will be fed to annotator.py, and what data Open-CRAVAT should expect annotator.py to return.
annotator.py: The Python module receives input data describing a single variant/gene, and uses it to look up additional information specific to that annotator. An annotator.py works by extending a provided base class, BaseAnnotator, and implementing three instance methods: setup, annotate, and cleanup.
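The three-method structure of annotator.py can be sketched as follows. This is a minimal, self-contained illustration rather than the Open-CRAVAT implementation: the `BaseAnnotator` defined here is a stand-in so the snippet runs without Open-CRAVAT installed, and the table name `freqs`, its columns, the input keys, and the output key `allele_freq` are assumptions. In a real module the base class is supplied by Open-CRAVAT, and the input and output fields are whatever annotator.yml declares.

```python
import sqlite3


class BaseAnnotator:
    """Stand-in for the base class that Open-CRAVAT provides (assumption)."""

    def run(self, variants):
        # Call the three hooks in order: setup, annotate per variant, cleanup.
        self.setup()
        try:
            return [self.annotate(v) for v in variants]
        finally:
            self.cleanup()


class ExampleFrequencyAnnotator(BaseAnnotator):
    def setup(self):
        # Open the data source once, before any variant is processed.
        # A real module would open a file under data/; this one is in memory.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE freqs "
            "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, af REAL)"
        )
        self.conn.execute(
            "INSERT INTO freqs VALUES ('chr1', 12345, 'A', 'G', 0.25)"
        )

    def annotate(self, input_data):
        # Look up one variant; return its output columns, or None if absent.
        row = self.conn.execute(
            "SELECT af FROM freqs WHERE chrom=? AND pos=? AND ref=? AND alt=?",
            (input_data["chrom"], input_data["pos"],
             input_data["ref_base"], input_data["alt_base"]),
        ).fetchone()
        return {"allele_freq": row[0]} if row else None

    def cleanup(self):
        # Release resources after the last variant.
        self.conn.close()


variants = [
    {"chrom": "chr1", "pos": 12345, "ref_base": "A", "alt_base": "G"},
    {"chrom": "chr2", "pos": 999, "ref_base": "C", "alt_base": "T"},
]
results = ExampleFrequencyAnnotator().run(variants)
print(results)
```

Keeping the database connection in `setup` and `cleanup` means the per-variant `annotate` call does only a single indexed lookup.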
We will aggregate single cell human RNA-seq data from the Allen Brain Atlas for genes of known functional significance in the brain to generate gene expression plots across several brain regions.
Raw RNA-Seq data from the Allen Brain Atlas
The brain regions we are examining are the Anterior Cingulate Cortex (7283 single cells), the Lateral Geniculate Nucleus (1576 single cells), the Medial Temporal Gyrus, and Primary Visual Cortex.
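The per-region summary behind a boxplot can be sketched with the standard library alone. The expression values below are invented for illustration and are not taken from the Allen Brain Atlas; the function name and the choice of the "inclusive" quartile method are also assumptions.

```python
from collections import defaultdict
from statistics import quantiles


def boxplot_stats(cells):
    """Group (region, expression) pairs by region and summarize each group.

    Returns {region: (min, q1, median, q3, max)}.
    """
    by_region = defaultdict(list)
    for region, value in cells:
        by_region[region].append(value)

    stats = {}
    for region, values in by_region.items():
        # quantiles(..., n=4) returns the three quartile cut points.
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        stats[region] = (min(values), q1, median, q3, max(values))
    return stats


# Hypothetical per-cell expression values for one gene in two regions.
cells = [
    ("Anterior Cingulate Cortex", 0.0),
    ("Anterior Cingulate Cortex", 1.0),
    ("Anterior Cingulate Cortex", 2.0),
    ("Anterior Cingulate Cortex", 3.0),
    ("Anterior Cingulate Cortex", 4.0),
    ("Lateral Geniculate Nucleus", 2.0),
    ("Lateral Geniculate Nucleus", 2.0),
    ("Lateral Geniculate Nucleus", 6.0),
]
stats = boxplot_stats(cells)
for region, summary in stats.items():
    print(region, summary)
```

A plotting library such as bokeh can then draw one box per region from these five numbers.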
Example of huntingtin (HTT) expression in two brain regions
Example of APOE expression in two brain regions
Example of boxplot output for a single gene across regions
HGDP allele frequencies
We obtained the HGDP_938.geno file from the Human Genome Diversity Project. In total, there are 938 individuals from 52 populations. Populations were grouped into 7 subsets based on geographical location: African (129), European (159), East Asian (229), Central and South Asian (200), Oceanian (28), Middle Eastern (133), and Native American (63). The alternate allele frequency was calculated for each population subset and written to CSV format for conversion to a SQLite file. In total, the compiled allele frequencies comprise 632,958 variants across the 7 populations.
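The per-group frequency calculation can be sketched as below. This is an illustration under stated assumptions, not the script used for this project: it assumes diploid genotypes already decoded to a count of alternate alleles (0, 1, or 2) with `None` for missing calls, which is not necessarily how HGDP_938.geno encodes them, and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def alt_allele_frequencies(genotypes, sample_groups):
    """Return {group: alternate allele frequency} for one variant.

    genotypes:     per-sample count of alternate alleles (0, 1, 2),
                   or None when the call is missing
    sample_groups: per-sample population group label, in the same order
    """
    alt = defaultdict(int)    # alternate alleles observed per group
    total = defaultdict(int)  # called alleles per group (2 per sample)
    for genotype, group in zip(genotypes, sample_groups):
        if genotype is None:
            continue  # skip missing calls so they do not deflate the frequency
        alt[group] += genotype
        total[group] += 2
    return {group: alt[group] / total[group] for group in total}


# Hypothetical genotypes for six samples at a single variant.
genotypes = [0, 1, 2, None, 1, 0]
groups = ["African", "African", "African",
          "Oceanian", "Oceanian", "Oceanian"]
freqs = alt_allele_frequencies(genotypes, groups)
print(freqs)
```

A group whose calls are all missing is left out of the result rather than reported as zero, so downstream code can distinguish "no data" from "allele not observed".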
HGDP allele frequency file format
HGDP allele frequency columns in OpenCRAVAT GUI
ABraOM allele frequencies
We obtained the file BRaOM_60+_SABE_609_exomes_annotated.gz from http://abraom.ib.usp.br/download/. After minor editing to reduce file size by removing additional data fields, we converted the remaining relevant data fields into a TSV file (ABraOM.tsv). The TSV file was then converted into a SQLite database for use by the Open-CRAVAT framework (abraom.sqlite). We additionally generated several files to interface between Open-CRAVAT and the SQLite table, as described above. Relevant files are included in the abroam folder.
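The TSV-to-SQLite step can be sketched with the standard library. The column names, the table name `abraom`, and the three rows are invented for illustration and do not reflect the actual layout of ABraOM.tsv or abraom.sqlite.

```python
import csv
import io
import sqlite3

# Hypothetical three-row excerpt standing in for the real TSV file.
tsv_text = (
    "chrom\tpos\tref\talt\tallele_freq\n"
    "chr1\t1000\tA\tG\t0.012\n"
    "chr2\t2500\tC\tT\t0.340\n"
    "chr2\t2600\tG\tA\t0.001\n"
)

conn = sqlite3.connect(":memory:")  # pass a file path to write to disk
conn.execute(
    "CREATE TABLE abraom "
    "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, allele_freq REAL)"
)

# Parse the TSV and convert numeric fields before inserting.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = [
    (r["chrom"], int(r["pos"]), r["ref"], r["alt"], float(r["allele_freq"]))
    for r in reader
]
with conn:  # commit on success, roll back on error
    conn.executemany("INSERT INTO abraom VALUES (?, ?, ?, ?, ?)", rows)
    # Index the lookup key so per-variant queries stay fast.
    conn.execute("CREATE INDEX abraom_idx ON abraom (chrom, pos, ref, alt)")

n_rows = conn.execute("SELECT COUNT(*) FROM abraom").fetchone()[0]
hit = conn.execute(
    "SELECT allele_freq FROM abraom WHERE chrom=? AND pos=? AND ref=? AND alt=?",
    ("chr2", 2500, "C", "T"),
).fetchone()
print(n_rows, hit)
```

For a file with over a million variants, reading the TSV in batches rather than building one list in memory would be a reasonable refinement.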
Screenshot of the newly added ABraOM Brazilian allele frequencies
- MacArthur lab: https://github.com/macarthur-lab/gene_lists
  - Drug targets, essential genes, X-linked disease, mode of inheritance, minimum incidental findings
- ImmPort: https://www.immport.org/shared/genelists
- Hallmark gene sets from MSigDB: http://software.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=H
- Network of cancer genes: http://ncg.kcl.ac.uk/
  - Can generate cancer-type specific gene lists using the "Advanced" option
- NetVenn collection of gene sets for humans and animals: https://probes.pw.usda.gov/NetVenn/downloads.php
- SFARI GENE: https://www.sfari.org/resource/sfari-gene/
- Data is messy, even if it is made available through a “reputable” institution
- Data cleaning can be (and most likely will be) time consuming
  - Our recommendation is to budget 2x-3x more time for data processing than you expect to
- Our approach to the work was to split into teams based on individual expertise
  - This was crucial to our success, but developing a more detailed flowchart (e.g. understanding everyone's roles and the data everyone will be using) before splitting would have helped in bringing parts of the project together in the end
- This work would not have been accomplished in the past 3 days without the following packages
  - in Python
    - pandas - a library containing data structures and data analysis tools
    - bokeh - a visualization library
    - PyLiftover - a library for quick and easy conversion of genomic (point) coordinates between different assemblies
  - in R
  - in SQLite
    - DB Browser - a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite