OVARIE: Open-CRAVAT VARiant Interpretation Expansion

Hackathon Members

Kymberleigh Pagel, Johns Hopkins University, Baltimore MD 21218, kpagel1@jhu.edu
Anna Chang, Johns Hopkins University, Baltimore MD 21205, achang44@jhmi.edu
Zhi Liu, NIH, Bethesda MD 20892, zhi.liu@nih.gov
Summer Rankin, Booz Allen Hamilton, Rockville MD 20852, rankin_summer@bah.com, summer.rankin@fda.hhs.gov
Danielle Rubin, NIH, Bethesda MD 20892, danielle.rubin@nih.gov
Chris Shin, NIH/NIAID, Bethesda MD 20892, chris.shin@nih.gov

Introduction to Open-CRAVAT

Open-CRAVAT is a Python package that performs genomic variant interpretation. Its modular, locally installed command-line and GUI interfaces support annotation of gene- and variant-level impact, interactions, conservation, and scoring. In this work, we extend the platform with new annotation sources to assist in the interrogation of genetic variation.

Example of the Open-CRAVAT interface

Open-CRAVAT Store

As of May 2019, the Open-CRAVAT Store has 65 annotators, converters, and visualization widgets available for use. Annotators range from genomic feature insights to population-level frequencies and clinical interpretations. Converters take input in formats other than VCF (Variant Call Format), such as dbSNP rsID and 23andMe formats, and make them compatible with Open-CRAVAT. There are currently three visualization widgets that can be downloaded: a summary of the top genes ranked by total variants, a haplotype information widget, and an embedded interactive genome visualization component from the Integrative Genomics Viewer (IGV). Items in the store range from 2 KB to 29 GB, and they can be downloaded individually according to the user's research interests.

Home Page for the Open-CRAVAT store

Data Sources Accessible through Open-CRAVAT	Type of Data
1000 Genomes Project	allele frequencies
The Cancer Genome Atlas	genome-wide chromatin accessibility profiles of tumor samples
BioGRID	gene interactions
BRCA1 Multiplex Assay	functional scores for SNVs
Cancer Gene Census (CGC)	gene level data
Cancer Gene Landscape	oncogenes and suppressor genes
CIViC	clinical interpretation of variants in cancers
ClinVar	relationships among human variations and phenotypes
COSMIC	somatic mutations in cancer
dbSNP	single nucleotide substitutions and short deletion and insertion polymorphisms
denovo-db	germline de novo variants
NHLBI GO Exome Sequencing Project (ESP)	exome variants
Essential Genes	genetic variation and mutational burden in human orthologs
ExAC Gene and CNV	probability of loss-of-function intolerance
Flanking Sequence	reference and alternate sequences for flanking bases
Gene Ontology	functions of genes and gene products
GHIS	haploinsufficiency scores
gnomAD	exome and genome sequencing data
GRASP	genome-wide associations between SNPs and phenotypes
GTEx	correlations between genotype and tissue-specific gene expression levels
IntAct	molecular interaction data
Mutation Assessor	prediction of functional impact of amino-acid substitutions in proteins
NCBI Gene	gene descriptions
ncRNA	non-coding RNA at variant location
NDEx	biologic pathways
p(HI)	gene-based haploinsufficiency predictions
P(rec)	rare and likely deleterious loss-of-function alleles
phastCons	phastCons scores for multiple alignments
phyloP	conservation scoring for multiple alignments
Promoter IR	interacting regions of promoters
Pseudogene	annotations generated by GENCODE project
PubMed	number of PubMed articles for a given gene
Repeat Sequences	annotation of repeat regions
RVIS	variation intolerance scoring
SiPhy	conservation scores based on mammal genomes
TARGET	genes directly linked to clinical action
UK10K Cohorts	genetic information from 2 twin studies
UniProt	protein sequence and annotation data
VISTA Enhancer Browser	experimentally validated enhancers

Analysis Tools Available through Open-CRAVAT	Function
CHASMplus	classification of missense mutations as drivers or passengers in human cancers
FATHMM	prediction of functional effects of protein missense mutations
GERP++	quantification of substitution deficits in multiple alignments
IGV	interactive genome visualization
InterPro	functional analysis of proteins
LINSIGHT	model generator for estimation of negative selection on noncoding sequences in the human genome
LoFtool	gene intolerance ranking system
MuPIT	mapping genomic coordinates of SNVs on 3D protein structures
MutPred	classification of amino acid substitution
PhD-SNPg	predictor for pathogenic variants in coding and non-coding regions
REVEL	predictor for pathogenicity of missense variants
VEST	predictor for functional significance of missense mutations based on the probability that they are pathogenic

Installation

For local installation on Mac and Windows see the Quickstart guide here

Hackathon Goals

Link to Hackathon Plan and Workflow Slides

(1) Add sources of single cell RNA-seq expression data

The Allen Brain Atlas includes a gene expression survey of multiple adult control brains that integrates anatomic and genomic information. The dataset includes more than 62,000 gene probes per profile, with around 500 samples per hemisphere across the cerebrum, cerebellum, and brainstem. In this work, we seek to create an Open-CRAVAT annotator that displays whether a given gene is expressed within different regions of the brain. One potential application is supporting the analysis of variants putatively related to autism spectrum disorder (ASD) and other neurodevelopmental diseases, by identifying variants within genes that are expressed in the relevant brain regions.

(2) Incorporate additional representation for under-studied populations

For the Human Genome Diversity Project (HGDP), a group of scientists across several labs at Stanford University analyzed DNA from 1,043 individuals in 51 populations from Africa, Europe, the Middle East, South and Central Asia, East Asia, Oceania, and the Americas. Details on the individuals included in this collection are described in H. Cann et al. Science 296: 261-262 (2002) and its Supplemental Data; Rosenberg et al. Science 298: 2381-2385 (2002); and Rosenberg et al. PLoS Genetics 1: 660-671 (2005).

In particular, Native American and Middle Eastern populations are not well represented in Open-CRAVAT. In addition, several subpopulations evaluated in this work have no representation in either the 1000 Genomes Project or gnomAD, the two sources currently available in the Open-CRAVAT store. Because several subpopulations have small sample sizes (n < 10), we present aggregate per-population allele frequencies, which give more meaningful values.

The Online Archive of Brazilian Mutations (ABraOM) is a repository of genomic variants found in Brazilians, with the goal of giving the community a view of the genetic variability found in Brazil. The initial deposited cohort comprises exomic variants of 609 elderly individuals from a census-based sample from the city of São Paulo. A total of 2,382,573 variants were called before filtering and are available in the ABraOM browser. Of that total, 1,264,224 are high confidence (GATK PASS flags and excluding CEGH-USP FDP/FAB flags), and these are the variants we retain for use in Open-CRAVAT.

(3) Stretch Goal: Identify sources of curated gene lists for gene set enrichment analysis (GSEA)
Gene lists are groups of genes known to be influential in the development and/or maintenance of molecular pathways or diseases. We hope to use these gene lists in the following ways:

developing a module that allows users to see if the variants in their uploaded file correspond to a significant proportion of genes in a list
developing functionality to flag genes that are in already available lists or user-curated lists
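
One way the first idea could be approached is a one-sided hypergeometric test on the overlap between the genes carrying variants and a curated list. The sketch below is only an illustration, not a module that exists in Open-CRAVAT: the function names are hypothetical, and background_size (the number of genes that could have been hit) is an assumption the user has to supply.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are in the list."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def gene_list_enrichment(variant_genes, gene_list, background_size):
    """Return the overlapping genes and a one-sided enrichment p-value.

    background_size is an assumed parameter: the total number of genes
    that could have appeared in the input (it must include the list genes).
    """
    variant_genes = set(variant_genes)
    gene_list = set(gene_list)
    overlap = variant_genes & gene_list
    p_value = hypergeom_sf(
        len(overlap), background_size, len(gene_list), len(variant_genes)
    )
    return sorted(overlap), p_value


if __name__ == "__main__":
    # Toy input, made up for illustration only.
    genes, p = gene_list_enrichment(
        ["BRCA1", "TP53", "APOE"], ["BRCA1", "TP53", "KRAS", "EGFR"], 20000
    )
    print(genes, p)
```

The same overlap set also covers the second idea: any gene returned in the overlap can simply be flagged as belonging to the list.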

Flowchart

To accomplish these tasks, we need to carefully format the data and generate several accessory files necessary for incorporation into Open-CRAVAT.

Components necessary to create an annotator (from Open-CRAVAT wiki)

An Open-CRAVAT annotator consists of a Python file, a YAML file, a data directory, and a markdown file. The file structure is:

annotator/
    |───annotator.md
    |───annotator.yml
    |───annotator.py
    └───data/

annotator.md: The markdown file describes the module to prospective users.

annotator.yml: The YAML file defines the input and output interfaces between an annotator and the rest of Open-CRAVAT. The YAML file specifies what data will be fed to annotator.py, and what data Open-CRAVAT should expect annotator.py to return.

annotator.py: The Python module receives input data describing a single variant or gene and uses it to look up additional information specific to that annotator. An annotator.py works by extending a provided base class, BaseAnnotator, and implementing three instance methods: setup, annotate, and cleanup.
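
The three-method structure might look roughly like the sketch below. The BaseAnnotator defined here is a stand-in so the example runs on its own; a real module would extend the base class supplied by Open-CRAVAT instead. The table name freq, its columns, and the keys of input_data are assumptions made for illustration.

```python
import sqlite3


class BaseAnnotator:
    """Stand-in for the base class supplied by Open-CRAVAT (illustration only)."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path


class ExampleFrequencyAnnotator(BaseAnnotator):
    def setup(self):
        # Open the annotator's SQLite data file once, before any variants arrive.
        self.conn = sqlite3.connect(self.db_path)
        self.cursor = self.conn.cursor()

    def annotate(self, input_data):
        # Look up one variant; the key and column names are hypothetical.
        self.cursor.execute(
            "SELECT af FROM freq WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
            (
                input_data["chrom"],
                input_data["pos"],
                input_data["ref_base"],
                input_data["alt_base"],
            ),
        )
        row = self.cursor.fetchone()
        # Return the output columns as a dict, or None when there is no match.
        return {"af": row[0]} if row else None

    def cleanup(self):
        # Release the database handle after the last variant.
        self.conn.close()
```

Keeping the database connection in setup and cleanup means annotate only runs a single indexed query per variant.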

Progress

Goal 1: Add sources of single cell RNA-seq expression data

We will aggregate single cell human RNA-seq data from the Allen Brain Atlas for genes of known functional significance in the brain to generate gene expression plots across several brain regions.

Raw RNA-Seq data from the Allen Brain Atlas

The brain regions we are examining are the Anterior Cingulate Cortex (7283 single cells), the Lateral Geniculate Nucleus (1576 single cells), the Medial Temporal Gyrus, and Primary Visual Cortex.
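
A per-region summary of the kind a boxplot displays could be computed along the following lines. This is a minimal sketch, not the team's pipeline: it assumes the expression values for one gene have already been extracted as (region, value) pairs, and the function name and output keys are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def summarize_by_region(records):
    """Summarize one gene's expression per brain region.

    records: iterable of (region, expression_value) pairs, one per cell.
    Each region needs at least two cells for quartiles to be defined.
    """
    by_region = defaultdict(list)
    for region, value in records:
        by_region[region].append(value)

    summary = {}
    for region, values in by_region.items():
        q1, q2, q3 = quantiles(values, n=4, method="inclusive")
        summary[region] = {
            "n_cells": len(values),
            "q1": q1,
            "median": q2,
            "q3": q3,
        }
    return summary


if __name__ == "__main__":
    # Toy values, made up for illustration only.
    toy = [("Anterior Cingulate Cortex", v) for v in (0, 1, 2, 3, 4)]
    toy += [("Lateral Geniculate Nucleus", v) for v in (2, 4, 6)]
    print(summarize_by_region(toy))
```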

Example of huntingtin (HTT) expression in two brain regions

Example of APOE expression in two brain regions

Example of boxplot output for a single gene across regions

Goal 2: Incorporate additional representation for under-studied populations

HGDP allele frequencies

We obtained the HGDP_938.geno file from the Human Genome Diversity Project. In total, it covers 938 individuals from 52 populations. We grouped the populations into 7 subsets based on geographical location: African (129), European (159), East Asian (229), Central and South Asian (200), Oceanian (28), Middle Eastern (133), and Native American (63). We calculated the alternative allele frequency for each population subset and wrote the results to a CSV file for conversion to a SQLite file. In total, the compiled allele frequencies cover 632,958 variants across the 7 population subsets.
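
The per-group calculation can be sketched as follows for a single variant. This is an illustration in Python rather than the script used for the project, and it assumes each genotype has already been decoded to a count of alternative alleles (0, 1, or 2), with None for a missing call; the encoding of the .geno file itself is not described here.

```python
from collections import defaultdict


def alt_allele_frequencies(genotypes, sample_groups):
    """Alternative allele frequency per population group for one variant.

    genotypes: dict of sample id -> alt allele count (0, 1, 2) or None if missing.
    sample_groups: dict of sample id -> population group label.
    """
    alt_alleles = defaultdict(int)
    called_alleles = defaultdict(int)
    for sample, count in genotypes.items():
        if count is None:
            continue  # missing calls contribute no alleles
        group = sample_groups[sample]
        alt_alleles[group] += count
        called_alleles[group] += 2  # diploid: two alleles per called sample
    return {g: alt_alleles[g] / called_alleles[g] for g in called_alleles}


if __name__ == "__main__":
    # Toy samples, made up for illustration only.
    genotypes = {"s1": 0, "s2": 1, "s3": 2, "s4": None, "s5": 1}
    groups = {"s1": "African", "s2": "African",
              "s3": "Oceanian", "s4": "Oceanian", "s5": "Oceanian"}
    print(alt_allele_frequencies(genotypes, groups))
```

Pooling allele counts within a group, rather than averaging per-population frequencies, keeps a population with only a handful of samples from carrying the same weight as a large one.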

HGDP source data file format

HGDP allele frequency file format

HGDP allele frequency columns in OpenCRAVAT GUI

ABraOM allele frequencies

We obtained the file BRaOM_60+_SABE_609_exomes_annotated.gz from http://abraom.ib.usp.br/download/. After minor editing to reduce the file size by removing additional data fields, we converted the remaining relevant fields into a TSV file (ABraOM.tsv). The TSV file was then converted into a SQLite database for use by the Open-CRAVAT framework (abraom.sqlite). We additionally generated several files to interface between Open-CRAVAT and the SQLite table, as described above. Relevant files are included in the abraom folder.
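
A TSV-to-SQLite conversion of this kind can be sketched with the Python standard library alone. The column names (chrom, pos, ref, alt, af), the table name, and the function name below are assumptions for illustration; the actual ABraOM.tsv layout may differ.

```python
import csv
import sqlite3


def tsv_to_sqlite(tsv_path, db_path, table="abraom"):
    """Load a tab-separated allele frequency file into a SQLite table.

    Assumes a header row with chrom, pos, ref, alt, and af columns.
    Returns the number of rows inserted.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, af REAL)"
    )
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = [
            (r["chrom"], int(r["pos"]), r["ref"], r["alt"], float(r["af"]))
            for r in reader
        ]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?, ?)", rows)
    # Index the lookup columns so per-variant queries stay fast.
    conn.execute(
        f"CREATE INDEX IF NOT EXISTS idx_{table}_pos ON {table} (chrom, pos)"
    )
    conn.commit()
    conn.close()
    return len(rows)
```

For a file with millions of rows, inserting in batches instead of building one list would keep memory use lower.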

Screenshot of the newly added ABraOM Brazilian allele frequencies

Stretch Goal: Assemble sources of well-curated gene lists

MacArthur lab: https://github.com/macarthur-lab/gene_lists
- Drug targets, essential genes, X-linked disease, mode of inheritance, minimum incidental findings
ImmPort https://www.immport.org/shared/genelists
Hallmark gene sets from MSigDB http://software.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=H
Network of cancer genes http://ncg.kcl.ac.uk/
- Can generate cancer-type specific gene lists using "Advanced" option
NetVenn collection of gene sets for humans and animals https://probes.pw.usda.gov/NetVenn/downloads.php
SFARI GENE https://www.sfari.org/resource/sfari-gene/

Lessons Learned

Data is messy, even if it is made available through a “reputable” institution
Data cleaning can be (and most likely will be) time consuming
- Our recommendation is to budget 2x-3x more time for data processing than you ~~hope~~ expect to
Our approach to the work was to split into teams based on individual expertise
- This was crucial to our success, but developing a more detailed flowchart (e.g. understanding everyone's roles and the data everyone will be using) before splitting would have helped in bringing parts of the project together in the end
This work would not have been accomplished in the past 3 days without the following packages
- in Python
  - pandas - a library containing data structures and data analysis tools
  - bokeh - a visualization library
  - PyLiftover - a library for quick and easy conversion of genomic (point) coordinates between different assemblies.
- in R
  - Seurat - a package designed for QC, analysis, and exploration of single-cell RNA-seq data
  - Matrix - a library containing matrix classes
  - dplyr - a package containing tools for manipulating data frame like objects
- in SQLite
  - DB Browser - a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite
