Skip to content

Latest commit

 

History

History
153 lines (127 loc) · 7.18 KB

File metadata and controls

153 lines (127 loc) · 7.18 KB

HIPS: Hiding In Plain Site

Unannotated structural variants in public genomic data sets

Contributors:

Rebecca Torene, Alexandrea Stylianou, Jitong Cai, Ariel Gershman, Alexandra Weber, and Nancy Hansen

With thanks to former UMUC Bioinformatics master program students:

  • Mary Schramko
  • Upasana Pandey
  • Eric Keller
  • Mike Fox
  • Al Koroma
  • David Jalali

Premise:

Public data, such as gnomAD, makes VCFs of short nucleotide variants (SNVs) available. There is anecdotal evidence that some apparent “insertions” are actually part of larger structural variants

Goal:

Use VCFs of SNVs to call structural variants. In doing so, can we identify SVs that have otherwise escaped detection.

Description:

Large, aggregate, genomics data sets of short nucleotide variants (SNVs) such as gnomAD/ExAC are an invaluable resource to the research and clinical genetics communities. Such data sets are often used as a control to determine the frequencies of variants in unaffected populations and gnomAD has recently released a structural variant call set on a subset of their samples. There is evidence, however, that additional SVs exist in the data, but haven't yet been annotated. For example, this insertion is actually indicative of a known deletion. Such anecdotal findings of deletions, processed pseudogenes, inversions, tandem duplications, and mobile element insertions exist in publicly available data, hiding there in plain sight.

Without individual alignment files, the broader genomics community cannot evaluate structural variants in the complete data set. We, therefore, developed HIPS, a structural variant caller that uses VCF of SNVs as input and then outputs structural variant calls in bed format. HIPS starts from VCF, parses insertion sequences and converts to fastA, aligns by BLAST, and filters BLAST results for possible structural variants. Running the snakefile then merges similar calls and adds annotation. The Shiny app takes gnomAD data processed by HIPS and provides a web interface.

Suggested Future Improvements:

This project may be on GitHub and it may have a great name (HIPS), but we cannot deny the fact that this is the product of a 3-day hackathon and that it is imperfect. We welcome the GitHub community to make improvements and suggest the following as future improvements:

  • fastaToBed.py can be modified to split VCFs by position (already splits by chromosome) so that parallelization is more efficient
  • fastaToBed.py arguments could be refined (e.g., meiOnly could be boolean flag instead of string input)
  • stitch together fastaToBed.py and snakefile to have a seamless pipeline from VCF to annotated SV calls
  • Add more flexibility for file names, species, etc.
  • Option to output SV calls as VCF (currently outputs BED)
  • Add a low quality flag for SVs based on bitscore and based on overlap of breakpoints with simple repeats

flowchart

Web Interface

hosted at https://hidinginplainsight.shinyapps.io/mergedapps/

Graph View

graph_view

Table View

graph_view


Results from applying to gnomAD SNV VCFs:

counts


Getting Started

For front end development - download R and Rstudio - more info here

Once Rstudio is installed, import:

shiny
dplyr
shinyWidgets
devtools

To host the shiny app we will use:

https://www.shinyapps.io

For back end development - follow the steps below to create your environment and install an editor.

I prefer the free version of Pycharm, found here

  1. Download miniconda and choose Python 3.7 64-bit installer.
After downloading it make sure to open a new terminal and test with `which conda`.

For example:

MDMBASTYLIANOU:~ astylianou900045$ which conda
/Users/astylianou900045/miniconda3/bin/conda
  1. Add conda channels - do this to get all the packages needed for this project:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Then check that everything installed correctly with: conda config --show channels

For example:

MDMBASTYLIANOU:~ astylianou900045$ conda config --show channels
channels:
  - conda-forge
  - bioconda
  - defaults

Now that conda is installed, clone the repo

3. git clone https://github.com/NCBI-Hackathons/Hiding-in-plain-sight-unannotated-structural-variants-in-public-genomic-data-sets.git

Cloning will make a new directory. After it is cloned, enter that directory with 'cd'

For example:

4. cd Hiding-in-plain-sight-unannotated-structural-variants-in-public-genomic-data-sets

Time to create and activate the environment for this project:

  1. conda env create -f environment.yml

  2. conda activate sv_env

You will know that it was activated because sv_env will show in your command prompt, for example:

(sv_env) MDMBASTYLIANOU:~ astylianou900045$
  1. Run test data in MEI-only mode. Should create a bed file with MEI calls.
python /path/to/repo/vcfToBed.py \
-inFile /path/to/repo/test/test.vcf.gz \
-outFile out.fa -chr 21 \
-dir /path/to/repo/resources -meiOnly True

Where:

  • -inFile is a gzip'ed VCF with SNV calls in hg19.
  • -outFile is the name of the output for the fastA
  • -chr is the chromosome to work on (without leading 'chr')
  • -dir is the directory where BLAST dictionaries are located. The MEI BLAST dictionary is located in the resources directory of this repo
  • -meiOnly is whether to detect only MEI structural variants.

Running on other SV types (deletions, inversions, and tandem duplications)

  1. Download human reference sequences for each chromosome for hg19 ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/README.txt

  2. Make BLAST database for each chromosome

for f in *.fa.gz; do gunzip -k $f; done
for f in *.fa; do makeblastdb -in $f -dbtype nucl; done
  1. Make sure your MEI BLAST database and chromosome BLAST databases are in the same directory

  2. Run test data

python /path/to/repo/vcfToBed.py \
-inFile /path/to/repo/test/test.vcf.gz \
-outFile out.fa -chr 21 \
-dir /path/to/repo/resources -meiOnly ""