VIROME SNIFF

An NCBI Hackathon project that builds a pipeline for searching next-generation sequencing (NGS) reads against a viral protein database. The tool finds known viral sequences and virus-like proteins, and discovers sequences that match viral protein domains, in any single-genome or metagenome sequence pool. Initial development took place at the New York Genome Center, June 19-21, 2017.

Background
Command Line Interface Usage
Workflow schematics
Sample Input Files
Building Database
Software Dependencies
Resources and references

Background

We aimed to search for viruses in protein space rather than nucleotide space, in order to capture and characterize a larger number of viruses and to detect virus-associated domains in the sample.

The workflow accepts any Illumina-based next-generation sequencing dataset. After adapter trimming, the FASTQ reads are matched directly against a viral protein database in order to filter out all sequences that are not related to viruses. The virus-related reads are then assembled. All successfully assembled contigs are further characterized as known virus proteins, homologous virus proteins, or virus protein domains. Known and homologous virus proteins are quantified and plotted, and a taxonomic classification of those sequences is provided. Finally, the geographical distribution and representation of the samples can be plotted on a map.
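The read-filtering step described above can be illustrated with a minimal Python sketch. It is not the project's own code. It assumes the search results are available as a BLAST-style 12-column tab-separated table (query ID in column 1, e-value in column 11), and that the first whitespace-delimited token of each FASTQ header is the read ID. The function names and the `max_evalue` parameter are hypothetical; the threshold has no default here because the pipeline's default value is not stated.

```python
def viral_read_ids(m8_lines, max_evalue):
    """Return the IDs of reads with at least one hit at or below max_evalue.

    Assumes BLAST-style tabular lines: column 1 is the query (read) ID
    and column 11 is the e-value.
    """
    keep = set()
    for line in m8_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # skip malformed rows
        if float(fields[10]) <= max_evalue:
            keep.add(fields[0])
    return keep


def filter_fastq(fastq_lines, keep_ids):
    """Yield 4-line FASTQ records whose read ID is in keep_ids."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # Header looks like "@read_id optional description"
            read_id = record[0][1:].split()[0]
            if read_id in keep_ids:
                yield record
            record = []
```

Both functions take any iterable of lines, so they work on open file handles as well as on in-memory lists. Only the reads kept by `filter_fastq` would be passed on to assembly.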

Command Line Interface Usage

usage: run_program.sh [-h] -f FASTQ [-r REVERSE FASTQ] -s SAMPLE SHEET -b
                   BARCODES [-e ERROR RATE]

optional arguments:
  -h, --help            show this help message and exit
  -f FASTQ, --Provide single-end or paired-end FASTQ file
  -s SRA, --sample-sheet SAMPLE SHEET
                        Sample table
 
  -e E-value, --provide optional threshold for BLAST; default values are .....

Workflow schematics

Build the viral protein database
Trim adapter sequences from the reads
Detect viral NGS reads using the MMseqs2 k-mer-based algorithm for protein detection
Assemble matched reads using ABySS
Characterize known sequences and their abundances
Visualize viral content, taxonomy, and geographical origin
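As an illustration of the abundance step in the list above, here is a minimal Python sketch that tallies how many contigs are assigned to each viral protein. It is a sketch under assumptions, not the pipeline's implementation: it assumes a BLAST-style 12-column tab-separated hit table (query in column 1, target in column 2, bit score in column 12) and assigns each contig to its single highest-scoring target. The function names are hypothetical.

```python
from collections import Counter


def best_hits(m8_lines):
    """Map each query ID to (target ID, bit score) of its highest-scoring hit."""
    best = {}
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed rows
        query, target, bits = fields[0], fields[1], float(fields[11])
        if query not in best or bits > best[query][1]:
            best[query] = (target, bits)
    return best


def abundance(m8_lines):
    """Return (target, contig count) pairs, most abundant first."""
    counts = Counter(target for target, _ in best_hits(m8_lines).values())
    return counts.most_common()
```

Counting one best hit per contig avoids inflating the totals when a contig matches several related proteins. Whether that is the right choice depends on how the abundances are meant to be interpreted.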

Sample Input Files

FASTQ File: link

Building Database

This is an optional step. We build a database containing multiple different viral sequences by combining three databases using MMseqs2.

Databases used to create our final database

ViralZone (http://viralzone.expasy.org/)

ViPR (https://www.viprbrc.org/brc/home.spg?decorator=vipr)

Viral Genomes (https://www.ncbi.nlm.nih.gov/genome/viruses/)

If a new database compatible with MMseqs2 needs to be created, the following commands can be used:

    mmseqs createdb virus_cluster.fasta MMSEQ_DB
    mmseqs createindex MMSEQ_DB
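Before running createdb, the protein FASTA files from the source databases have to be combined into one file. The Python sketch below shows one simple way to do that, dropping exact duplicate sequences. It is an assumption about how such a merge could look, not the procedure used to produce virus_cluster.fasta, and exact-duplicate removal is a simpler technique than sequence clustering. All function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def merge_unique(*sources):
    """Merge FASTA sources, keeping the first record of each distinct sequence."""
    seen = set()
    merged = []
    for lines in sources:
        for header, seq in read_fasta(lines):
            key = seq.upper()  # compare sequences case-insensitively
            if key and key not in seen:
                seen.add(key)
                merged.append((header, seq))
    return merged


def write_fasta(records, handle, width=60):
    """Write records to a text handle, wrapping sequences at `width` columns."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

Each source can be an open file handle, so the three downloads can be merged in one call and the result written to the FASTA file that is passed to createdb. Recent MMseqs2 releases also expect a temporary directory as a second argument to createindex, so the command may need adjusting for the installed version.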

Software Dependencies

The following software needs to be installed:

-sratoolkit (https://github.com/ncbi/sra-tools/)

-TrimGalore (https://github.com/FelixKrueger/TrimGalore)

-MMseqs2 (https://github.com/soedinglab/MMseqs2)

-SPAdes 3.10.1 (http://spades.bioinf.spbau.ru/release3.10.1/manual.html#sec2)

Resources and references

How to use/run a Docker image
