Sidearm searches the SRA database for viruses using the NCBI magicBLAST tool. It generates a table describing the number of alignments to each virus and various metrics such as the sequence coverage and average depth. The reads aligning to virus are assembled into viral contigs to attempt to generate complete viral genomes.
Clone the repository
git clone https://github.com/NCBI-Hackathons/Virus_Detection_SRA
Install the following:
- magicblast (>= 1.2.0)
- samtools (>= 1.4)
- bioperl (>= 1.7)
- cutadapt (>= 1.12)
- abyss (>= 2.0.2)
- cwltool (>= 1.0.20170309164828)
This example uses an RNA-seq dataset (SRR1553459) from an Ebola virus outbreak. After cloning this repository, do the following:
cd Virus_Detection_SRA/cwl/tools
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/848/505/GCF_000848505.1_ViralProj14703/GCF_000848505.1_ViralProj14703_genomic.fna.gz
gunzip GCF_000848505.1_ViralProj14703_genomic.fna.gz
makeblastdb -dbtype nucl -in GCF_000848505.1_ViralProj14703_genomic.fna -out ebolazaire -parse_seqids
export BLASTDB=$BLASTDB:`pwd`
These steps downloaded the Ebola virus genome and uncompressed it. Using the Ebola virus genome, a BLAST database was created with makeblastdb
. Then your local directory was added to the BLASTDB environmental variable.
sidearm.cwl sidearm.SRR1553459.ebola.yml
This steps runs Sidearm and generates the following primary output files.
- SRR1553459.ebolazaire.bam - magicblast alignments to Ebola virus (sorted BAM file)
- SRR1553459.ebolazaire.bam.summarize.tsv - aggregate information about the alignments to each reference sequence (in this case it is only Ebola virus)
- SRR1553459.ebolazaire.bam.fa.trim.fa.assembly.fa - the contigs generated by the assembly of all sequences that aligned to the reference sequences.
The log files for the trim and assembly modules are also created
- SRR1553459.ebolazaire.bam.fa.trim.log - the trim module logfile
- SRR1553459.ebolazaire.bam.fa.trim.fa.assembly.log - the assembly module logfile
Open the summarize.tsv
file in a spreadsheet program. The number of alignments to Ebola virus should be ~15,000 (column 'aligns'), sequence coverage ~98% (column 'seqcov'), and average depth ~75 (column 'avgdepth'). The longest contig is ~12,500bp and a BLASTN search shows that it is Ebola virus.
The avg (average) fields are the average MAPQ (or Score or EditDist) across the alignments for each subject in the BAM file. MAPQ, Score and EditDist are taken from field 5, the AS flag, and the NM flag in the BAM file, respectively.
cd Virus_Detection_SRA/cwl/tools
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
gunzip viral.1.1.genomic.fna.gz
makeblastdb -dbtype nucl -in viral.1.1.genomic.fna -out viral.1.1.genomic -parse_seqids
export BLASTDB=$BLASTDB:`pwd`
Edit sidearm.SRR1553459.ebola.yml with the following changes and save it as sidearm.SRR073726.viral11genomic.yml:
- srr: SRR073726
- blastdb: viral.1.1.genomic
- path: viral.1.1.genomic.fna
Then execute Sidearm with the command line below. Depending on your computer, this will take about 1 hour.
sidearm.cwl sidearm.SRR073726.viral11genomic.yml
Report of alignments (summarize.tsv
)
id | vname | vlen | seqcov | avgdepth | aligns | avgMAPQ | avgScore | avgEditDist |
---|---|---|---|---|---|---|---|---|
NC_032111.1 | BeAn 58058 virus | 163005 | 0.78 | 6.7 | 51,012 | 255 | 22.7 | 0.34 |
NC_001357.1 | HPV18 | 7857 | 22.2 | 130.7 | 26,067 | 255 | 39.1 | 0.05 |
The longest contig should be ~600bp. However, obtaining this result depends on the software and reference sequences versions that you used. A BLASTN search with the longest contig sequence should show that it is Human Papillomavirus 18.
- Please submit an issue if you run into any problems installing or running this software.