Prophage Tracer: Precisely tracing prophages in prokaryotic genomes using overlapping split-read alignment.
Based on our analysis, in order to detect prophages with low excision rate, 100–1000× sequencing depth for a genome is recommended. At this range of sequencing depth, Prophage Tracer can detect the hidden prophages with excision rates (attB/gyrB) >
It should be noted that some users reported that blast+ 2.16.0 on Ubuntu may generate error and exit the shell script when using makeblastdb. However, this bulit DB is ok for downstream command. Therefore, I have updated the shell script to ignore this error message and continue with the following command. I also suggest to install blast+ 2.6.0 as mentioned in "System and software requirements".
- Linux (Tested in CentOS 6.8 and CentOS Linux release 7.8.2003 (GNU Awk 4.0.2))
- blastn: 2.6.0+, Download ncbi-blast-2.6.0+-x64-linux.tar.gz for linux system.
- bwa 0.6
- sambamba 0.8.0
- samtools 1.10
- Shovill 1.1.0 for assembling genomes. It is useful for detecting prophages assembled into their own separate contigs in contig-level genomes. In this case, the average sequencing depth of prophage-derived contigs is usually but not necessarily significantly higher than other contigs. The depth is written into the name of each contig in the output of Shovill.
- Trimmomatic 0.39 for removing low-quality regions and adapters in reads.
- Just download the shell scripts prophage_tracer.sh and prophage_tracer.sh to your working directory
- Install required softwares through Conda
wget -c https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
export PATH=~/miniconda2/bin:$PATH
conda install -c bioconda bwa
conda install -c bioconda sambamba
conda install -c bioconda samtools
Assume that you have a sequenced bacterium (strain1) genome in FASTA format reference_genome_strain1.fasta
, and paried reads in FASTQ format 1.fastq.gz
and 2.fastq.gz
- Align reads to the reference genome
bwa index reference_genome_strain1.fasta -p strain1
bwa mem strain1 1.fastq.gz 2.fastq.gz >strain1.sam
- Remove duplicate (this step is not necessary but will improve the output)
samtools view -S -b strain1.sam -o strain1.bam
sambamba markdup -r strain1.bam strain1.rmdup.bam
samtools view strain1.rmdup.bam -o strain1.rmdup.sam
bash prophage_tracer.sh -m strain1.rmdup.sam -r reference_genome_strain1.fasta -p strain1
usage: prophage_tracer [options] -m <in.sam> -r <in.fasta> -p <prefix>
options:
-m FILE a full SAM file (required)
-r FILE a reference genome sequence (required)
-p STRING prefix of output files (required; usually a strain name or a sample name)
-x INT maximal size of a prophage (default: 150000)
-n INT minimal size of a prophage (default: 5000)
-a INT minimal length of attchment site (default: > 2)
-t INT number of threads used for BlastN (default: 1)
-s INT minimal event of split reads required for supporting a prophage candidate (default: 1)
-d INT minimal event of discordant read pairs required for supporting a prophage candidat (default: 1)
- Find result in
strain1.prophage.out
prophage_candidate | contig | attL_start | attL_end | attR_start | attR_end | prophage_size | SR_evidence_attB | SR_evidence_attP | DRP_evidence_attB | DRP_evidence_attP |
---|---|---|---|---|---|---|---|---|---|---|
candidate_1 | contig00007=::=contig00014 | 209162 | 209236 | 2365 | 2439 | 16770 | 0 | 4 | 1 | 2 |
candidate_2 | contig00001 | 1064123 | 1064145 | 1100156 | 1100178 | 36033 | 0 | 1 | 0 | 0 |
candidate_3 | =contig00003::=contig00004 | 1700 | 1764 | 46895 | 46959 | 48658 | 2 | 28 | 2 | 24 |
- If a single contig was given in the
contig
column, it means an intact predicted prophage is in this contig. - "::" indicates a predicted prophage is separated into two or more contigs. "=" indicates that the 5' end or 3' end of a contig contains a part of a prophage. "=contig00003" indicates the 5' end of contig00003 while "contig00003=" indicates the 3' end of contig00003.
- "SR_evidence_attB/attP" indicate split read counts support attB or attP of the predicted prophage. "DRP_evidence_attB/attP" indicate discordant read pair counts support attB or attP of the predicted prophage.
- Prophage Phm2 (candidate_2) locates on contig00001 (1064123-1100178) with attL (1064123-1064145) and attR (1100156-1100178). Prophage Phm1 (candidate_1) is separated into two or more contigs and consists at least 3' end of contig00007 (att site: 209162-209236 on contig00007) and 5' end of contig00014 (att site: 2365-2439 on contig00014). Prophage Phm3 (candidate_3) is seperated into two or more contigs and consist at least 5' end of contig00003 (att site: 1700-1764 on contig00003) and 5' end of contig00014 (att site: 46895-46959 on contig00014).
prophage_tracer.sh
is used for chromosome-level genomes.prophage_tracer_WGS.sh
can be used for chromosome-level and contig-level genomes. However, using prophage_tracer_WGS.sh for analysis of chromosome-level genomes would be slow.- If a prophage is separated into two or more contigs, the predicted prophage size might be smaller than the true size.
- If you have generated a SAM file using samtools, our script can be equally used to predict prophages on Windows 10 with Git Bash and blastn (
ncbi-blast-2.6.0+-win64.exe
) installed. For installing Git Bash and blastn in Windows 10, please refer to https://git-scm.com/downloads and https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.6.0/.
genome file: GCF_009846525.1_ASM984652v1_genomic.fna reads files: test_R1.fq.gz and test_R2.fq.gz
1.Mapping reads to the genome
bwa index GCF_009846525.1_ASM984652v1_genomic.fna -p test-strain
bwa mem test-strain test_small_R1.fq.gz test_small_R2.fq.gz test-strain.sam
2.Run prophage_tracer
bash prophage_tracer.sh -m test-strain.sam -r GCF_009846525.1_ASM984652v1_genomic.fna -p test-strain
3.Output of test
prophage_candidate contig attL_start attL_end attR_start attR_end prophage_size SR_evidence_attB SR_evidence_attP DRP_evidence_attB DRP_evidence_attP
candidate_1 NZ_CP024621.1 2090511 2090576 2139945 2140010 49434 0 1 0 1
#### Using `generate_DNA.sh` for generating simulated genomes resulted from prophage excision
------
Install `seqkit` first
```Bash
conda install -c bioconda seqkit
Download generate_DNA.sh
and random_DNA.py
Run script (default: simulating 20 genomes; one prophage in each genome)
bash generate_DNA.sh
- Adding a parameter of
blastn
to set whether mismatch were allowed in the att sites (-penalty). - To make the script to be able to analyze single read sequencing data.
Version 1.0.3: ignore the error message caused by makeblastdb and continue with the following command
Kaihao Tang, khtang@scsio.ac.cn; Xiaoxue Wang, xxwang@scsio.ac.cn; Marine Biofilm Lab; SCSIO, Chinese Academy of Sciences