- Overview
- System Requirements
- Installation Guide
- A test dataset to demo ViralCC
- Instruction to process raw data
- Instruction to run ViralCC
- Instruction of reproducing results in ViralCC paper
- Contacts and bug reports
- Copyright and License Information
- Issues
ViralCC is a new open-source metagenomic Hi-C-based binning pipeline to recover high-quality viral genomes. ViralCC not only considers the Hi-C interaction graph, but also puts forward a novel host proximity graph of viral contigs as a complementary source of information to the remarkably sparse Hi-C interaction map. The two graphs are then integrated, and Leiden graph clustering is applied to the integrated graph to generate draft viral genomes.
- If you want to reproduce results in our ViralCC paper, please read our instructions here.
- Scripts to process the intermediate data and plot figures of our ViralCC paper are available here.
ViralCC requires only a standard computer with enough RAM to support the in-memory operations.
ViralCC v1.0.0 is supported and tested on macOS and Linux systems.
ViralCC mainly depends on the Python scientific stack.
numpy
scipy
pysam
scikit-learn
pandas
Biopython
leidenalg
We recommend using conda to install ViralCC.
Typical installation time is 1-5 minutes depending on your system.
git clone https://github.com/dyxstat/ViralCC.git
Once complete, enter the repository folder and then create a ViralCC environment using conda.
cd ViralCC
conda env create -f viralcc_linux_env.yaml
or
conda env create -f viralcc_osx_env.yaml
conda activate ViralCC_env
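After activating the environment, it can be useful to confirm that the listed dependencies are importable. The following is a minimal sketch, not part of ViralCC itself; the helper name `check_modules` is hypothetical, and the import names assume the usual conventions (scikit-learn is imported as `sklearn`, Biopython as `Bio`).

```python
from importlib.util import find_spec

# Import names for the packages listed above. Note that scikit-learn is
# imported as "sklearn" and Biopython as "Bio".
REQUIRED_MODULES = ["numpy", "scipy", "pysam", "sklearn", "pandas", "Bio", "leidenalg"]


def check_modules(names=REQUIRED_MODULES):
    """Return {module_name: bool} telling whether each module can be located."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
```

Any module reported as missing suggests the conda environment was not created or activated correctly.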
We provide a small simulated dataset, located under the Test directory, to demo and test the software:
Test/final.contigs.fa
Test/MAP_SORTED.bam
Test/viral_contigs.txt
Run ViralCC on the testing dataset:
python ./viralcc.py pipeline -v Test/final.contigs.fa Test/MAP_SORTED.bam Test/viral_contigs.txt Test/out_test
The expected run time for the demo is several seconds, and the expected output files are in the 'Test/out_test' directory:
Test/out_test/cluster_viral_contig.txt
Test/out_test/prokaryotic_contig_info.csv
Test/out_test/VIRAL_BIN/VIRAL_BIN0000.fa
Test/out_test/VIRAL_BIN/VIRAL_BIN0001.fa
Test/out_test/viralcc.log
Test/out_test/viral_contig_info.csv
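To confirm that the demo run produced the files listed above, a small check such as the following may help. It is a sketch only: the function name `missing_outputs` is hypothetical, and the list of expected files is copied from this demo, so other datasets will generally yield a different number of bins.

```python
from pathlib import Path

# Relative paths taken from the expected-output list of the demo run.
EXPECTED_OUTPUTS = [
    "cluster_viral_contig.txt",
    "prokaryotic_contig_info.csv",
    "viral_contig_info.csv",
    "viralcc.log",
    "VIRAL_BIN/VIRAL_BIN0000.fa",
    "VIRAL_BIN/VIRAL_BIN0001.fa",
]


def missing_outputs(out_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected files that are absent from out_dir."""
    out_dir = Path(out_dir)
    return [rel for rel in expected if not (out_dir / rel).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("Test/out_test")
    print("all expected files present" if not missing else f"missing: {missing}")
```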
Follow the instructions in this section to process the raw shotgun and Hi-C data and generate the input for ViralCC:
Adaptor sequences are removed by bbduk from the BBTools suite with parameters ktrim=r k=23 mink=11 hdist=1 minlen=50 tpe tbo, and reads are quality-trimmed using bbduk with parameters trimq=10 qtrim=r ftm=5 minlen=50. Additionally, the first 10 nucleotides of Hi-C reads are trimmed by bbduk with parameter ftl=10. Identical PCR optical and tile-edge duplicates for Hi-C reads are removed by the script clumpify.sh from the BBTools suite.
For the shotgun library, a de novo metagenome assembly is produced by assembly software such as MEGAHIT.
megahit -1 SG1.fastq.gz -2 SG2.fastq.gz -o ASSEMBLY --min-contig-len 1000 --k-min 21 --k-max 141 --k-step 12 --merge-level 20,0.95
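As a quick illustration of what the `--k-min 21 --k-max 141 --k-step 12` options in the command above imply, the sketch below enumerates the k-mer sizes, assuming the assembler steps from the minimum to the maximum in fixed increments. The helper `kmer_sizes` is hypothetical and only does the arithmetic; it does not call MEGAHIT.

```python
def kmer_sizes(k_min, k_max, k_step):
    """k-mer sizes from k_min to k_max (inclusive) in increments of k_step."""
    return list(range(k_min, k_max + 1, k_step))


ks = kmer_sizes(21, 141, 12)
print(ks)       # [21, 33, 45, 57, 69, 81, 93, 105, 117, 129, 141]
print(len(ks))  # 11
```

With these settings the upper bound 141 falls exactly on the grid (21 + 10 × 12), and every k is odd.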
Hi-C paired-end reads are aligned to the assembled contigs using DNA mapping software such as BWA MEM. Then, samtools with parameters 'view -F 0x904' is applied to remove unmapped reads, supplementary alignments, and secondary alignments. The BAM file must then be sorted by name using 'samtools sort -n'.
bwa index final.contigs.fa
bwa mem -5SP final.contigs.fa hic_read1.fastq.gz hic_read2.fastq.gz > MAP.sam
samtools view -F 0x904 -bS MAP.sam > MAP_UNSORTED.bam
samtools sort -n MAP_UNSORTED.bam -o MAP_SORTED.bam
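The value `0x904` passed to `samtools view -F` is the bitwise OR of three SAM flag bits, one for each kind of record being excluded. The short sketch below spells out that arithmetic; the helper `is_kept` is hypothetical and only mimics the exclusion rule for a single flag value.

```python
# SAM flag bits, as defined in the SAM format specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

EXCLUDE_MASK = UNMAPPED | SECONDARY | SUPPLEMENTARY
assert EXCLUDE_MASK == 0x904


def is_kept(flag, mask=EXCLUDE_MASK):
    """True if a record with this flag would pass `samtools view -F <mask>`."""
    return (flag & mask) == 0


print(is_kept(65))    # True:  paired, first in pair
print(is_kept(69))    # False: paired, first in pair, unmapped
print(is_kept(2113))  # False: paired, first in pair, supplementary
```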
Assembled contigs are screened by viral sequence detection software, such as VirSorter, to identify viral contigs.
wrapper_phage_contigs_sorter_iPlant.pl -f final.contigs.fa --db 1 --wdir virsorter_output --data-dir virsorter-data
python ./viralcc.py pipeline [Parameters] FASTA_file BAM_file VIRAL_file OUTPUT_directory
--min-len: Minimum acceptable contig length (default 1000)
--min-mapq: Minimum acceptable alignment quality (default 30)
--min-match: Accepted alignments must be at least N matches (default 30)
--min-k: Lower bound of k for determining the host proximity graph (default 4)
--random-seed: Random seed for the Leiden clustering (default 42)
--cover (optional): Overwrite existing files. Without this option, an error is returned if the output file already exists.
-v (optional): Verbose output about more specific details of the ViralCC procedure.
- FASTA_file: a fasta file of the assembled contigs (e.g. Test/final.contigs.fa)
- BAM_file: a bam file of the Hi-C alignment (e.g. Test/MAP_SORTED.bam)
- VIRAL_file: a txt file containing the names of identified viral contigs in one column without header (e.g. Test/viral_contigs.txt)
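Before running the pipeline, it may be worth checking that every name in the viral contig list actually occurs in the FASTA file. The sketch below is not part of ViralCC; both function names are hypothetical, and it assumes that a contig name is the first whitespace-delimited token of a FASTA header line.

```python
def fasta_names(fasta_path):
    """Collect sequence names (first token after '>') from a FASTA file."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:  # skip empty header lines
                    names.add(tokens[0])
    return names


def unknown_viral_contigs(fasta_path, viral_path):
    """Return names listed in the viral file that are absent from the FASTA file."""
    known = fasta_names(fasta_path)
    with open(viral_path) as handle:
        listed = [line.strip() for line in handle if line.strip()]
    return [name for name in listed if name not in known]


if __name__ == "__main__":
    print(unknown_viral_contigs("Test/final.contigs.fa", "Test/viral_contigs.txt"))
```

An empty list means every listed viral contig was found among the assembled contigs.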
- VIRAL_BIN: folder containing the fasta files of draft viral bins
- cluster_viral_contig.txt: clustering results with 2 columns, the first is the viral contig name, and the second is the group number.
- viral_contig_info.csv: information of viral contigs with three columns (contig name, contig length, and GC-content)
- prokaryotic_contig_info.csv: information of non-viral contigs with three columns (contig name, contig length, and GC-content)
- viralcc.log: log file of ViralCC
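The clustering file can be turned into a mapping from group number to contig names with a few lines of Python. This is a sketch under an assumption: the column delimiter is not stated here, so the code accepts tabs, spaces, or commas. The function name `read_clusters` is hypothetical.

```python
import re
from collections import defaultdict


def read_clusters(path):
    """Map group number (kept as a string) -> list of viral contig names."""
    groups = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            # Assumed delimiters: any run of commas and/or whitespace.
            fields = re.split(r"[,\s]+", line.strip())
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            contig, group = fields[0], fields[1]
            groups[group].append(contig)
    return dict(groups)


if __name__ == "__main__":
    for group, contigs in read_clusters("Test/out_test/cluster_viral_contig.txt").items():
        print(group, len(contigs))
```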
python ./viralcc.py pipeline -v final.contigs.fa MAP_SORTED.bam viral_contigs.txt out_directory
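When ViralCC is called from a script, the command line can be assembled programmatically. The sketch below builds the argument list from the parameters documented above, using their stated defaults; the function name `build_viralcc_command` is hypothetical, and it assumes each option takes its value as a separate, space-delimited argument.

```python
import shlex


def build_viralcc_command(fasta, bam, viral, out_dir, min_len=1000, min_mapq=30,
                          min_match=30, min_k=4, random_seed=42,
                          cover=False, verbose=False):
    """Return the ViralCC pipeline command as a list of arguments."""
    cmd = [
        "python", "./viralcc.py", "pipeline",
        "--min-len", str(min_len),
        "--min-mapq", str(min_mapq),
        "--min-match", str(min_match),
        "--min-k", str(min_k),
        "--random-seed", str(random_seed),
    ]
    if cover:
        cmd.append("--cover")
    if verbose:
        cmd.append("-v")
    # Positional arguments come last, in the documented order.
    cmd += [fasta, bam, viral, out_dir]
    return cmd


cmd = build_viralcc_command("final.contigs.fa", "MAP_SORTED.bam",
                            "viral_contigs.txt", "out_directory", verbose=True)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```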
If you have any questions or suggestions, please contact Yuxuan Du (yuxuandu@usc.edu).