Vclust is an alignment-based tool for fast and accurate calculation of Average Nucleotide Identity (ANI) between complete or metagenomically-assembled viral genomes. The tool also performs ANI-based clustering of genomes according to standards recommended by international virus consortia, including International Committee on Taxonomy of Viruses (ICTV) and Minimum Information about an Uncultivated Virus Genome (MIUViG).
- Features
- Requirements
- Installation
- Quick Start
- Input data
- Usage
- Use cases
- Limitations
- Tests
- Citation
- License
Vclust uses a Lempel-Ziv-based pairwise sequence aligner (LZ-ANI) for ANI calculation. LZ-ANI achieves high sensitivity in detecting matched and mismatched nucleotides, ensuring accurate ANI determination. Its efficiency comes from a simplified indel handling model, making LZ-ANI magnitudes faster than alignment-based tools (e.g., BLASTn, MegaBLAST) while maintaining comparable accuracy to the most sensitive BLASTn searches.
Vclust offers multiple similarity measures between two genome sequences:
- ANI: The number of identical nucleotides across local alignments divided by the total length of the alignments.
- Global ANI (gANI): The number of identical nucleotides across local alignments divided by the length of the query/reference genome.
- Total ANI (tANI): The number of identical nucleotides between query-reference and reference-query genomes divided by the sum length of both genomes. tANI is equivalent to the VIRIDIC's intergenomic similarity.
- Coverage (alignment fraction): The proportion of the query/reference sequence aligned with the reference/query sequence.
- Number of local alignments: The count of individual alignments found between the sequences.
- Ratio between genome lengths: Ratio by dividing the length of the shorter sequence by the length of the longer sequence.
Vclust provides six clustering algorithms tailored to various scenarios, including taxonomic classification and dereplication of viral genomes.
- Single-linkage
- Complete-linkage
- UCLUST
- CD-HIT (Greedy incremental)
- Greedy set cover (adopted from MMseqs2)
- Leiden algorithm [optional]
Vclust uses three efficient C++ tools - Kmer-db, LZ-ANI, Clusty - for prefiltering, aligning, calculating ANI, and clustering viral genomes. This combination enables the processing of millions of virus genomes within a few hours on a mid-range workstation.
For datasets containing up to 1000 viral genomes, Vclust is available at http://www.vclust.org.
Vclust requires Python 3.7 or higher.
To install Vclust, you can either download the pre-compiled binaries, install from Bioconda, or compile the dependencies from source. The compilation process typically takes a few minutes.
The quickest way to get started is by downloading prebuilt binaries from the Releases tab. These binaries include the Leiden algorithm by default. Select your platform and download the tool.
Vclust is also available on Bioconda.
TODO
make
g++
version 11 or highercmake
version 3.12 or higher
The default installation of Vclust includes all functionalities except for the Leiden clustering algorithm.
git clone --recurse-submodules https://github.com/refresh-bio/vclust
cd vclust
make -j
Vclust provides igraph's implementation of the Leiden algorithm. However, because igraph requires several external dependencies (CMake 3.18, Flex, Bison), it is not integrated with Vclust by default. To install these dependencies under Debian/Ubuntu Linux, use the following command:
sudo apt-get install cmake flex bison
Then, build Vclust with Leiden algorithm support:
git clone --recurse-submodules https://github.com/refresh-bio/vclust
cd vclust
make -j LEIDEN=true
Follow these steps to quickly run Vclust on the provided example genomes. The process takes just a few seconds.
- Prefilter similar genome sequence pairs before conducting pairwise alignments.
./vclust.py prefilter -i example/multifasta.fna -o fltr.txt
- Align similar genome sequence pairs and calculate pairwise ANI measures.
./vclust.py align -i example/multifasta.fna -o ani.tsv --filter fltr.txt
- Cluster genome sequences based on given ANI measure and minimum threshold.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --metric ani --ani 0.95
Vclust accepts a single FASTA file containing viral genomic sequences (example) or a directory of FASTA files (one genome per file) (example). The input file(s) can be gzipped.
Vclust provides three commands: prefilter
, align
, and cluster
. Calls to these commands follow the structure:
./vclust.py command -i <input> -o <output> [options]
The prefilter
command creates a pre-alignment filter, which eliminates dissimilar genome pairs before calculating pairwise alignments. This process reduces the number of potential genome pairs to only those with sufficient k-mer-based sequence similarity. The k-mer-based sequence similarity between two genomes is controlled by two options: minimum number of common k-mers (--min-kmers
) and minimum sequence identity of the shorter sequence (--min-ident
). Both k-mer-based similarities are computed using Kmer-db 2 which evaluates the entire set of k-mers, overcoming the sampling constraints typical of sketching methods like FastANI and Mash. In addittion, the use of sparse distance matrices enables memory-efficient processing of millions of genomes.
By default, the prefilter
command creates a pre-alignment filter with genome pairs that have at least 30 common 25-mers and 70% of identity over the shorter sequence.
./vclust.py prefilter -i genomes.fna -o fltr.txt
The sequence identity calculated in the prefilter step is higher than ANI value from the alignment step. This means you can safely set the --min-ident
parameter to a value close to the final ANI threshold without losing relevant genome pairs during the prefiltering process. For example, if your goal is to identify genome pairs with an ANI threshold of 95% or higher, you can set --min-ident
to approximately 0.95.
# Create a pre-alignment filter with genome sequence pairs that have
# at least 30 25-mers in common and 95% of identity over the shorter sequence.
./vclust.py prefilter -i genomes.fna -o fltr.txt --k 25 --min-kmers 30 --min-ident 0.90
To reduce memory consumption, large sets of genome sequences can be processed in smaller, equally-sized, batches of sequences.
# Process genomes in batches of 5 million sequences each.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 5000000
The align
command performs pairwise sequence alignments among viral genomes and provides similarity measures like ANI and coverage (alignment fraction). It uses the LZ-ANI aligner, which, like BLAST-based methods, finds multiple local alignments (similar to HSPs in BLAST) between two genomic sequences to estimate ANI and other sequence similarity measures for that genome pair.
# Align genome pairs filtered by the prefiltering command.
./vclust.py align -i genomes.fna -o ani.tsv --filter fltr.txt
The align
command allows for further filtering based on filter's minimum threshold.
# Align only the genome pairs with sequence identity greater than or equal to 80%.
./vclust.py align -i genomes.fna -o ani.tsv --filter fltr.txt --filter-threshold 0.8
Without --filter
Vclust aligns all possible genome pairs.
./vclust.py align -i genomes.fna -o ani.tsv
The align
command creates two TSV files: one with ANI values for genome pairs and another with genome identifiers sorted by decreasing sequence length. Both TSV files can be used as input for Vclust's clustering.
The following command will create two TSV files: ani.tsv and ani.ids.tsv:
./vclust.py align -i genomes.fna -o ani.tsv --filter fltr.txt
Sample output:
qidx ridx query reference tani gani ani qcov rcov num_alns len_ratio
9 8 NC_010807.alt3 NC_010807.alt2 0.972839 0.960192 0.986657 0.973177 0.997608 60 0.9836
8 9 NC_010807.alt2 NC_010807.alt3 0.972839 0.985279 0.987642 0.997608 0.973177 67 0.9836
10 8 NC_010807.alt1 NC_010807.alt2 0.987250 0.987041 0.987117 0.999923 0.999901 34 0.9571
8 10 NC_010807.alt2 NC_010807.alt1 0.987250 0.987449 0.987547 0.999901 0.999923 36 0.9571
11 8 NC_010807.ref NC_010807.alt2 0.989807 0.989540 0.989617 0.999923 1.000000 14 0.9571
8 11 NC_010807.alt2 NC_010807.ref 0.989807 0.990063 0.990063 1.000000 0.999923 14 0.9571
10 9 NC_010807.alt1 NC_010807.alt3 0.979963 0.993250 0.994557 0.998686 0.972575 71 0.9730
9 10 NC_010807.alt3 NC_010807.alt1 0.979963 0.967035 0.994304 0.972575 0.998686 70 0.9730
11 9 NC_010807.ref NC_010807.alt3 0.983839 0.997166 0.997217 0.999948 0.974230 52 0.9730
9 11 NC_010807.alt3 NC_010807.ref 0.983839 0.970871 0.996552 0.974230 0.999948 52 0.9730
11 10 NC_010807.ref NC_010807.alt1 0.997462 0.997475 0.997475 1.000000 1.000000 23 1.0000
10 11 NC_010807.alt1 NC_010807.ref 0.997462 0.997449 0.997449 1.000000 1.000000 23 1.0000
To manage the output TSV file size, Vclust offers three formats: standard
, lite
, complete
.
./vclust.py align -i genomes.fna -o ani.tsv --filter fltr.txt --outfmt lite
Column | Standard | Lite | Complete | Description |
---|---|---|---|---|
qidx | + | + | + | Index of query sequence |
ridx | + | + | + | Index of reference sequence |
query | + | - | + | Identifier (name) of query sequence |
reference | + | - | + | Identifier (name) of reference sequence |
tani | + | + | + | total ANI [0-1] |
gani | + | + | + | global ANI [0-1] |
ani | + | + | + | ANI [0-1] |
qcov | + | + | + | Query coverage (aligned fraction) [0-1] |
rcov | + | + | + | Reference coverage (aligned fraction) [0-1] |
num_alns | + | + | + | Number of local alignments |
len_ratio | + | + | + | Length ratio between shorter and longer sequence [0-1] |
qlen | - | - | + | Query sequence length |
rlen | - | - | + | Reference sequence length |
nt_match | - | - | + | Number of matching nucleotides across alignments |
nt_mismatch | - | - | + | Number of mismatching nucleotides across alignments |
The align
command enables filtering output by setting minimum thresholds for similarity measures. This ensures only genome pairs meeting the thresholds are reported, significantly reducing the output TSV file size.
# Output only genome pairs with ANI ≥ 0.95 and query coverage ≥ 0.85.
./vclust.py align -i genomes.fna -o ani.tsv --filter fltr.txt --out-ani 0.95 --out-qcov 0.85
Vclust can output alignment details in a separate TSV file. This output format is similar to the BLASTn tabular output and includes information on each local alignment between two genomes, such as the coordinates in both the query and reference sequences, strand orientation, the number of matched and mismatched nucleotides, and the percentage of sequence identity.
./vclust.py align -i genomes.fna -o ani.tsv --aln ani.aln.tsv
Sample output:
query reference pident alnlen qstart qend rstart rend nt_match nt_mismatch
NC_025457.alt2 NC_025457.ref 89.2893 999 22119 23117 14207 15163 892 107
NC_025457.alt2 NC_025457.ref 89.8305 826 3373 4198 2202 3020 742 84
NC_025457.alt2 NC_025457.ref 91.0804 796 41697 42492 27680 28475 725 71
NC_025457.alt2 NC_025457.ref 87.2483 745 38039 38783 24969 25688 650 95
NC_025457.alt2 NC_025457.ref 89.8860 702 7269 7970 5077 5778 631 71
NC_025457.alt2 NC_025457.ref 93.2081 692 62572 63263 41329 42020 645 47
NC_025457.alt2 NC_025457.ref 90.9565 575 31121 31695 20438 21003 523 52
NC_025457.alt2 NC_025457.ref 90.6195 565 11476 12040 7999 8563 512 53
NC_025457.alt2 NC_025457.ref 91.6211 549 10905 11453 7455 8003 503 46
NC_025457.alt2 NC_025457.ref 86.7041 534 29624 30157 19067 19586 463 71
NC_025457.alt2 NC_025457.ref 93.5673 513 10149 10661 6915 7427 480 33
NC_025457.alt2 NC_025457.ref 89.3701 508 34017 34524 22188 22695 454 54
NC_025457.alt2 NC_025457.ref 88.0240 501 18330 18830 11549 12049 441 60
Column | Description |
---|---|
query | Identifier (name) of query sequence |
reference | Identifier (name) of reference sequence |
pident | Percent identity of local alignment |
alnlen | Alignment length |
qstart | Start of alignment in query |
qend | End of alignment in query |
rstart | Start of alignment in reference |
rend | End of alignment in reference |
nt_match | Number of matched (identical) nucleotides |
nt_mismatch | Number of mismatching nucleotides |
The cluster
command takes as input two TSV files (outputs of vclust align
) and clusters viral genomes using a user-selected clustering algorithm and similarity measures. Internally, Vclust uses Clusty, enabling large-scale clustering of millions of genome sequences by leveraging sparse distance matrices.
Vclust supports six clustering algorithms (--algorithm
):
single
: Single-linkage clusteringcomplete
: Complete-linkage clusteringuclust
: UCLUST-like clusteringcd-hit
: Greedy incremental clusteringset-cover
: Set-Cover (greedy)leiden
: the Leiden algorithm
Vclust performs threshold-based clustering by assigning a genome sequence to a cluster if its similarity (e.g., ANI) to the cluster meets or exceeds a user-defined threshold. The specific similarity criterion depends on the clustering algorithm and may refer to the nearest genome, the furthest genome, or the centroid within the cluster.
Users can choose from three similarity measures for clustering using the --metric
option: tani
, gani
, or ani
. The threshold provided for the selected measure determines the minimum similarity required to connect two genomes. For asymmetric measures like ANI and gANI (where the similarity between genome A and B differs from that between B and A), Vclust uses the maximum value as the weight for clustering.
# Cluster genomes based on tANI and connect genomes if their ANI ≥ 95%.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm single \
--metric ani --ani 0.95
Additionally, Vclust may use extra threshold values for various combinations of similarity measures (--ani
, --gani
, --tani
, --qcov
, --rcov
, --num_alns
, --len_ratio
). If a genome pair fails to meet the specified thresholds, it is excluded from clustering.
# Cluster genomes based on ANI similarity measure, but connect them only
# if ANI ≥ 95% and query coverage ≥ 85% and reference coverage ≥ 85%.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm single \
--metric ani --ani 0.95 --qcov 0.85 --rcov 0.85
Vclust leverages the Leiden algorithm via the igraph implementation integrated into Clusty. As with other Vclust algorithms, you can select from three similarity measures: tANI, gANI, or ANI. These measures define the edge weights used by the Leiden algorithm.
# Cluster genomes based on ANI similarity measure, but connect them only
# if ANI ≥ 95% and query coverage ≥ 85%.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85
If you wish to include coverage information in the edge weight, following the IMG/PR's approach (where edge weight = ANI x max(qcov, rcov)), use the gani
measure for clustering.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric gani --gani 0.5 --ani 0.95 --qcov 0.85
The granularity of clustering can be adjusted with the --leiden-resolution
parameter [range: 0-1, default: 0.7
]. Higher resolutions produce more, smaller clusters, while lower resolutions result in fewer, larger clusters.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85 --leiden-resolution 1
The clustering can be further refined using two additional parameters. The --leiden-beta
parameter (default: 0.01
) controls the randomness in the Leiden algorithm, allowing for fine-tuning of the clustering process. The --leiden-iterations
parameter (default: 2
) specifies the number of iterations for the Leiden algorithm, where increasing the number may lead to improved clustering quality.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85 --leiden-resolution 0.8 --leiden-beta 0.02 \
--leiden-iterations 3
The cluster command generates a TSV file with genome identifiers followed by 0-based cluster identifiers. The file includes all input genomes, both clustered and singletons, listed in the same order as the IDs file (sorted by decreasing sequence length). The first genome in each cluster is the representative. For example, cluster 0
has NC_005091.alt2
as its representative and includes NC_005091.alt1
and NC_005091.ref
.
object cluster
NC_025457.alt2 3
NC_005091.alt2 0
NC_005091.alt1 0
NC_005091.ref 0
NC_002486.alt 1
NC_002486.ref 1
NC_025457.ref 4
NC_025457.alt1 5
NC_010807.alt2 2
NC_010807.alt3 2
NC_010807.alt1 2
NC_010807.ref 2
Alternatively, instead of numerical cluster identifiers, Vclust can output a representative genome for each cluster using the --out-repr
option. The representative genome within each cluster is determined as the one with the longest sequence.
object cluster
NC_025457.alt2 NC_025457.alt2
NC_005091.alt2 NC_005091.alt2
NC_005091.alt1 NC_005091.alt2
NC_005091.ref NC_005091.alt2
NC_002486.alt NC_002486.alt
NC_002486.ref NC_002486.alt
NC_025457.ref NC_025457.ref
NC_025457.alt1 NC_025457.alt1
NC_010807.alt2 NC_010807.alt2
NC_010807.alt3 NC_010807.alt2
NC_010807.alt1 NC_010807.alt2
NC_010807.ref NC_010807.alt2
The following commands perform VIRIDIC-like analysis by calculating the total ANI (tANI) between complete virus genomes and classifying these viruses into species and genera based on 95% and 70% tANI, respectively.
# Create a pre-alignment filter for genome pairs with a minimum of 10 common k-mers
# and a minimum sequence identity of 70% (relative to the shortest sequence).
./vclust.py prefilter -i genomes.fna -o fltr.txt --min-kmers 10 --min-ident 0.7
# Calculate ANI measures for genome pairs specified in the filter.
./vclust.py align -i genomes -o ani.tsv --filter fltr.txt
# Assign viruses into putative species (tANI ≥ 95%).
./vclust.py cluster -i ani.tsv -o species.tsv --ids ani.ids.tsv --algorithm complete \
--metric tani --tani 0.95
# Assign viruses into putative genera (tANI ≥ 70%).
./vclust.py cluster -i ani.tsv -o genus.tsv --ids ani.ids.tsv --algorithm complete \
--metric tani --tani 0.70
The following commands assign contigs into viral operational taxonomic units (vOTUs) based on the MIUViG thresholds (ANI ≥ 95% and aligned fraction ≥ 85%).
# Create a pre-alignment filter.
./vclust.py prefilter -i genomes.fna -o fltr.txt --min-kmers 30 --min-ident 0.9
# Calculate ANI measures for genome pairs specified in the filter.
./vclust.py align -i genomes -o ani.tsv --filter fltr.txt
# Cluster contigs into vOTUs using the MIUVIG thresholds and the Leiden algorithm.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85
The following commands reduce the sequence dataset to representative genomes.
# Create a pre-alignment filter.
./vclust.py prefilter -i genomes.fna -o fltr.txt --min-kmers 30 --min-ident 0.90
# Calculate ANI measures for genome pairs specified in the filter.
./vclust.py align -i genomes.fna -o ani.tsv --filter fltr.txt --out-ani 0.9
# Cluster contigs using the CD-HIT algorithm and show representative genome.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm cd-hit \
--metric ani --ani 0.95 --qcov 0.85 --out-repr
The following command calculates ANI measures between all genome pairs in the dataset. For small datasets, using prefilter
is optional. However, note that without it, Vclust will perform all-versus-all pairwise sequence alignments.
./vclust.py align -i genomes.fna -o ani.tsv
The following commands help reduce RAM usage and hard disk storage by using prefilter on smaller, equally-sized batches of sequences. The example shows processing over 15 million metagenomic contigs from the IMG/VR database.
# Create a pre-alignment filter by processing batches of 5 million genomes.
./vclust.py prefilter -i genomes.fna -o fltr.txt --min-kmers 30 --min-ident 0.90 \
--batch-size 5000000
# Calculate ANI measures for genome pairs specified in the filter. Keep the output TSV
# file relatively small-sized: use the lite output format and report only genome pairs
# with ANI ≥ 95% and query coverage ≥ 85%.
./vclust.py align -i genomes -o ani.tsv --filter fltr.txt --outfmt lite \
--out-ani 0.95 --out-cov 0.85
# Cluster contigs into vOTUs using the MIUVIG thresholds and the Leiden algorithm.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85
Note
Please see: 8. Limitations
Vclust is efficient for comparing genome sequences of diverse viruses across a wide range of sequence identities. However, RAM usage and running time may increase drastically in large datasets of highly similar or nearly identical genomes (e.g., hundreds of thousands from the same species).
To ensure that Vclust works as expected, you can run tests using pytest.
pytest test.py
Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. bioRxiv [doi:10.1101/2024.06.27.601020].