skani basic usage guide
# sketch genomes, output in sketch_folder, 20 threads
skani sketch genome1.fa genome2.fa ... -o sketch_folder -t 20
# use sketch file for computation
skani dist sketch_folder/genome1.fa.sketch sketch_folder/genome2.fa.sketch
sketch computes the sketch (a.k.a. index) of a genome and stores it in a new folder. For each file genome.fa, the new file sketch_folder/genome.fa.sketch is created.
The .sketch files can be used as faster drop-in substitutes for fasta files.
A special file markers.bin is also constructed and used specifically for the search command. Modifying the output folder may invalidate the search command.
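As a rough illustration of the naming rule above, the following shell sketch derives the sketch path expected for each input FASTA and reports inputs that have no sketch yet. The helper names expected_sketch_path and unsketched_inputs are hypothetical, and the sketch assumes that only the file name (not its directory) is carried into the output folder.

```shell
# Hypothetical helper: print the sketch path implied by the naming rule
# "genome.fa -> sketch_folder/genome.fa.sketch".
# Assumption: only the file name of the input is kept, not its directory.
expected_sketch_path() {
    local sketch_dir="$1" fasta="$2"
    printf '%s/%s.sketch\n' "$sketch_dir" "$(basename "$fasta")"
}

# Hypothetical helper: print every input FASTA whose sketch file is missing.
# Usage: unsketched_inputs SKETCH_DIR FASTA...
unsketched_inputs() {
    local sketch_dir="$1"
    shift
    local fasta
    for fasta in "$@"; do
        if [ ! -f "$(expected_sketch_path "$sketch_dir" "$fasta")" ]; then
            printf '%s\n' "$fasta"
        fi
    done
}
```

For example, `unsketched_inputs sketch_folder genome1.fa genome2.fa` would print only the genomes that still need to be passed to skani sketch.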
# query each individual record in a multi-fasta (--qi for query, --ri for reference)
skani dist --qi -q query1.fa -r ref1.fa
# use lists of fastas, one line per fasta
skani dist --rl ref_list.txt --ql query_list.txt
# estimate confidence interval
skani dist --ci sketch_folder/genome1.fa.sketch sketch_folder/genome2.fa.sketch
# turn off the ANI debiasing step
skani dist query1.fa ref1.fa --no-learned-ani
# only compare small contigs with >= 90% estimated ANI
# use -m 300 for better filtering on small contigs
skani dist --qi query.fa --ri ref.fa -s 90 -m 300
dist computes ANI between all queries and all references. dist loads all reference and query genomes into memory.
If you're searching against a database, search can use much less memory (see below). With default settings, the entire GTDB-R207 database (65000 bacterial genomes) takes about 95 GB of RAM with dist.
If you want to do all-to-all comparisons for the same set of genomes g1.fa g2.fa g3.fa, use skani triangle g1.fa g2.fa g3.fa -E instead of skani dist -q g1.fa g2.fa g3.fa -r g1.fa g2.fa g3.fa; triangle is 2x faster.
All skani ANI calculations, including dist, turn on a trained ANI debiasing step to make ANI more accurate. Use --no-learned-ani to turn it off.
See the advanced usage guide for information on optimally using skani when
- genomes are small (the -m parameter becomes important)
- genomes are very fragmented (-c becomes important)
- and more.
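The --rl and --ql options shown earlier expect a text file with one FASTA path per line. A minimal sketch for generating such a list from a directory follows. The helper name make_fasta_list, the directory names in the usage comment, and the set of file extensions are assumptions made for illustration.

```shell
# Hypothetical helper: write one FASTA path per line (sorted) to a list file.
# Assumption: FASTA files end in .fa, .fasta, or .fna; adjust as needed.
# Usage: make_fasta_list GENOME_DIR LIST_FILE
make_fasta_list() {
    local genome_dir="$1" list_file="$2"
    find "$genome_dir" -type f \
        \( -name '*.fa' -o -name '*.fasta' -o -name '*.fna' \) \
        | sort > "$list_file"
}

# Example usage (refs/ and queries/ are placeholder directories):
#   make_fasta_list refs/ ref_list.txt
#   make_fasta_list queries/ query_list.txt
#   skani dist --rl ref_list.txt --ql query_list.txt
```

Sorting the paths keeps the list stable between runs, which makes outputs easier to compare.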
# any algorithm options used in "sketch" will also be used for searching.
skani sketch genome1.fa genome2.fa ... -o database
# query query1.fa, query2.fa, ... against sketches in the database folder
skani search -d database query1.fa query2.fa ... -o output.txt
search is a memory-efficient method of calculating ANI against a large reference database. Searching against the GTDB database (> 65000 genomes) takes only 6 GB of memory using search. This is achieved by only fully loading genomes that pass a filter into memory, and discarding the index after each query is done.
The parameters for search are taken from the parameters used for the sketch command, so if you sketch with, say, -c 60, this value is implied when you run search.
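The parameter inheritance described above can be sketched as a small wrapper: -c 60 is given once at sketch time and is not repeated for search. The function name sketch_then_search and the output file name are hypothetical; only the flags already shown in this guide (-c, -o, -d) are used.

```shell
# Hypothetical wrapper illustrating that search reuses sketch-time parameters.
# Usage: sketch_then_search DB_DIR QUERY_FASTA REF_FASTA...
sketch_then_search() {
    local db_dir="$1" query="$2"
    shift 2
    # Algorithm options such as -c are chosen here, when the database is built.
    skani sketch "$@" -c 60 -o "$db_dir" || return 1
    # No -c is passed here: search takes it from the sketches in the database.
    skani search -d "$db_dir" "$query" -o "${query}.skani.txt"
}

# Example usage (file names are placeholders):
#   sketch_then_search database query1.fa genome1.fa genome2.fa
```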
If you're querying many sequences, the file I/O step will dominate the running time, so consider using dist instead if you have enough RAM.
# all-to-all ANI comparison in lower-triangular matrix
skani triangle genome1.fa genome2.fa genome3.fa -o lower_triangle_matrix.txt
# output sparse matrix a.k.a an edge list of comparisons
skani triangle -l list_of_genomes.txt -o sparse_matrix.txt --sparse
# output square matrix
skani triangle genome1.fa genome2.fa genome3.fa --full-matrix
triangle outputs a lower-triangular matrix in PHYLIP format. The ANI is output to stdout or the file specified. The aligned fraction is also output in a separate file with the suffix .af attached.
Unlike dist, triangle avoids doing n^2 computations and only does n(n-1)/2 computations, so it is more efficient. It also sets some smarter default parameters for all-to-all search.
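To make the difference concrete, the following shell arithmetic evaluates both counts for a given number of genomes. The value n=1000 is an arbitrary example, not a figure from skani.

```shell
# Number of genomes in the all-to-all comparison (arbitrary example value).
n=1000

# Every query against every reference, including self and mirrored pairs.
all_pairs=$(( n * n ))

# Each unordered pair once, with no self-comparisons.
unique_pairs=$(( n * (n - 1) / 2 ))

echo "n=$n all_pairs=$all_pairs unique_pairs=$unique_pairs"
# prints: n=1000 all_pairs=1000000 unique_pairs=499500
```

For large n the ratio approaches 2, which is consistent with the 2x speedup mentioned for triangle over dist.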
For very large data sets, use -E or --sparse instead, which only outputs non-zero entries in an edge-list output.
triangle loads all genome indices into memory. For doing comparisons on massive data sets, see the Advanced section for suggestions on reducing memory cost.