d2ssect (pronounced dissect) calculates an alignment-free distance between samples based on frequencies of shared kmers. Specifically, it provides a fast implementation of the D2S statistic which can be used as a standalone distance measure, or as input to a range of methods (eg see these tools) for phylogenetic and network analysis.
d2ssect
is available via pypi. Installation requires python 3.7 or greater as well as the jellyfish program and libraries. We recommend installation into a conda environment as follows
conda create -n d2ssect python=3.7 kmer-jellyfish
conda activate d2ssect
pip install d2ssect
d2ssect -h
Alternatively, you may use an existing Jellyfish installation, or install Jellyfish without using conda. If using this method please note that;
- Jellyfish version 2 is required (Jellyfish 1 will not work)
- Installation of Jellyfish via linux package managers will not work as this installs the jellyfish binary but not libraries and headers needed by
d2ssect
Once Jellyfish is installed you should then be able to install d2ssect
using pip or pip3 as follows
pip install d2ssect
Lets say we have a collection of fastq files corresponding to sequencing reads from different samples. We want to compare these with d2ssect
. First count kmers in these files using jellyfish
for f in *.fastq;do jellyfish count -m 21 -s 10000000 $f -o ${f%.fastq}.jf ;done
Note that the command above will create a corresponding .jf
file for every .fastq
file in the current directory. By keeping the base names of the jf
and fastq
files identical we can then run d2ssect
as follows;
d2ssect -l *.jf -f *.fastq
d2ssect
provides information on progress (sent to stderr) and will eventually produce a matrix of pairwise D2S values (one for each pair of samples) sent to stdout.