This is a reimplementation of the SKA package in the Rust language, by Johanna von Wachsmann, Simon Harris and John Lees. We are also grateful to have received user contributions from:
- Romain Derelle
- Tommi Mäklin
- Joel Hellewell
- Timothy Russell
- Nicholas Croucher
- Dan Lu
Split k-mer analysis (version 2) uses exact matching of split k-mer sequences to align closely related sequences, typically small haploid genomes such as bacteria and viruses.
SKA can only align SNPs that are further apart than the k-mer length, and it does not use a gap penalty approach or give alignment scores. Its advantages are speed and flexibility, particularly the ability to run in a reference-free manner (i.e. including accessory genome variation) on both assemblies and reads.
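To make the idea of exact split k-mer matching concrete, here is a minimal sketch in Rust. It is not the code used by this package: the function names `split_kmers` and `middle_base_differences` are hypothetical, sequences are kept as plain bytes, and reverse complements and repeated split k-mers are ignored for simplicity.

```rust
use std::collections::HashMap;

/// Split every k-mer of `seq` (k odd) into its two flanking arms and its
/// middle base, returning a map from the concatenated arms to the middle base.
/// Simplification (an assumption of this sketch): forward strand only, and a
/// repeated pair of arms simply overwrites the earlier middle base.
fn split_kmers(seq: &[u8], k: usize) -> HashMap<Vec<u8>, u8> {
    assert!(k >= 3 && k % 2 == 1, "k must be odd and at least 3");
    let half = k / 2;
    let mut dict = HashMap::new();
    for window in seq.windows(k) {
        let mut arms = Vec::with_capacity(k - 1);
        arms.extend_from_slice(&window[..half]);
        arms.extend_from_slice(&window[half + 1..]);
        dict.insert(arms, window[half]);
    }
    dict
}

/// List the split k-mers whose arms match exactly in both samples but whose
/// middle bases differ, i.e. candidate SNPs.
fn middle_base_differences(
    a: &HashMap<Vec<u8>, u8>,
    b: &HashMap<Vec<u8>, u8>,
) -> Vec<(Vec<u8>, u8, u8)> {
    let mut diffs: Vec<_> = a
        .iter()
        .filter_map(|(arms, &base_a)| match b.get(arms) {
            Some(&base_b) if base_b != base_a => Some((arms.clone(), base_a, base_b)),
            _ => None,
        })
        .collect();
    diffs.sort(); // HashMap iteration order is arbitrary, so sort for stable output
    diffs
}
```

With k = 5, the sequences `ACGTACCA` and `ACGTTCCA` share the arms `GT` and `CC` and differ only in the middle base (A versus T), so that single site is reported. A second SNP falling inside the arms would break the exact match, which is why SNPs closer together than the k-mer length cannot be aligned.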
Romain Derelle, Johanna von Wachsmann, Tommi Mäklin, Joel Hellewell, Timothy Russell, Ajit Lalvani, Leonid Chindelevitch, Nicholas J. Croucher, Simon R. Harris, John A. Lees (2024). Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis. Genome Research, 34(10), 1661–1673.
https://genome.cshlp.org/content/34/10/1661.abstract
Documentation can be found at https://docs.rs/ska. We also have some tutorials available.
Choose from:
1. Download a binary from the releases.
2. Use `cargo install ska` or `cargo add ska`.
3. Use `conda install -c bioconda ska2` (note the two!).
4. Build from source.
For 2) or 4) you must have the Rust toolchain installed.
If you have an M1/M2 (arm64) Mac, we do not currently build binaries automatically, so we recommend option 2) or 4) for best performance.
If you get a message saying the binary isn't signed by Apple and can't be run, use the following command to bypass this:
xattr -d "com.apple.quarantine" ./ska
- Clone the repository with `git clone`.
- Run `cargo install --path .` or `RUSTFLAGS="-C target-cpu=native" cargo install --path .` to optimise for your machine.
Optimisations include:
- Integer DNA encoding, optimised parsing from FASTA/FASTQ.
- Faster dictionaries.
- Full parallelisation of build phase.
- Smaller, standardised input/output files. Faster to save/load.
- Reduced memory footprint and increased speed with read filtering.
And other improvements:
- IUPAC uncertainty codes for multiple copy split k-mers.
- Uncertainty with self-reverse-complement split k-mers (palindromes).
- Fully dynamic files (merge, delete samples).
- Native VCF output for map.
- Support for known strand sequence (e.g. RNA viruses).
- Stream to STDOUT, or file with `-o`.
- Simpler command line combining `ska fasta`, `ska fastq`, `ska alleles` and `ska merge` into the new `ska build`.
- Option for single commands to run `ska align` or `ska map`.
- New coverage model for filtering FASTQ files with `ska cov`.
- Logging.
- CI testing.
All of which make `ska.rust` run faster and with smaller file size and memory footprint than the original.
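The IUPAC handling of multiple copy split k-mers mentioned in the list above can be sketched as follows. This is one plausible way to do it, not necessarily how the package does: each code is treated as a 4-bit set of bases (A=1, C=2, G=4, T=8), and two observations of the same arms are merged by taking the union. The names `merge_codes` and `add_middle_base` are hypothetical.

```rust
use std::collections::HashMap;

/// IUPAC nucleotide codes indexed by a 4-bit set of bases (A=1, C=2, G=4, T=8).
/// Index 0 (the empty set) is a placeholder and is never returned.
const IUPAC: [u8; 16] = *b"-ACMGRSVTWYHKDBN";

/// Bit set of bases represented by an IUPAC code, or None if unrecognised.
fn base_mask(code: u8) -> Option<usize> {
    let upper = code.to_ascii_uppercase();
    IUPAC.iter().position(|&c| c == upper).filter(|&mask| mask != 0)
}

/// Merge two codes into the IUPAC code covering both, e.g. A + G -> R.
fn merge_codes(a: u8, b: u8) -> Option<u8> {
    Some(IUPAC[base_mask(a)? | base_mask(b)?])
}

/// Record a middle base for a pair of arms. If the arms were already seen,
/// widen the stored code to cover both observations.
fn add_middle_base(dict: &mut HashMap<Vec<u8>, u8>, arms: &[u8], middle: u8) {
    dict.entry(arms.to_vec())
        .and_modify(|existing| {
            if let Some(merged) = merge_codes(*existing, middle) {
                *existing = merged;
            }
        })
        .or_insert(middle);
}
```

Under the same scheme, a self-reverse-complement split k-mer could be stored with its middle base merged with its complement (A with T gives W, C with G gives S), which is one way to express the palindrome uncertainty described above.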
Planned features:
- Sparse data structure which will reduce space and make parallelisation more efficient. Issue #47.
- 'fastcall' mode. Issue #52.
- Add support for ambiguity in VCF output (`ska map`). Issue #5.
- Non-serial loading of .skf files (for when they are very large). Issue #22.
- Alternative mixture models for read error correction. Issue #50.
Things you can no longer do, or that behave differently from version 1:
- Use k > 63 (shouldn't be necessary? Let us know if you need this and why).
- `ska annotate` (use bedtools).
- `ska compare`, `ska humanise`, `ska info` or `ska summary` (replaced by `ska nk --full-info`).
- `ska unique` (you can parse `ska nk --full-info` if you want this functionality, but we didn't think it was used much).
- `ska type` (use PopPUNK instead of MLST 🙂)
- Ns are always skipped, and will not be found in any split k-mers.
- `.skf` files are not backwards compatible with version 1.