(Proof of concept) Reference-based microbial sequencing data compression using SBWT and k-bounded matching statistics.
git clone https://github.com/tmaklin/ntcomp
cd ntcomp
cargo build --release
The built binary is located at target/release/ntcomp.
wget https://a3s.fi/maklinto-2006181-pub/ntcomp-test.tar
tar -xf ntcomp-test.tar
Build an index from a reference sequence
ntcomp build -o index -k91 test/GCA_964037205.1_30348_1_60_genomic.fna.gz
Larger values of -k produce better compression ratios at the cost of a larger .lcs file.
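For example, the trade-off can be checked by building two indexes from the same reference with different -k values and comparing the resulting file sizes (the output prefixes index-k31 and index-k91 are illustrative, and this assumes -o writes matching .sbwt and .lcs files as described below):

ntcomp build -o index-k31 -k31 test/GCA_964037205.1_30348_1_60_genomic.fna.gz
ntcomp build -o index-k91 -k91 test/GCA_964037205.1_30348_1_60_genomic.fna.gz
ls -lh index-k31.lcs index-k91.lcs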
Query FASTA/FASTQ data against an index and write the encoding
ntcomp encode --index index test/ERR10498075.fastq.gz > encoded.dat
Encoding data requires both the .sbwt and .lcs files.
Note that encoding discards quality scores; only the sequences are stored.
Decode the encoded sequences using the index
ntcomp decode --index index encoded.dat > decoded.fasta
In theory, decoding requires only the .sbwt file, but the tool will not run unless the .lcs file is also present.
Decoding will also work with a rebuilt SBWT, as the construction algorithm is deterministic.
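As a sketch of this, assuming the index is rebuilt from the same reference with the same -k value (the prefix index-rebuilt is illustrative):

ntcomp build -o index-rebuilt -k91 test/GCA_964037205.1_30348_1_60_genomic.fna.gz
ntcomp decode --index index-rebuilt encoded.dat > decoded-rebuilt.fasta
diff -s decoded.fasta decoded-rebuilt.fasta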
Verify that the decoded sequences match the original input (requires installing seqtk)
seqtk seq -A test/ERR10498075.fastq.gz | sed 's/ERR[0-9]*[.]\([0-9]*\).*$/seq.\1/g' > expected.fasta
diff -s decoded.fasta expected.fasta
ntcomp works by finding the longest common suffix match between each k-mer in an input nucleotide sequence and the k-mers in an SBWT index, and encodes the input as pairs of (longest common suffix length, suffix location in the SBWT).
If the index closely matches the sequencing reads, for example an assembly built from the same reads, and the reads are reasonably accurate, many of the k-mers in the reads are redundant: looking them up in the index uses less storage than storing their full sequences.
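The minimal Rust sketch below illustrates this pair encoding on a toy scale, using a hash map over reference k-mer suffixes as a stand-in for the SBWT and LCS structures. The function names, the use of a k-mer rank as the location, and the absence of any further compression of the pairs are illustrative assumptions and do not reflect ntcomp's actual implementation or on-disk format.

use std::collections::HashMap;

// Illustrative only: a hash map over every suffix (length 1..=k) of every
// reference k-mer stands in for the SBWT; the stored value is the rank of
// one reference k-mer ending with that suffix.
fn build_suffix_index(reference: &[u8], k: usize) -> HashMap<Vec<u8>, usize> {
    let mut index = HashMap::new();
    for (rank, kmer) in reference.windows(k).enumerate() {
        for d in 1..=k {
            index.entry(kmer[k - d..].to_vec()).or_insert(rank);
        }
    }
    index
}

// Encode a read as one (longest common suffix length, k-mer rank) pair per
// k-mer of the read, mirroring the pair encoding described above.
fn encode_read(read: &[u8], k: usize, index: &HashMap<Vec<u8>, usize>) -> Vec<(usize, usize)> {
    read.windows(k)
        .map(|kmer| {
            // Search from the longest possible suffix (the whole k-mer) down.
            for d in (1..=k).rev() {
                if let Some(&rank) = index.get(&kmer[k - d..]) {
                    return (d, rank);
                }
            }
            (0, 0) // no suffix of this k-mer occurs in the reference
        })
        .collect()
}

fn main() {
    let k = 5;
    let reference = b"ACGTACGTGGATCGATT";
    let read = b"ACGTACGTGGTTCGATT"; // one mismatch relative to the reference
    let index = build_suffix_index(reference, k);
    for (length, rank) in encode_read(read, k, &index) {
        println!("suffix match length {length}, reference k-mer rank {rank}");
    }
}

In this toy example, most k-mers of the read match a reference k-mer in full (length k), so a short pair suffices; only the k-mers overlapping the mismatch fall back to shorter suffix matches.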
ntcomp is dual-licensed under the MIT and Apache 2.0 licenses.