(Proof of concept) Reference-based microbial sequencing data compression using SBWT and k-bounded matching statistics.
git clone https://github.com/tmaklin/ntcomp
cd ntcomp
cargo build --release
The built binary is located at target/release/ntcomp.
wget https://a3s.fi/maklinto-2006181-pub/ntcomp-test.tar
tar -xf ntcomp-test.tar
Build an index from a reference sequence
ntcomp build -o index -k91 test/GCA_964037205.1_30348_1_60_genomic.fna.gz
Larger values of -k produce better compression ratios at the cost of a larger .lcs file.
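For example, the trade-off can be checked by building two indexes from the same reference with different -k values and comparing the resulting file sizes (the output prefixes index-k31 and index-k91 are illustrative, and this assumes -o writes matching .sbwt and .lcs files as described below):

ntcomp build -o index-k31 -k31 test/GCA_964037205.1_30348_1_60_genomic.fna.gz
ntcomp build -o index-k91 -k91 test/GCA_964037205.1_30348_1_60_genomic.fna.gz
ls -lh index-k31.lcs index-k91.lcs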
Query FASTA/FASTQ data against an index and write the encoding
ntcomp encode --index index test/ERR10498075.fastq.gz > encoded.dat
Encoding data requires both the .sbwt and .lcs files.
Note that encoding discards quality scores; only the sequences are stored.
Decode the encoded sequences using the index
ntcomp decode --index index encoded.dat > decoded.fasta
In theory, decoding requires only the .sbwt file, but the tool will not run unless the .lcs file is also present.
Decoding will also work with a rebuilt SBWT, as the construction algorithm is deterministic.
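As a sketch of this, assuming the index is rebuilt from the same reference with the same -k value (the prefix index-rebuilt is illustrative):

ntcomp build -o index-rebuilt -k91 test/GCA_964037205.1_30348_1_60_genomic.fna.gz
ntcomp decode --index index-rebuilt encoded.dat > decoded-rebuilt.fasta
diff -s decoded.fasta decoded-rebuilt.fasta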
Verify that the decoded sequences match the original input (requires installing seqtk)
seqtk seq -A test/ERR10498075.fastq.gz | sed 's/ERR[0-9]*[.]\([0-9]*\).*$/seq.\1/g' > expected.fasta
diff -s decoded.fasta expected.fasta
ntcomp works by finding the longest common suffix match between each k-mer in an input nucleotide sequence and the k-mers in an SBWT index, and encodes the input as pairs of (longest common suffix length, suffix location in the SBWT).
If the index closely matches the sequencing reads, for example an assembly built from the same reads, and the reads are reasonably accurate, many of the k-mers in the reads are redundant: looking them up in the index uses less storage than storing their full sequences.
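The minimal Rust sketch below illustrates this pair encoding on a toy scale, using a hash map over reference k-mer suffixes as a stand-in for the SBWT and LCS structures. The function names, the use of a k-mer rank as the location, and the absence of any further compression of the pairs are illustrative assumptions and do not reflect ntcomp's actual implementation or on-disk format.

use std::collections::HashMap;

// Illustrative only: a hash map over every suffix (length 1..=k) of every
// reference k-mer stands in for the SBWT; the stored value is the rank of
// one reference k-mer ending with that suffix.
fn build_suffix_index(reference: &[u8], k: usize) -> HashMap<Vec<u8>, usize> {
    let mut index = HashMap::new();
    for (rank, kmer) in reference.windows(k).enumerate() {
        for d in 1..=k {
            index.entry(kmer[k - d..].to_vec()).or_insert(rank);
        }
    }
    index
}

// Encode a read as one (longest common suffix length, k-mer rank) pair per
// k-mer of the read, mirroring the pair encoding described above.
fn encode_read(read: &[u8], k: usize, index: &HashMap<Vec<u8>, usize>) -> Vec<(usize, usize)> {
    read.windows(k)
        .map(|kmer| {
            // Search from the longest possible suffix (the whole k-mer) down.
            for d in (1..=k).rev() {
                if let Some(&rank) = index.get(&kmer[k - d..]) {
                    return (d, rank);
                }
            }
            (0, 0) // no suffix of this k-mer occurs in the reference
        })
        .collect()
}

fn main() {
    let k = 5;
    let reference = b"ACGTACGTGGATCGATT";
    let read = b"ACGTACGTGGTTCGATT"; // one mismatch relative to the reference
    let index = build_suffix_index(reference, k);
    for (length, rank) in encode_read(read, k, &index) {
        println!("suffix match length {length}, reference k-mer rank {rank}");
    }
}

In this toy example, most k-mers of the read match a reference k-mer in full (length k), so a short pair suffices; only the k-mers overlapping the mismatch fall back to shorter suffix matches.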
ntcomp is dual-licensed under the MIT and Apache 2.0 licenses.