Skip to content

Changelog: 4.0 kmer filtering improvements

kissake edited this page Nov 29, 2022 · 1 revision

Filtering kmers

A number of steps are taken to improve the collection of kmers from the original genome data in kSNP4 (over kSNP3). This improvement is small, but will add up as the number of genomes increases, or the size of the genomes increases.

First, a new version of jellyfish is being used which removes limitations on the kmer length (formerly 32), and which simplifies processing of the output counts by placing them in a single output file. This is better than previous jellyfish versions which could require merging if the data overflowed into multiple files.

Second, the output of the jellyfish count is statistically analyzed while it is being written to disk. This avoids having to read the data back from disk just to perform statistics on it.

Third, a common case for the input is a complete genome assembly. We optimize for this case (without significant cost to the worst case) in performing the statistical analysis by collecting data in such a way as to minimize memory usage and pre-calculate where possible by:

  • Counting the number of kmers with only one occurrence, rather than repeatedly recording a ‘1’ count in an array (saves memory during load), and
  • Incrementing the total count as data is read, to make calculating the average at the end trivial (sum / number of kmers) without re-processing the data, and
  • Testing for the common case by comparing the number of unique kmers with the total number of kmers processed to determine if the median value is one.

It is anticipated that ~ 99% of kmers in a complete genome will be unique if the kmer length is chosen correctly, which Kchooser4 should facilitate. In this case, well over 50% of the kmer counts will be 1, which means that the median count must be 1. Further if anywhere close to 99% of the frequencies are 1, the average is quite likely to be very close to 1, which means the cutoff frequency for kmers is usually 1 for complete genomes.

In the case where the input does not fall into the common case, the collected data is passed to a statistics library to process in detail, as it was before.

The statistics calculated are used to determine which (if any) kmers are insufficiently frequent to be considered. For the case where the statistical analysis indicates that unique kmers (kmers with a frequency of one) are to be included (again, the vast majority of the case for complete genomes and a correct kmer length), we skip filtering the output of jellyfish entirely, as no kmers are included in the output with a lower frequency than one.

The upshot of the above optimizations mean that only non-complete genomes cost noticeable time in statistical analysis or filtering out insufficiently frequent kmers, and that case costs the same time as kSNP3.

Clone this wiki locally