Changelog: 4.0 Improvements in discovering SNPs
In kSNP3, the problem of discovering Single Nucleotide Polymorphisms (SNPs) was broken down into simple individual steps that could be performed with standard Unix utilities and short scripts. This, plus partitioning the data into 256 pieces, meant that individual steps went pretty fast, and the problem could be spread across all of the CPUs.
One of the challenges was that there were so many individual steps and pieces that it took thousands of commands to process even a limited number of genomes. Modern operating systems are fast, but it still takes time to start up a new process.
The next challenge was that, likely because the pipeline was built in pieces, with the authors checking at each step during development to make sure the data was correct, there were extraneous steps. For example, the data was sorted at multiple points along the path. Sorting a small file goes quickly, but doing that for 256 small files, repeatedly, adds up over a full run.
Next, because all of the partitions were processed through the first step before the second step started, all of the results of the first step had to be written to disk, only to be read back almost immediately for the second step (and likewise for successive steps). This meant a lot of disk activity that could be avoided by completing all of the steps for one partition, one after another, without writing intermediate results to disk.
Finally, every time the data was processed (i.e. for each step), it needed to be parsed and then loaded into an in-memory data structure, which was then discarded once that step was complete.
My approach to these challenges was to parse the data from disk into memory once. Once loaded, the data could be processed one candidate SNP at a time, completely processing that candidate, including a (buffered) write to disk, before moving on to the next.
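The one-pass flow described above can be sketched as follows. This is a minimal illustration with hypothetical names (`process_candidates`, `classify`), not the actual kSNP4 code:

```python
import io

def process_candidates(candidates, classify, out):
    """Parse-once, process-once: each candidate is fully handled, and its
    result written through a buffered handle, before moving to the next."""
    for masked_kmer, record in candidates.items():
        verdict = classify(record)  # all per-candidate steps happen here
        if verdict is not None:
            # Buffered write; no intermediate per-step files on disk.
            out.write(f"{masked_kmer}\t{verdict}\n")

# Toy usage: a candidate with two center mers is reported, a candidate with
# only one center mer is skipped.
out = io.StringIO()
process_candidates(
    {"AC.TA": {"A": 1, "G": 2}, "GG.CC": {"T": 3}},
    lambda record: "SNP" if len(record) > 1 else None,
    out,
)
```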
In addition, I made some data-structure choices intended to permit very quick processing of individual SNP candidates. Those choices:
- The candidates are stored in a hash table to permit fast random access during ingest (the kmers being ingested are unsorted and come from different genomes). The key for this hash table is the kmer with the center mer masked out, so that all kmers belonging to the same SNP candidate hash to the same record. This data structure was also used in kSNP3.
- A SNP candidate record is stored as a hash table of integers (note that in Python 3, integers are of arbitrary size, limited only by available RAM), with the hash key being the center mer corresponding to a kmer from a genome. Thus a given kmer maps to a single SNP candidate, and within that candidate to a single integer.
- The integer is a bit-field where each bit represents an input genome. The result of this is that:
- Ingest, which is done genome-by-genome, is two successive hash table lookups, followed by a bitwise or operation that sets the bit for that genome.
- All conflicts (cases where two kmers differing only in the center mer occur in the same genome) can be calculated with 12 bitwise operations: one bitwise and followed by one bitwise or for each of the 6 combinations of center mers (AC, AG, AT, CG, CT, GT).
- The common case (no SNP) is then easily filtered out with five additional bitwise operations (one not to invert the conflict mask, then one and to mask conflicts from each center mer's bitfield), after which the number of non-zero results is counted.
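Under the representation described above, the ingest and filtering steps can be sketched roughly like this. This is illustrative code, not the actual kSNP4 implementation, and the function names are hypothetical:

```python
from itertools import combinations

BASES = "ACGT"

def ingest(candidates, genome_index, kmer):
    """Record one kmer from one genome: look up the masked-kmer record,
    then the center-mer bitfield, and set this genome's bit."""
    mid = len(kmer) // 2
    masked = kmer[:mid] + "." + kmer[mid + 1:]
    record = candidates.setdefault(masked, {})
    record[kmer[mid]] = record.get(kmer[mid], 0) | (1 << genome_index)

def is_snp(record):
    """Filter out the common (no-SNP) case: mask genomes with internal
    conflicts, then require at least two center mers to survive."""
    # Conflicts: one and plus one or for each of the 6 center-mer pairs.
    conflicts = 0
    for a, b in combinations(BASES, 2):
        conflicts |= record.get(a, 0) & record.get(b, 0)
    # One not, then one and per center mer, then count non-zero results.
    clean = ~conflicts
    return sum(1 for base in BASES if record.get(base, 0) & clean) >= 2
```

For example, a candidate whose center mer is A in genome 0 and G in genome 1 yields two non-zero bitfields and is kept, while a candidate whose differing center mers all occur within one genome is masked out as a conflict.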
These operations are performed in Python, and are sufficiently fast at this time to represent a significant improvement in speed. However, this implementation can possibly be tuned further, including by:
- Replacing the hash table of center mers with an array.
- Replacing the central loop with optimized C or assembly for the common case (no SNP).
- Identifying potential common cases that permit detection of non-SNP earlier, including:
- Identical kmer across multiple genomes (no polymorphism)
- Kmers differing in central nucleotide that occur only in a single genome (conflict with no chance of polymorphism)
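The two early-exit checks suggested above could look roughly like this, assuming the bitfield representation described earlier (the name `can_skip` is hypothetical):

```python
def can_skip(record):
    """Cheap pre-checks that rule a candidate out before the full
    conflict/filter pipeline runs. `record` maps center mer -> genome bitfield."""
    bitfields = list(record.values())
    # Identical kmer across all genomes: only one center mer was ever seen,
    # so there is no polymorphism.
    if len(bitfields) == 1:
        return True
    # All differing kmers occur in a single genome: the combined bitfield is a
    # power of two, i.e. a pure conflict with no chance of polymorphism.
    combined = 0
    for bf in bitfields:
        combined |= bf
    return combined & (combined - 1) == 0
```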
This program writes the SNPs it finds to per-genome files in two different formats. To avoid repeatedly opening and closing files, it opens all output files at once. For large numbers of genomes, this will hit OS limits, so you will likely need to use the ulimit command for your OS to increase the permitted number of concurrently open files.
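For example, on Linux or macOS with a POSIX shell, something like the following raises the limit for the current session. The value 4096 is only an illustration; choose one large enough for your genome count and no larger than the hard limit:

```shell
# Show the hard (maximum) and current soft limits on open files
ulimit -Hn
ulimit -Sn

# Raise the soft limit for this shell session before launching kSNP4
ulimit -Sn 4096
```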
As implemented today, kSNP4 processes all kmer data at one time. This has an impact on performance, and specifically on memory usage. If your input data does not completely fit in memory, this step will cause significant swapping, and could cause the OOM killer (on Linux) to kill your process, or some other process that you hold very dear. This is an item we plan to improve in future revisions.