
Partitioning

What is partitioning?

Partitioning means taking a dataset and breaking it into sections in such a way that you can process the data in one section without needing to look at data in any other section.

Why do partitioning at all?

Partitioning helps in a few ways. First, by breaking the problem into multiple smaller chunks that don't depend on each other, you can go faster by working on multiple chunks at the same time. In particular, this makes it easy to use multiple CPUs: with N cores you can gain nearly a factor-of-N improvement in speed.

Breaking the problem into chunks can also reduce the computational resources (e.g. memory) required at any given time. For example, if you have 10 cores and break the data into 100 chunks, you only process about 10% of the data at any one time, which means you use only about 10% of the memory required to process all of the data at once (as long as the memory use of your algorithm is proportional to the amount of data being processed).

Okay, that makes sense. How did kSNP3 partition the data?

If you have examined the temporary files generated by kSNP3.x, you may have noticed a bunch of files named AAAA.mer, AAAC.mer, ..., TTTG.mer, TTTT.mer (or the like). The data is partitioned based on a short prefix of each individual kmer (see the sketch after this list). Two important things to notice about this:

  1. There are a LOT of files (256). This isn't in and of itself a problem; computer systems are designed to deal with lots of files, and 256 isn't even that many (tens of thousands is totally fine for even legacy filesystems like ext3). But creating, reading, processing, writing, and deleting files is not free.

  2. These files vary significantly in size: most are small, and for common datasets quite a few are zero length. This meant that every processing step had to start 256 times, process a small amount of data (or none at all), and then quit.
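
To make the old scheme concrete, here is a minimal Python sketch of fixed-prefix partitioning; the function name, file handling, and input format (one kmer at the start of each line) are illustrative assumptions, not kSNP3's actual code.

```python
from itertools import product

def partition_by_prefix(kmer_lines, prefix_len=4):
    """Route each kmer line into one of 4**prefix_len prefix-named .mer files."""
    # Pre-create every possible prefix file, mirroring the observed behaviour
    # where many of the 256 files end up zero length.
    handles = {"".join(p): open("".join(p) + ".mer", "w")
               for p in product("ACGT", repeat=prefix_len)}
    try:
        for line in kmer_lines:
            # The partition key is simply the kmer's first prefix_len bases.
            handles[line[:prefix_len]].write(line)
    finally:
        for fh in handles.values():
            fh.close()
```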

What could be better?

The changes made were twofold. It makes sense to process as much data as the computer can handle at one time, so partitions should be larger; the previous sizes were well under current computing capacities, and time spent starting and finishing many tiny chunks is time not spent processing data. In addition, partitions should be similarly sized, so that no one chunk drags on after the others are done.

The amount of data a computer can handle at once for a CPU-bound process is determined by the number of CPUs. This means that instead of separating the data into 256 partitions, it makes sense to separate it into at least C partitions (*), where C is the number of CPU cores. Using C separate single-threaded processes, each working on one of C different partitions, we can process the entire dataset at once, saturating the CPU capacity of the system.

We want to separate the input data into at least C different partitions, but we also want those partitions to be close to the same size, so that all of the CPUs are as active as possible throughout the process (if one partition finishes much earlier, its core sits idle while the others catch up, and the process is less efficient). This doesn't need to be exact, but we want it to be close.

(*) There is a little hand-waving here: this process can be memory-bound as well as CPU-bound. Let us know if you have that experience.
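
As an illustration of the argument above, here is a hedged Python sketch of running C single-threaded workers over the partitions; process_partition is a hypothetical stand-in for whatever CPU-bound step (like SNP finding) consumes each partition, not an actual kSNP program.

```python
import multiprocessing as mp

def process_partition(path):
    """Placeholder for the real CPU-bound work on one partition file."""
    ...

def process_all(partition_paths):
    cores = mp.cpu_count()  # C in the discussion above
    # C single-threaded workers, each consuming one partition at a time,
    # keep all cores busy as long as there are at least C partitions.
    with mp.Pool(processes=cores) as pool:
        pool.map(process_partition, partition_paths)
```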

How it is done

We created a new program titled 'guessPartition'. It is passed a representative input file and the number of partitions (N) to create. It reads a small number (100,000) of unsorted kmers (as you might see in the output of jellyfish) and uses them as a representative sample. From that sample it identifies N-1 kmer prefix values (no longer than (k/2)-1, to avoid splitting SNPs across two different partitions) that divide the data into N approximately equal partitions. This program runs very quickly (well under a second).
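
A minimal sketch of the idea, assuming jellyfish-style dump output with the kmer first on each line; the real guessPartition may differ in sampling, tie handling, and output format.

```python
import itertools

def guess_partition(sample_path, n_partitions, k, sample_size=100_000):
    """Pick n_partitions - 1 prefix cut points from a sample of unsorted kmers."""
    # Prefixes no longer than (k/2)-1 keep kmers of the same SNP together.
    prefix_len = (k // 2) - 1
    with open(sample_path) as fh:
        prefixes = sorted(
            line.split()[0][:prefix_len]
            for line in itertools.islice(fh, sample_size)
        )
    # Cut points at evenly spaced quantiles of the sorted sample approximate
    # N partitions of roughly equal size.
    step = len(prefixes) / n_partitions
    return [prefixes[int(i * step)] for i in range(1, n_partitions)]
```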

The key is that the separator is much more granular (longer), so it can split the input data more finely. Separating this way means that an arbitrary number of different prefixes can land in any given partition, while kmers comprising the same SNP are still guaranteed to end up in the same partition. (**)

Then we created a second new program, 'partitionKmers', to partition the data. It takes the prefix values generated by 'guessPartition' and writes each line of source data to a file based on which partition that line's kmer belongs in. This takes a little longer than 'guessPartition', since all of the data is being processed, but it is primarily I/O bound. As long as the process you are feeding is sufficiently expensive in CPU (like finding SNPs), this is a win.
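
Continuing the sketch above, routing a full data file against those cut points might look like the following; the output naming and lack of write buffering are simplifying assumptions.

```python
import bisect

def partition_kmers(input_path, cutpoints, out_prefix="part"):
    """Write each input line to the partition file its kmer sorts into."""
    outputs = [open(f"{out_prefix}{i}.mer", "w")
               for i in range(len(cutpoints) + 1)]
    try:
        with open(input_path) as fh:
            for line in fh:
                kmer = line.split()[0]
                # Binary search over the sorted cut points finds which
                # interval, and therefore which partition, this kmer falls in.
                outputs[bisect.bisect_right(cutpoints, kmer)].write(line)
    finally:
        for fh in outputs:
            fh.close()
```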

(**) I expect this to be sufficient for the vast majority of uses (including well beyond 256 partitions), but please reach out if you hit edge cases here.