Man page: guessPartition

kissake edited this page Oct 21, 2022 · 1 revision

guessPartition

NAME

SYNOPSIS

Usage

guessPartition [-h] [--all] [--samples SAMPLES] [--input INPUT] [--debug] partitions

DESCRIPTION

Overview

Guesses good values for partitioning lists of kmers based on a subset of example data.

Options

  • partitions - The number of buckets of data you want to generate. The output is a series of kmer prefixes, one fewer than the number of partitions.

  • --input INPUT - The input file to read from (overrides the default: STDIN)

  • --samples SAMPLES - Override the default number of samples (100,000 lines).

  • --all - Parse all values in the source file. Useful if the source file is sorted.

  • --debug - Output diagnostic data to STDERR

  • -h, --help - Show this help message and exit

Note

Sampling only works well on UNSORTED input, because the samples are taken from the start of the file. If the input is sorted, use --all to ensure that you don't get worst-case behavior.

Details

By default, this program reads the first 100,000 lines, interprets the first whitespace-separated string on each line as a kmer, sorts the kmers, and then divides them evenly into the number of partitions requested. At the edge of each partition, it outputs the first (k/2)-1 characters of the kmer at the dividing line as the output prefix.
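The procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the program's actual source: the function name guess_partitions, the use of integer division for k/2, and the exact choice of boundary index are all assumptions.

```python
from itertools import islice


def guess_partitions(lines, partitions, samples=100_000):
    """Return partitions-1 kmer prefixes that split the sampled kmers evenly.

    Hypothetical helper; names and rounding choices are assumptions.
    """
    # Take the first whitespace-separated field of each sampled line as a kmer.
    kmers = []
    for line in islice(lines, samples):
        fields = line.split()
        if fields:
            kmers.append(fields[0])
    kmers.sort()

    if not kmers or partitions < 2:
        return []

    prefixes = []
    for i in range(1, partitions):
        # Kmer sitting at the i-th dividing line of the sorted sample.
        boundary = kmers[i * len(kmers) // partitions]
        # First (k/2)-1 characters, assuming k is the kmer length
        # and k/2 is integer division.
        prefix_len = max(len(boundary) // 2 - 1, 0)
        prefixes.append(boundary[:prefix_len])
    return prefixes


if __name__ == "__main__":
    sample = ["AAAAAAAA 1", "CCCCCCCC 2", "GGGGGGGG 3", "TTTTTTTT 4"]
    print(guess_partitions(sample, 4))
```

With the four 8-character kmers in the sample, asking for 4 partitions yields three 3-character prefixes, one per dividing line.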

If and to the extent that these are representative kmers, this will create partition separators that will separate kmers into similarly sized buckets.
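To illustrate how such separators could divide kmers into buckets, here is a minimal sketch. This page does not specify how downstream tools consume the prefixes, so the function name bucket_for and the rule used here (plain lexicographic comparison against the sorted separators) are assumptions.

```python
from bisect import bisect_right


def bucket_for(kmer, separators):
    """Return the 0-based bucket index for kmer.

    separators must be sorted; N separators define N+1 buckets.
    Hypothetical helper; the comparison rule is an assumption.
    """
    # Count how many separators compare less than or equal to the kmer.
    return bisect_right(separators, kmer)


if __name__ == "__main__":
    separators = ["CCC", "GGG", "TTT"]
    for kmer in ["AAAAAAAA", "CCCCCCCC", "GGGAAAAA", "TTTTTTTT"]:
        print(kmer, bucket_for(kmer, separators))
```

A kmer that sorts before the first separator lands in bucket 0, and one that sorts at or after the last separator lands in the final bucket.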

The value 100,000 was picked empirically; on my system this command returns partition information from an unsorted input in well under a second. The intention is to quickly determine how to break up large files so that the buckets / partitions are approximately the same size, not to make them exactly the same size.

It is likely important to use the same partitions across all input files, to ensure that each input file's partitions match those of the other files. If you find yourself running this program multiple times within a single job, I'd be interested to hear about the reasons for that.

EXAMPLE

guessPartition --input kmers.fsplit0 8 > partitions

SEE ALSO