Man page: partitionKmers

kissake edited this page Oct 21, 2022 · 1 revision

partitionKmers (1)

NAME

partitionKmers - Separate kmers in an input file into an arbitrary number of different output files such that each output file will cover a subset of potential SNPs completely.

SYNOPSIS

Usage

partitionKmers [-h] [--partition PARTITION] [--suffix SUFFIX] [--col COL] [--debug] kmerFile [kmerFile ...]

DESCRIPTION

Overview

Take input files containing lists of kmers and distribute their contents line-by-line to different sub-files (buckets) based on the kmer value, such that like kmers are in the same bucket.

Options

  • kmerFile - One or more input files to be split into multiple partitions.

  • --partition PARTITION - A partition file containing, one per line, the kmer prefixes to use as the dividers between output files. If this option is not provided, partitions will be read from standard input.

  • --suffix SUFFIX - The suffix to use before the number indicating which part is being output.

  • --col COL - The (whitespace separated) column where kmers are to be found in each line of input data. Data will be partitioned based on the value of this column.

  • --debug - Output diagnostic data to the screen using STDERR.

  • -h, --help - Show a brief usage and help message and exit.
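As an illustration of the --partition behaviour described above, the following is a minimal Python sketch of loading the prefix list from a named file or, when no file is given, from standard input. The function names read_prefixes and load_prefixes are hypothetical and are not taken from the partitionKmers source; skipping blank lines is also an assumption.

```python
import sys


def read_prefixes(stream):
    """Return the non-empty lines of stream, stripped of whitespace.

    Assumption: blank lines carry no prefix and are skipped.
    """
    return [line.strip() for line in stream if line.strip()]


def load_prefixes(path=None):
    """Read prefixes from path, or from standard input when path is None."""
    if path is None:
        return read_prefixes(sys.stdin)
    with open(path) as handle:
        return read_prefixes(handle)
```

Calling load_prefixes("partition.txt") would mirror passing --partition partition.txt, while load_prefixes() would mirror omitting the option.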

Details

The partition file is a list of kmer prefixes, with one fewer line than the total number of output files.

The input file consists of arbitrary line-based records, with a kmer in a defined whitespace-separated column.

partitionKmers will read each input file line (record), locate the kmer within the line, determine which partition it belongs in (which kmer prefix from the partition file it is less than, if any), and write the entire line (record) to the output file whose name is the input file name followed by the suffix and a number (for example, fsplit0.kmer.part0). The number is a zero-indexed identifier of the largest prefix it is less than (if the prefixes are counted in sorted order).
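The procedure above can be sketched in Python as follows. This is a minimal sketch, not the partitionKmers implementation. It assumes plain lexicographic string comparison, assumes that a kmer goes to the bucket indexed by the first sorted prefix that is greater than it (or to the last bucket if no prefix is greater), and assumes a zero-based column index. The names bucket_index, output_name, and partition_lines are hypothetical.

```python
from bisect import bisect_right


def bucket_index(kmer, sorted_prefixes):
    """Return the zero-based bucket for kmer.

    Assumption: the bucket is the position of the first prefix that
    compares greater than the kmer, or len(sorted_prefixes) if none does.
    """
    return bisect_right(sorted_prefixes, kmer)


def output_name(input_name, suffix, index):
    """Build the output file name: input name, then suffix, then number."""
    return f"{input_name}{suffix}{index}"


def partition_lines(lines, prefixes, col=0):
    """Group whole records by the bucket of the kmer found in column col.

    Assumption: col is zero-based here; records too short to contain
    that column are skipped.
    """
    sorted_prefixes = sorted(prefixes)
    buckets = {}
    for line in lines:
        fields = line.split()
        if len(fields) <= col:
            continue
        index = bucket_index(fields[col], sorted_prefixes)
        buckets.setdefault(index, []).append(line)
    return buckets
```

With N prefixes this yields at most N + 1 buckets, which matches the statement that the partition file has one fewer line than the number of output files.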

Note

A partition file must not provide separators longer than (k/2)-1 for kmers of length k. Failing to observe this limit will likely result in kmers from the same SNP being sorted into different output files, which is likely to produce erroneous results.
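The length limit can be checked before running a partition. The Python sketch below is an illustration only: the function name oversized_prefixes is hypothetical, and k must be supplied by the caller because nothing in this page says how partitionKmers itself determines k. The comparison is written with integers so that it gives the same answer for odd and even k.

```python
def oversized_prefixes(prefixes, k):
    """Return the prefixes whose length exceeds (k/2)-1.

    len(p) <= (k/2) - 1 is equivalent to 2 * (len(p) + 1) <= k,
    which avoids fractional values when k is odd.
    """
    return [p for p in prefixes if 2 * (len(p) + 1) > k]
```

An empty result means every separator respects the limit for the given k.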

EXAMPLE

To partition the file fsplit0.kmer into fsplit0.kmer.part0 and fsplit0.kmer.part1, given the file partition.txt that consists of a single line with the value 'CACCGTACAT':

partitionKmers --partition partition.txt --suffix .part fsplit0.kmer

SEE ALSO

  • kSNP4
  • find_snps
  • guess_partitions