
Man page: find_snps

kissake edited this page Oct 21, 2022 · 2 revisions

Manual pages in the style of man(1)

find_snps (1)

NAME

find_snps - Identify all kmers that represent likely single-nucleotide polymorphisms (SNPs) between supplied genomes.

SYNOPSIS

Usage:

find_snps [-h] [--fasta-suffix FASTA_SUFFIX] [--snp-suffix SNP_SUFFIX] [--snps-all SNPS_ALL] [--debug] [--info] [--quiet] filteredGenome filteredGenome [filteredGenome ...]

DESCRIPTION

Overview

Takes a set (or matching subset) of kmers associated with two or more genomes, and produces two output files for each genome containing a list of SNPs found within that genome. One output is in a FASTA style file format (suitable for e.g. MUMmer), the other is in a SNPs style file format (suitable for concatenating with similar files).

Optionally, the program can accept a set of SNPs found earlier, presumably from different genomes, and determine if there are kmers in any supplied genome data that are polymorphic to those previously found SNPs.

Options

  • filteredGenome - Filename of a file containing canonical kmers associated with a genome. This file need not represent all of the kmers associated with the genome, as long as it is certain to contain all kmers that could belong to a given SNP (see Kmer Filtering discussion below). One way to achieve that is by ensuring all kmers with the same prefix (one that is short enough that it does not contain the central nucleotide) are in the same file.

  • --fasta-suffix FASTA_SUFFIX - This is the suffix to append to the filename provided in filteredGenome for the corresponding FASTA style output.

  • --snp-suffix SNP_SUFFIX - This is the suffix to append to the filename provided in filteredGenome for the corresponding SNPs style output.

  • --snps-all SNPS_ALL - If you have a pre-existing set of SNPs that you want to test against the supplied kmers, provide them with this argument. These SNPs should be filtered in the same way as the input kmers (see Kmer Filtering discussion below).

  • --debug - This option causes find_snps to provide verbose information about the processing of kmers into SNPs. This may help when diagnosing issues within the program, though it is not guaranteed to.

  • --info - This option causes find_snps to provide status information as it processes SNPs. This is the level that kSNP4 uses as of this writing.

  • --quiet - Report only warnings and unexpected occurrences; otherwise, produce no output on the terminal.

  • -h - Print a brief summary of program options.

Kmer Filtering / Partitioning

This program is single-threaded. However, the data to be processed can be partitioned so that this program can be run multiple times in parallel, to achieve significant performance gains and to limit resource usage.

For the program to operate correctly, all kmers that could be associated with a given SNP must be supplied to the same call to find_snps. This means that if you are partitioning the kmer data into multiple files, you must not split across the central nucleotide (e.g. AAAAAAA must be in the same run of find_snps as AAATAAA).

In particular, if you are filtering kmers into partitions or buckets based on a shared prefix (e.g. all kmers starting with AAA in the same bucket), that prefix must NOT include the central nucleotide (i.e. for the example of AAATAAA, the prefix cannot be AAAT because that would put this kmer in a different partition / bucket from AAAAAAA).
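The prefix rule above can be sketched in a few lines of Python. This is only an illustration, not part of find_snps: the function names partition_key and partition are hypothetical, kmers are treated as plain strings, and an odd kmer length (so that there is a single central nucleotide) is assumed.

```python
def partition_key(kmer: str, prefix_len: int) -> str:
    """Return the bucket name for a kmer (hypothetical helper, not part of find_snps)."""
    if len(kmer) % 2 == 0:
        raise ValueError("expected an odd-length kmer with a single central nucleotide")
    center = len(kmer) // 2  # 0-based index of the central nucleotide
    # The prefix may cover at most the bases before the central nucleotide.
    if not 0 < prefix_len <= center:
        raise ValueError(
            f"prefix length {prefix_len} would include the central nucleotide "
            f"(position {center + 1} of {len(kmer)})"
        )
    return kmer[:prefix_len]


def partition(kmers, prefix_len):
    """Group kmers into buckets keyed by their prefix."""
    buckets = {}
    for kmer in kmers:
        buckets.setdefault(partition_key(kmer, prefix_len), []).append(kmer)
    return buckets
```

With a prefix length of 3, AAAAAAA and AAATAAA both land in the AAA bucket; a prefix length of 4 is rejected because it would reach the central nucleotide.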

Resources

CPU is the typical bottleneck when running this program, but two other resources are worth discussing:

File handles

This program writes data to two files for every genome it is processing. All of these files are open and being written to at the same time. When working with a large number of genomes (typically over a hundred), this has, for some users, exceeded the maximum number of open files (file handles) allowed by the operating system.

This limit may be adjustable using a command similar to this one (which will push this problem into the range of 5000 genomes, beyond where we have tested):

ulimit -n 10240

The actual limits imposed by your operating system may well be configurable in other ways, and you are encouraged to research this issue if you expect that you might hit some of these limits.
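As a rough way to check in advance whether a run is likely to hit this limit, the arithmetic can be sketched in Python. The two-files-per-genome figure comes from the text above; the RESERVED_HANDLES allowance and all function names are assumptions made for illustration, and the resource module is available only on Unix-like systems.

```python
import resource  # standard library, Unix-only

FILES_PER_GENOME = 2   # one FASTA style and one SNPs style output per genome
RESERVED_HANDLES = 64  # assumed allowance for stdio, input files, and libraries


def handles_needed(genome_count: int) -> int:
    """Estimate how many file handles a run with this many genomes keeps open."""
    return FILES_PER_GENOME * genome_count + RESERVED_HANDLES


def max_genomes(soft_limit: int) -> int:
    """Estimate how many genomes fit under a given open-file limit."""
    return max(0, (soft_limit - RESERVED_HANDLES) // FILES_PER_GENOME)


def fits_current_limit(genome_count: int) -> bool:
    """Compare the estimate with the soft limit of the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft == resource.RLIM_INFINITY or handles_needed(genome_count) <= soft
```

Under these assumptions, max_genomes(10240) gives 5088, consistent with the "range of 5000 genomes" mentioned above.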

Memory

This program holds all potential SNPs in memory, which means that memory usage is likely proportional to the total number of unique kmers within all genome files provided on the command line.

This program will perform sub-optimally if there is not enough memory and some of the used memory is paged out to disk (swapped out).

If you observe swapping or particularly poor performance of this program, you may try using smaller kmer partitions (see Kmer Filtering above).

EXAMPLES

find_snps --info --fasta-suffix .SNPs.fasta --snp-suffix .SNPs kmers.fsplit0.part6 kmers.fsplit1.part6 kmers.fsplit2.part6 kmers.fsplit3.part6 kmers.fsplit4.part6 kmers.fsplit5.part6 kmers.fsplit6.part6 kmers.fsplit7.part6 kmers.fsplit8.part6 kmers.fsplit9.part6
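Because find_snps is single-threaded, one way to use several cores is to launch one invocation per kmer partition. The Python sketch below illustrates that idea under stated assumptions: it assumes input files follow the kmers.fsplitG.partP pattern of the example above (with G taken to be the genome index and P the partition number, which this page does not confirm), and the names build_command and run_partitions are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(part: int, genome_count: int) -> list[str]:
    """Build the find_snps command line for one partition across all genomes."""
    # Assumed naming scheme: kmers.fsplit<genome>.part<partition>
    inputs = [f"kmers.fsplit{g}.part{part}" for g in range(genome_count)]
    return ["find_snps", "--info",
            "--fasta-suffix", ".SNPs.fasta",
            "--snp-suffix", ".SNPs",
            *inputs]


def run_partitions(parts, genome_count: int, workers: int = 4) -> list[int]:
    """Run one find_snps process per partition, at most `workers` at a time."""
    commands = [build_command(p, genome_count) for p in parts]
    # Threads suffice here: each one only waits on its child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda cmd: subprocess.run(cmd, check=False), commands))
    return [r.returncode for r in results]
```

Here build_command(6, 10) reproduces the command shown above. Each concurrent invocation opens its own output files and holds its own potential SNPs in memory, so the worker count should be chosen with the limits discussed under Resources in mind.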

SEE ALSO