- Introduction
- Install
- Getting Started
- Troubleshooting and FAQs
- Other information
- Legal and Compliance Information
- Updates and Release Notes
Table of contents generated with markdown-toc
This program is created to supply a user with the ability to quickly create distance matrices of allelic profiles through the use of parallel processing. The program prints pairwise distances in a "molten" or flat format where distances are labeled as "sample1 sample2 distance". But a utility to quickly convert this molten format to a distance matrix is built into go-cluster
. Created distance matrices can be used to cluster the data and output dendrograms. Querying a reference set of profiles against a larger set of profiles in parallel is supported as well.
[Matthew Wells] : matthew.wells@phac-aspc.gc.ca
No install is currently provided, the program must be compiled using either the go
toolchain or gogcc
compiler. This program of go was built using version 1.1.8 as listed in the go.mod
file. Required dependencies are listed in the go.mod
file, however their installation is controlled by the go
toolchain. Instructions for installing go
can be found here go, Go can also likely be installed through other packaging programs like conda.
If you have go
installed, you can simply run go build
in the source directory and a binary file called go-cluster
will be generated.
go-cluster
has only been tested on linux.
The main sub-commands for the program are listed below.
Parallel Distances - A program for getting distances between allelic profiles and creating distance matrices.
Usage:
Parallel Distances [distances|convert|tree|fast-match]
Subcommands:
distances Compute all pairwise distances between the specified input profile.
convert Convert the pairwise distance generated by the program into a distance matrix.
tree Create a dendrogram from a supplied distance matrix.
fast-match Tabulate distances between a query profile and reference profiles. Only distances exceeding a threshold will be kept.
Flags:
--version Displays the program version string.
-h --help Displays help with available flag, subcommand, and positional value parameters.
distances - Compute all pairwise distances between the specified input profile.
Flags:
--version Displays the program version string.
-h --help Displays help with available flag, subcommand, and positional value parameters.
-i --input File path to your alleles profiles.
-l --load-factor This value is used to compute how many profile calculations are assigned to thread, a larger value will result in fewer threads being used. (default: 3)
-d --distance Enter an integer denoting the distance function you would like to use:
Hamming Distance skipping missing values: 0
Hamming distance missing values treated as alleles.: 1
Scaled Distance skipping missing values: 2
Scaled distance missing values treated as alleles.: 3 (default: 0)
-o --output Name of output file. If nothing is specified results will be sent to stdout.
-b --buffer-size The default buffer size is: 16384. Larger buffers may increase performance. (default: 16384)
-c --column-delimiter Column delimiter, default value is a tab character (default: )
-m --missing-allele-character String denoting missing alleles. (default: 0)
convert - Convert the pairwise distance generated by the program into a distance matrix.
Flags:
--version Displays the program version string.
-h --help Displays help with available flag, subcommand, and positional value parameters.
-i --input File path to a previously generated output for conversion into a distance matrix.
-o --output Name of output file. If nothing is specified results will be sent to stdout.
tree - Create a dendrogram from a supplied distance matrix.
Flags:
--version Displays the program version string.
-h --help Displays help with available flag, subcommand, and positional value parameters.
-i --input File path to previously generate distance matrix.
-o --output Name of output file. If nothing is specified results will be sent to stdout.
-l --linkage-method Please enter an integer corresponding to one of the linkage method of your choice: average: 0 centroid: 1 complete: 2 mcquitty: 3 median: 4 single: 5 ward: 6 (default: 0)
fast-match - Tabulate distances between a query profile and reference profiles. Only distances exceeding a threshold will be kept.
Flags:
--version Displays the program version string.
-h --help Displays help with available flag, subcommand, and positional value parameters.
-i --input File path to profiles for querying.
-r --reference File path to reference profiles to query against.
-c --column-delimiter Column delimiter (default: )
-m --missing-allele-character String denoting missing alleles. (default: 0)
-d --distance Enter an integer denoting the distance function you would like to use:
Hamming Distance: 0
Hamming distance skipping missing values: 1
Scaled Distance: 2
Scaled distance skipping missing values: 3 (default: 0)
-t --threshold Threshold for matching alleles. (default: 10.00)
-o --output Name of output file. If nothing is specified results will be sent to stdout.
-l --goroutine-limit Limit the number of goroutines run at one time. Default: 100 (default: 100)
The input file is a table with the first column containing a given samples name. While the subsequent columns are the names of the input loci. The program normalizes allele input first, so the actual content of the cells can vary and can be, integers, hashes or even nucleotides themselves (this would make the program run slower however). CSV files can be used, however you will have to specify the the delimiter to the command line arguments
Sample | Loci_1 | Loci_2 | Loci_3 |
---|---|---|---|
Sample1 | Loci_x | 5 | 1 |
Sample2 | 1 | xxxxxxxxxxxx | 1 |
Sample3 | 2 | 3 | 1 |
The pairwise distances output from the distances
option is the input to the distances
command. But if you are curious the output looks something like:
sample1 sample2 distance
sample1 sample3 distance
...
the output is a distance symmetric distance matrix.
The input is a symmetric distance matrix e.g.
1 2 3
1 0 40 40
2 40 0 40
3 40 40 0
and the output is a newick file.
The input for fast matching is two allelic profiles similar to what is passed when creating a distance matrix. There are some checks in the program to verify that you have the same number of columns but it is IMPORTANT That your alleles are in the same order for both of the compared profiles.
This program utilizes go routines
for parallel processing which allows for more threads than logical CPUs. Currently you cannot set the maximum number of CPUs to use in this program as go routines
will be distributed across CPUs at run time.
This can be an issue on grid executors like slurm
and it is best to run this program on an entire node of a cluster at a time. If you face this issue it is best to run the process on an entire node e.g. the -w
flag in slurm to select a node and adding the --exclusive
flag can solve this issue.
This program is still in development however basic end-to-end tests have been added. More code examples, benchmarks and unit-tests are still to be added. A lot of further optimizations can still be made to this program as well.
Copyright Government of Canada [2024]
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Please see the CHANGELOG.md
.