A lightweight, fast DBSCAN implementation for clustering peptide strings. The distance calculations and clustering are written in pure C and wrapped in Python.
Note: as implemented, the software assumes all sequences have the same length.
pip3 install fast_dbscan
git clone https://github.com/harmslab/fast_dbscan
cd fast_dbscan
sudo python3 setup.py install
This will install a convenience program called fast_dbscan in the path. This can be invoked on the command line:
fast_dbscan filename epsilon [dl]
where filename is a file that contains sequences of identical length, with one per line, epsilon is the neighborhood distance cutoff (see below), and the optional argument dl says to use the Damerau-Levenshtein distance function rather than the simple distance function.
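Because the program assumes every sequence has the same length, it can help to check an input file before clustering. The following is a minimal sketch and is not part of fast_dbscan; the helper name check_equal_length is hypothetical, and skipping blank lines is an assumption about how the file should be treated.

```python
from pathlib import Path


def check_equal_length(path):
    """Return the sequences in `path`; raise ValueError if lengths differ.

    Hypothetical helper, not part of fast_dbscan. Blank lines are skipped.
    """
    lines = Path(path).read_text().splitlines()
    sequences = [line.strip() for line in lines if line.strip()]

    # A set of lengths with more than one member means mixed lengths.
    lengths = {len(seq) for seq in sequences}
    if len(lengths) > 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return sequences
```

Calling check_equal_length on the sequence file before passing it to fast_dbscan would surface a malformed line early.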
import fast_dbscan
d = fast_dbscan.DBScanWrapper(distance_function='dl')
d.read_file(file_with_sequences)
d.run(epsilon=1,min_neighbors=12)
# Dictionary keying cluster id to sequences
clusters = d.results
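Since the results are described as a dictionary keying cluster id to sequences, a quick size summary is often the first thing to look at. This is a rough sketch that works on any dictionary of that shape; the function name summarize_clusters and the example data are made up for illustration and do not come from fast_dbscan.

```python
def summarize_clusters(results):
    """Return (cluster_id, size) pairs, largest cluster first.

    `results` is assumed to map a cluster id to a list of sequences.
    """
    sizes = [(cluster_id, len(seqs)) for cluster_id, seqs in results.items()]
    # Sort by size descending, then by cluster id for a stable order.
    return sorted(sizes, key=lambda pair: (-pair[1], pair[0]))


# Made-up data with the same shape as the results dictionary.
example = {0: ["ACDEF", "ACDEG"], 1: ["WWYYW", "WWYYF", "WWYYH"]}
for cluster_id, size in summarize_clusters(example):
    print(cluster_id, size)
```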
simple: add up entries in a distance matrix based on the identities of the letters at each column in the alignment. Currently, the software uses Hamming distance. This could easily be modified to use other matrices, provided distances can be calculated as integers. The matrix is populated in DBScanWrapper.__init__.
dl: Damerau-Levenshtein distance, allowing deletion, insertion, substitution, and transposition.
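To make the difference between the two distance functions concrete, here is a pure-Python sketch of each. It is illustrative only and is not the C code used by fast_dbscan. The second function implements the restricted "optimal string alignment" variant of Damerau-Levenshtein; whether the package uses that variant or the unrestricted one is an assumption here.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For "ACDEF" and "ACDFE", the Hamming distance counts two mismatched columns, while the transposition-aware distance counts a single swap of adjacent residues.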
epsilon: the maximum distance between two samples for them to be considered within the same neighborhood.
min_neighbors: the minimum number of sequence neighbors required to define a cluster.
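The way these two parameters interact can be sketched with a small pure-Python DBSCAN. This is an illustration of the general algorithm, not the package's C implementation. It assumes that a neighbor is any other sequence at distance less than or equal to epsilon, that a sequence does not count as its own neighbor, and that noise is reported under the key -1; fast_dbscan may make different choices on each point.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def dbscan(sequences, epsilon, min_neighbors, distance=hamming):
    """Cluster sequences; return {cluster_id: [sequences]}, with noise under -1."""
    n = len(sequences)
    # Precompute each sequence's neighborhood (O(n^2) distance calls).
    neighbors = [
        [j for j in range(n)
         if j != i and distance(sequences[i], sequences[j]) <= epsilon]
        for i in range(n)
    ]

    labels = [None] * n  # None = unvisited, -1 = noise
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_neighbors:
            labels[i] = -1  # not a core point; may be claimed as a border later
            continue
        # i is a core point: start a new cluster and grow it outward.
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_neighbors:
                queue.extend(neighbors[j])  # j is also core: keep expanding
        cluster_id += 1

    clusters = {}
    for seq, label in zip(sequences, labels):
        clusters.setdefault(label, []).append(seq)
    return clusters
```

Raising epsilon widens each neighborhood and tends to merge clusters, while raising min_neighbors makes core points rarer and pushes more sequences into noise.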