Skip to content

Commit

Permalink
Add max_distance (#124)
Browse files Browse the repository at this point in the history
- distance functions have a `max_distance` argument
- default value is max value for that type of int (i.e. unchanged)
- CPU distance implementations now return early if the distance reached max_distance
  - for max_distance < 255 we do more intermediate summing of distances to allow early return -> slight reduction in performance if no early return
- add benchmarks and tests
- bump version
- resolves #117
  • Loading branch information
lkeegan authored Feb 16, 2024
1 parent 5919c48 commit 743dc90
Show file tree
Hide file tree
Showing 31 changed files with 460 additions and 278 deletions.
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ repos:
- id: mixed-line-ending

- repo: https://github.com/psf/black
rev: 23.12.1
rev: 24.2.0
hooks:
- id: black

Expand All @@ -29,7 +29,7 @@ repos:
- id: prettier

- repo: https://github.com/python-jsonschema/check-jsonschema
rev: 0.27.3
rev: 0.28.0
hooks:
- id: check-github-workflows
- id: check-readthedocs
Expand Down
2 changes: 1 addition & 1 deletion CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ cmake_minimum_required(VERSION 3.23..3.27)

project(
hammingdist
VERSION 1.0.0
VERSION 1.1.0
LANGUAGES CXX)

include(CTest)
Expand Down
10 changes: 5 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -89,10 +89,10 @@ import hammingdist
sequence_indices = hammingdist.fasta_sequence_indices(fasta_file)
```

## Large distance values
## Maximum distance values

By default, the elements in the distances matrix returned by `hammingdist.from_fasta` have a maximum value of 255.

You can also set a smaller maximum value using the `max_distance` argument.
For distances larger than this `hammingdist.from_fasta_large` supports distances up to 65535 (but uses twice as much RAM)

## Distances from reference sequence
Expand Down Expand Up @@ -122,12 +122,12 @@ On linux hammingdist is built with OpenMP (multithreading) support, and will aut
## CUDA on linux

On linux hammingdist is also built with CUDA (Nvidia GPU) support.
To use the GPU instead of the CPU, set `use_gpu=True` when calling `from_fasta`:
To use the GPU instead of the CPU, set `use_gpu=True` when calling `from_fasta`. Here we also set the maximum distance to 2:

```python
import hammingdist

data = hammingdist.from_fasta("example.fasta", use_gpu=True)
data = hammingdist.from_fasta("example.fasta", use_gpu=True, max_distance=2)
```

Additionally, the lower triangular matrix file can now be directly constructed from the fasta file
Expand All @@ -138,7 +138,7 @@ which means it requires less RAM and runs faster.
```python
import hammingdist

hammingdist.from_fasta_to_lower_triangular('input_fasta.txt', 'output_lower_triangular.txt', use_gpu=True)
hammingdist.from_fasta_to_lower_triangular('input_fasta.txt', 'output_lower_triangular.txt', use_gpu=True, max_distance=2)
```

![overview](plots/speed.png)
Expand Down
1 change: 1 addition & 0 deletions ext/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
set(FMT_INSTALL OFF)
add_subdirectory(fmt)

if(HAMMING_BUILD_PYTHON)
Expand Down
4 changes: 3 additions & 1 deletion include/hamming/distance_avx2.hh
Original file line number Diff line number Diff line change
@@ -1,13 +1,15 @@
#pragma once

#include <cstdint>
#include <limits>
#include <vector>

#include "hamming/hamming_impl_types.hh"

namespace hamming {

int distance_avx2(const std::vector<GeneBlock> &a,
const std::vector<GeneBlock> &b);
const std::vector<GeneBlock> &b,
int max_dist = std::numeric_limits<int>::max());

}
4 changes: 3 additions & 1 deletion include/hamming/distance_avx512.hh
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,12 @@
#include <vector>

#include "hamming/hamming_impl_types.hh"
#include <limits>

namespace hamming {

int distance_avx512(const std::vector<GeneBlock> &a,
const std::vector<GeneBlock> &b);
const std::vector<GeneBlock> &b,
int max_dist = std::numeric_limits<int>::max());

}
12 changes: 8 additions & 4 deletions include/hamming/distance_cuda.hh
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

#include <cstdint>
#include <iostream>
#include <limits>
#include <vector>

#include "hamming/hamming_impl_types.hh"
Expand All @@ -11,17 +12,20 @@ namespace hamming {
bool distance_cuda_have_device();

int distance_cuda(const std::vector<GeneBlock> &a,
const std::vector<GeneBlock> &b);
const std::vector<GeneBlock> &b,
int max_dist = std::numeric_limits<int>::max());

// for now explicit function def for each choice of integer type
std::vector<uint8_t>
distances_cuda_8bit(const std::vector<std::vector<GeneBlock>> &data);
distances_cuda_8bit(const std::vector<std::vector<GeneBlock>> &data,
uint8_t max_dist = std::numeric_limits<uint8_t>::max());

std::vector<uint16_t>
distances_cuda_16bit(const std::vector<std::vector<GeneBlock>> &data);
distances_cuda_16bit(const std::vector<std::vector<GeneBlock>> &data,
uint16_t max_dist = std::numeric_limits<uint16_t>::max());

void distances_cuda_to_lower_triangular(
const std::vector<std::vector<GeneBlock>> &data,
const std::string &filename);
const std::string &filename, int max_distance);

} // namespace hamming
4 changes: 3 additions & 1 deletion include/hamming/distance_neon.hh
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,12 @@
#include <vector>

#include "hamming/hamming_impl_types.hh"
#include <limits>

namespace hamming {

int distance_neon(const std::vector<GeneBlock> &a,
const std::vector<GeneBlock> &b);
const std::vector<GeneBlock> &b,
int max_dist = std::numeric_limits<int>::max());

}
4 changes: 3 additions & 1 deletion include/hamming/distance_sse2.hh
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,12 @@
#include <vector>

#include "hamming/hamming_impl_types.hh"
#include <limits>

namespace hamming {

int distance_sse2(const std::vector<GeneBlock> &a,
const std::vector<GeneBlock> &b);
const std::vector<GeneBlock> &b,
int max_dist = std::numeric_limits<int>::max());

}
33 changes: 19 additions & 14 deletions include/hamming/hamming.hh
Original file line number Diff line number Diff line change
Expand Up @@ -19,10 +19,12 @@ template <typename DistIntType> struct DataSet {
explicit DataSet(std::vector<std::string> &data, bool include_x = false,
bool clear_input_data = false,
std::vector<std::size_t> &&indices = {},
bool use_gpu = false)
bool use_gpu = false,
int max_distance = std::numeric_limits<int>::max())
: nsamples(data.size()), sequence_indices(std::move(indices)) {
validate_data(data);
result = distances<DistIntType>(data, include_x, clear_input_data, use_gpu);
result = distances<DistIntType>(data, include_x, clear_input_data, use_gpu,
max_distance);
}

explicit DataSet(const std::string &filename) {
Expand Down Expand Up @@ -99,9 +101,10 @@ template <typename DistIntType> struct DataSet {
std::vector<std::size_t> sequence_indices{};
};

DataSet<DefaultDistIntType> from_stringlist(std::vector<std::string> &data,
bool include_x = false,
bool use_gpu = false);
DataSet<DefaultDistIntType>
from_stringlist(std::vector<std::string> &data, bool include_x = false,
bool use_gpu = false,
int max_distance = std::numeric_limits<int>::max());

DataSet<DefaultDistIntType> from_csv(const std::string &filename);

Expand All @@ -122,19 +125,21 @@ DataSet<DistIntType> from_lower_triangular(const std::string &filename) {
}

template <typename DistIntType>
DataSet<DistIntType> from_fasta(const std::string &filename,
bool include_x = false,
bool remove_duplicates = false,
std::size_t n = 0, bool use_gpu = false) {
DataSet<DistIntType>
from_fasta(const std::string &filename, bool include_x = false,
bool remove_duplicates = false, std::size_t n = 0,
bool use_gpu = false,
int max_distance = std::numeric_limits<int>::max()) {
auto [data, sequence_indices] = read_fasta(filename, remove_duplicates, n);
return DataSet<DistIntType>(data, include_x, true,
std::move(sequence_indices), use_gpu);
std::move(sequence_indices), use_gpu,
max_distance);
}

void from_fasta_to_lower_triangular(const std::string &input_filename,
const std::string &output_filename,
bool remove_duplicates = false,
std::size_t n = 0, bool use_gpu = false);
void from_fasta_to_lower_triangular(
const std::string &input_filename, const std::string &output_filename,
bool remove_duplicates = false, std::size_t n = 0, bool use_gpu = false,
int max_distance = std::numeric_limits<int>::max());

ReferenceDistIntType distance(const std::string &seq0, const std::string &seq1,
bool include_x = false);
Expand Down
37 changes: 22 additions & 15 deletions include/hamming/hamming_impl.hh
Original file line number Diff line number Diff line change
Expand Up @@ -28,25 +28,30 @@ inline bool cuda_gpu_available() {
}

typedef int (*distance_func_ptr)(const std::vector<GeneBlock> &,
const std::vector<GeneBlock> &);
const std::vector<GeneBlock> &, int);

distance_func_ptr get_fastest_supported_distance_func();

std::array<GeneBlock, 256> lookupTable(bool include_x = false);

template <typename DistIntType> DistIntType safe_int_cast(int x) {
if (x > std::numeric_limits<DistIntType>::max()) {
return std::numeric_limits<DistIntType>::max();
template <typename DistIntType>
DistIntType
safe_int_cast(int x,
DistIntType max_x = std::numeric_limits<DistIntType>::max()) {
if (x > max_x) {
return max_x;
}
return static_cast<DistIntType>(x);
}

void validate_data(const std::vector<std::string> &data);

int distance_sparse(const SparseData &a, const SparseData &b);
int distance_sparse(const SparseData &a, const SparseData &b,
int max_dist = std::numeric_limits<int>::max());

int distance_cpp(const std::vector<GeneBlock> &a,
const std::vector<GeneBlock> &b);
const std::vector<GeneBlock> &b,
int max_dist = std::numeric_limits<int>::max());

std::vector<SparseData> to_sparse_data(const std::vector<std::string> &data,
bool include_x);
Expand All @@ -63,8 +68,9 @@ std::vector<GeneBlock> from_string(const std::string &str);
template <typename DistIntType>
std::vector<DistIntType> distances(std::vector<std::string> &data,
bool include_x, bool clear_input_data,
bool use_gpu) {
bool use_gpu, int max_distance) {
std::vector<DistIntType> result((data.size() - 1) * data.size() / 2, 0);
auto max_dist = safe_int_cast<DistIntType>(max_distance);
auto start_time = std::chrono::high_resolution_clock::now();
auto print_timing = [&start_time](const std::string &event,
bool final = false) {
Expand Down Expand Up @@ -118,13 +124,14 @@ std::vector<DistIntType> distances(std::vector<std::string> &data,
}
print_timing("pre-processing");
#ifdef HAMMING_WITH_OPENMP
#pragma omp parallel for default(none) shared(result, sparse, nsamples)
#pragma omp parallel for default(none) \
shared(result, sparse, nsamples, max_dist)
#endif
for (std::size_t i = 0; i < nsamples; ++i) {
std::size_t offset{i * (i - 1) / 2};
for (std::size_t j = 0; j < i; ++j) {
result[offset + j] =
safe_int_cast<DistIntType>(distance_sparse(sparse[i], sparse[j]));
result[offset + j] = safe_int_cast<DistIntType>(
distance_sparse(sparse[i], sparse[j], max_dist));
}
}
print_timing("distance calculation", true);
Expand All @@ -142,9 +149,9 @@ std::vector<DistIntType> distances(std::vector<std::string> &data,
std::cout << "# hammingdist :: Using GPU..." << std::endl;
print_timing("pre-processing");
if constexpr (sizeof(DistIntType) == 1) {
return distances_cuda_8bit(dense);
return distances_cuda_8bit(dense, max_dist);
} else if constexpr (sizeof(DistIntType) == 2) {
return distances_cuda_16bit(dense);
return distances_cuda_16bit(dense, max_dist);
} else {
throw std::runtime_error("No GPU implementation available");
}
Expand All @@ -155,13 +162,13 @@ std::vector<DistIntType> distances(std::vector<std::string> &data,
print_timing("pre-processing");
#ifdef HAMMING_WITH_OPENMP
#pragma omp parallel for schedule(static, 1) default(none) \
shared(result, dense, nsamples, distance_func)
shared(result, dense, nsamples, distance_func, max_dist)
#endif
for (std::size_t i = 0; i < nsamples; ++i) {
std::size_t offset{i * (i - 1) / 2};
for (std::size_t j = 0; j < i; ++j) {
result[offset + j] =
safe_int_cast<DistIntType>(distance_func(dense[i], dense[j]));
result[offset + j] = safe_int_cast<DistIntType>(
distance_func(dense[i], dense[j], max_dist));
}
}
print_timing("distance calculation", true);
Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ build-backend = "scikit_build_core.build"

[project]
name = "hammingdist"
version = "1.0.0"
version = "1.1.0"
description = "A fast tool to calculate Hamming distances"
readme = "README.md"
license = {text = "MIT"}
Expand Down
15 changes: 9 additions & 6 deletions python/hammingdist.cc
Original file line number Diff line number Diff line change
Expand Up @@ -60,25 +60,28 @@ PYBIND11_MODULE(hammingdist, m) {
m.def("from_fasta", &from_fasta<uint8_t>, py::arg("filename"),
py::arg("include_x") = false, py::arg("remove_duplicates") = false,
py::arg("n") = 0, py::arg("use_gpu") = false,
py::arg("max_distance") = 255,
"Creates a dataset by reading from a fasta file (assuming all "
"sequences have equal length). Maximum value of an element in the "
"distances matrix: 255. Distances that would have been larger than "
"this value saturate at 255 - to support genomes with larger distances "
"than this see `from_fasta_large` instead.");
"distances matrix: max_distance or 255, whichever is lower."
"Distances that would have been larger than "
"this value instead saturate at this value - to support genomes with "
"larger distances than this see `from_fasta_large` instead.");
m.def("from_fasta_large", &from_fasta<uint16_t>, py::arg("filename"),
py::arg("include_x") = false, py::arg("remove_duplicates") = false,
py::arg("n") = 0, py::arg("use_gpu") = false,
py::arg("max_distance") = 65535,
"Creates a dataset by reading from a fasta file (assuming all "
"sequences have equal length). Maximum value of an element in the "
"distances matrix: 65535");
"distances matrix: max_distance or 65535, whichever is lower");
m.def("from_fasta_to_lower_triangular", &from_fasta_to_lower_triangular,
py::arg("fasta_filename"), py::arg("output_filename"),
py::arg("remove_duplicates") = false, py::arg("n") = 0,
py::arg("use_gpu") = true,
py::arg("use_gpu") = true, py::arg("max_distance") = 65535,
"Construct lower triangular distances matrix output file from the "
"fasta file,"
"requires an NVIDIA GPU. Maximum value of an element in "
"the distances matrix: 65535");
"the distances matrix: max_distance or 65535, whichever is lower");
m.def("from_lower_triangular", &from_lower_triangular<uint8_t>,
"Creates a dataset by reading already computed distances from lower "
"triangular format. Maximum value of an element in the distances "
Expand Down
Loading

0 comments on commit 743dc90

Please sign in to comment.