Skip to content
This repository has been archived by the owner on Jan 16, 2023. It is now read-only.

Latest commit

 

History

History
101 lines (77 loc) · 3.33 KB

README.md

File metadata and controls

101 lines (77 loc) · 3.33 KB

ExLSH logo

ExLSH

Build Status

Calculates a locality sensitive hash for text.

Locality-sensitive hashing is a technique for dimensionality reduction. Its properties guarantee similar output vectors for similar inputs. It can be used for clustering and near-duplicate detection. This implementation is targeted for natural language as input. It takes a String of arbitrary length and outputs a vector encoded as :binary.

If you want to learn more about why and how we built ExLSH, read our blog post Locality-sensitive Hashing in Elixir.

Installation

Add ex_lsh to your list of dependencies in mix.exs:

def deps do
  [
    {:ex_lsh, version: "~> 0.4"}
  ]
end

Usage

"Lorem ipsum dolor sit amet"
|> ExLSH.lsh()
|> Base.encode64()

Docs

see hexdocs.pm/ex_lsh

Contributions

Please fork the project and submit a PR.

Credits

  • SimHash is a very similar, but less versatile implementation that is focused on short strings only. ExLSH is approximately 7 times faster and supports arbitrary tokenization, shingling and hash functions.
  • SpiritFingers is ca. 27 times faster than ExLSH but relies on a NIF that needs the full Rust toolchain to compile. SpiritFingers doesn't support customization of the algorithm, it uses SipHash by default.
  • Resemblance explores simhash and sketching in Ruby. The author has documented his findings in a series of articles. You may want to make yourself familiar with Part 3: The SimHash Algorithm.
  • Near-duplicate detection is a very helpful article by Moz. It explains core concepts such as tokenization, shingling, MinHash, SimHash, etc.

Benchmark

Benchmark against SimHash, run with Benchee. See the setup on the benchmark branch.

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.8.1
Erlang 21.2.5

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 21 s


Benchmarking ExLSH...
Benchmarking Simhash...
Benchmarking SpiritFingers...

Name                    ips        average  deviation         median         99th %
SpiritFingers       8556.15       0.117 ms    ±13.37%       0.111 ms       0.183 ms
ExLSH                309.61        3.23 ms     ±5.88%        3.19 ms        3.81 ms
Simhash               43.19       23.15 ms    ±12.57%       22.08 ms       30.54 ms

Comparison:
SpiritFingers       8556.15
ExLSH                309.61 - 27.64x slower
Simhash               43.19 - 198.11x slower