Cython implementation of minimum_hash is 3x faster

I know this is a reference implementation, but it's the one that uses the `iscc` package name on PyPI and as such will be the most used.  Thus, here's a big performance suggestion.

The bottleneck in hashing long texts is minimum_hash.  This is a much faster version.  Compile with

> CFLAGS="-O3" cythonize -a -i utils.pyx

This is the code:
```
# cython: language_level=3

from cython.view cimport array

MINHASH_PERMUTATIONS = [
    # Copy values from iscc/const.py
]

cpdef unsigned int[:] cython_minimum_hash(unsigned int[:] features, int n=64):
    cdef unsigned long max_int64 = (1 << 64) - 1
    cdef unsigned long mersenne_prime = (1 << 61) - 1
    cdef unsigned long max_hash = (1 << 32) - 1
    cdef unsigned long a, b, min_val
    cdef unsigned int[:] result = array(shape=(n,), itemsize=sizeof(unsigned int), format="I")

    cdef int i, j
    for i in range(n):
        a, b = MINHASH_PERMUTATIONS[i]        
        min_val = max_int64
        for j in range(features.shape[0]):
            v = ((a * features[j] + b) & max_int64) % mersenne_prime
            min_val = min(min_val, v & max_hash)
        result[i] = <unsigned int>min_val
    return result
```

You can also use setup.py to build the module:
```
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "utils",
        ["utils.pyx"],
        extra_compile_args=['-O3'],
        language='c',
    )
]

setup(
    ext_modules = cythonize(
        ext_modules,
        compiler_directives={
            'language_level' : '3',
            'linetrace': False,
            'binding': False
        }
    )
)
```

Until it's integrated, I monkey patch it like so:
```
from utils import cython_minimum_hash

def fast_minimum_hash(generator, n=64):
    array = np.array(list(generator), dtype=np.uint32)
    result_fast = cython_minimum_hash(array, n)
    return result_fast

iscc.iscc.minimum_hash = fast_minimum_hash
```

The last function is a good place to insert comparison of the results with the slow version. I hashed hundreds of long documents to check they are identical.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cython implementation of minimum_hash is 3x faster #98

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Cython implementation of minimum_hash is 3x faster #98

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions