-
Notifications
You must be signed in to change notification settings - Fork 8
Open
Description
I know this is a reference implementation, but it's the one that uses the iscc package name on PyPI and as such will be the most used. Thus, here's a big performance suggestion.
The bottleneck in hashing long texts is minimum_hash. This is a much faster version. Compile with
CFLAGS="-O3" cythonize -a -i utils.pyx
This is the code:
# cython: language_level=3
from cython.view cimport array
MINHASH_PERMUTATIONS = [
# Copy values from iscc/const.py
]
cpdef unsigned int[:] cython_minimum_hash(unsigned int[:] features, int n=64):
cdef unsigned long max_int64 = (1 << 64) - 1
cdef unsigned long mersenne_prime = (1 << 61) - 1
cdef unsigned long max_hash = (1 << 32) - 1
cdef unsigned long a, b, min_val
cdef unsigned int[:] result = array(shape=(n,), itemsize=sizeof(unsigned int), format="I")
cdef int i, j
for i in range(n):
a, b = MINHASH_PERMUTATIONS[i]
min_val = max_int64
for j in range(features.shape[0]):
v = ((a * features[j] + b) & max_int64) % mersenne_prime
min_val = min(min_val, v & max_hash)
result[i] = <unsigned int>min_val
return result
You can also use setup.py to build the module:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
ext_modules = [
Extension(
"utils",
["utils.pyx"],
extra_compile_args=['-O3'],
language='c',
)
]
setup(
ext_modules = cythonize(
ext_modules,
compiler_directives={
'language_level' : '3',
'linetrace': False,
'binding': False
}
)
)
Until it's integrated, I monkey patch it like so:
from utils import cython_minimum_hash
def fast_minimum_hash(generator, n=64):
array = np.array(list(generator), dtype=np.uint32)
result_fast = cython_minimum_hash(array, n)
return result_fast
iscc.iscc.minimum_hash = fast_minimum_hash
The last function is a good place to insert comparison of the results with the slow version. I hashed hundreds of long documents to check they are identical.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels