📝 Update docs and set version to 0.18.0
hardbyte committed Apr 24, 2023
1 parent b5b169c commit d606c9a
Showing 5 changed files with 34 additions and 23 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,12 @@
## new version

+## 0.18.0
+
+- Performance improvements by caching hashes of tokens. #664
+- Switch to using `blakeHash` for benchmarking. #664
+- Remove implicit dependency on `setuptools`. #663
+- Migrate to pyproject.toml for dependency management and packaging. #659

## 0.17.0

- Remove use of bitarray fork as upstream project now publishes wheels. #557, #567, #573
18 changes: 14 additions & 4 deletions README.md
@@ -1,16 +1,22 @@
# CLK Hash

-Python implementation of cryptographic longterm key hashing as described by Rainer Schnell, Tobias Bachteler, and Jörg Reiher in
-[A Novel Error-Tolerant Anonymous Linking Code](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3549247).
+<p align="center">
+<img alt="Clkhash Logo" src="./docs/_static/logo.svg" width="250" />
+</p>

-`clkhash` supports Python versions 3.6+
+<div align="center">

[![codecov](https://codecov.io/gh/data61/clkhash/branch/master/graph/badge.svg)](https://codecov.io/gh/data61/clkhash)
[![Documentation Status](https://readthedocs.org/projects/clkhash/badge/?version=latest)](http://clkhash.readthedocs.io/en/latest/?badge=latest)
[![Unit Testing](https://github.com/data61/clkhash/actions/workflows/unittests.yml/badge.svg)](https://github.com/data61/clkhash/actions/workflows/unittests.yml)
[![Typechecking](https://github.com/data61/clkhash/actions/workflows/typechecking.yml/badge.svg)](https://github.com/data61/clkhash/actions/workflows/typechecking.yml)
[![Downloads](https://pepy.tech/badge/clkhash)](https://pepy.tech/project/clkhash)

+</div>
+
+**clkhash** is a Python implementation of cryptographic linkage key hashing as described by _Rainer Schnell, Tobias Bachteler, and Jörg Reiher_ in
+[A Novel Error-Tolerant Anonymous Linking Code](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3549247).
+
## Installation

Install clkhash with all dependencies using pip:
@@ -23,7 +29,7 @@ Install clkhash with all dependencies using pip:
[https://clkhash.readthedocs.io](https://clkhash.readthedocs.io/en/latest/)


-## clkhash api
+## Python API

To hash a CSV file of entities using the default schema:

@@ -33,6 +39,10 @@ fake_pii_schema = randomnames.NameList.SCHEMA
clks = clk.generate_clk_from_csv(open('fake-pii-out.csv','r'), 'secret', fake_pii_schema)
```

+## Command Line Interface
+
+See [Anonlink Client](https://github.com/data61/anonlink-client) for a command line interface to clkhash.
+
## Citing

Clkhash, and the wider Anonlink project is designed, developed and supported by
26 changes: 10 additions & 16 deletions clkhash/bloomfilter.py
@@ -37,10 +37,7 @@ def double_hash_encode_ngrams(ngrams: Iterable[str],
) -> bitarray:
""" Computes the double hash encoding of the ngrams with the given keys.
-Using the method from:
-Schnell, R., Bachteler, T., & Reiher, J. (2011).
-A Novel Error-Tolerant Anonymous Linking Code.
-http://grlc.german-microsimulation.de/wp-content/uploads/2017/05/downloadwp-grlc-2011-02.pdf
+Using the method from [Schnell2011]_.
:param ngrams: list of n-grams to be encoded
:param keys: hmac secret keys for md5 and sha1 as bytes
@@ -60,7 +57,7 @@ def double_hash_encode_ngrams_non_singular(ngrams: Iterable[str],
l: int,
encoding: str
) -> bitarray:
-""" computes the double hash encoding of the n-grams with the given keys.
+""" Computes the double hash encoding of the n-grams with the given keys.
The original construction of [Schnell2011]_ displays an abnormality for
certain inputs:
@@ -108,7 +105,7 @@ def _double_hash_encode_ngrams(ngrams: Tuple[str, ...],
ks: Tuple[int, ...],
l: int,
encoding: str,
-non_singular
+non_singular: bool
) -> bitarray:
key_sha1, key_md5 = keys
bf = bitarray(l)
@@ -117,9 +114,9 @@ def _double_hash_encode_ngrams(ngrams: Tuple[str, ...],
for m, k in zip(ngrams, ks):
m_bytes = m.encode(encoding=encoding)
if non_singular:
-md5hm, sha1hm = _double_hash_token_non_singular(m.encode(encoding=encoding), l, key_sha1, key_md5)
+md5hm, sha1hm = _double_hash_token_non_singular(m_bytes, l, key_sha1, key_md5)
else:
-md5hm, sha1hm = _double_hash_token(m.encode(encoding=encoding), l, key_sha1, key_md5)
+md5hm, sha1hm = _double_hash_token(m_bytes, l, key_sha1, key_md5)
for i in range(k):
gi = (sha1hm + i * md5hm) % l
bf[gi] = 1
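
The loop above sets bit `(sha1hm + i * md5hm) % l` for each `i` in `range(k)`. As a standalone illustration of that index formula, here is a minimal sketch; `double_hash_positions` is a hypothetical name, and reading each HMAC digest as a big-endian integer reduced modulo `l` is an assumption rather than a statement about what `_double_hash_token` does internally.

```python
import hashlib
import hmac
from typing import List


def double_hash_positions(token: bytes, k: int, l: int,
                          key_sha1: bytes, key_md5: bytes) -> List[int]:
    """Illustrative only: bit positions (h1 + i * h2) mod l for i in 0..k-1."""
    # Assumption: each digest is interpreted as a big-endian integer, then reduced mod l.
    sha1hm = int.from_bytes(hmac.new(key_sha1, token, hashlib.sha1).digest(), "big") % l
    md5hm = int.from_bytes(hmac.new(key_md5, token, hashlib.md5).digest(), "big") % l
    return [(sha1hm + i * md5hm) % l for i in range(k)]


positions = double_hash_positions(b"ab", k=5, l=1024,
                                  key_sha1=b"key-one", key_md5=b"key-two")
# Consecutive positions differ by the same step (mod l).
step = (positions[1] - positions[0]) % 1024
```

In this sketch every position is determined by the pair `(sha1hm, md5hm)`, each of which lies in `range(l)`, so a token can map to at most `l * l` distinct position sequences. If `md5hm` happens to be 0, all `k` positions coincide and only one bit is set.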
@@ -160,11 +157,9 @@ def blake_encode_ngrams(ngrams: Iterable[str],
) -> bitarray:
""" Computes the encoding of the ngrams using the BLAKE2 hash function.
-We deliberately do not use the double hashing scheme as proposed in [
-Schnell2011]_, because this
-would introduce an exploitable structure into the Bloom filter. For more
-details on the
-weakness, see [Kroll2015]_.
+We deliberately do not use the double hashing scheme as proposed in
+[Schnell2011]_, because this would introduce an exploitable structure
+into the Bloom filter. For more details on the weakness, see [Kroll2015]_.
In short, the double hashing scheme only allows for :math:`l^2`
different encodings for any possible n-gram,
@@ -318,13 +313,12 @@ def crypto_bloom_filter(record: Sequence[str],
) -> Tuple[bitarray, str, int]:
""" Computes the composite Bloom filter encoding of a record.
-Using the method from
-http://www.record-linkage.de/-download=wp-grlc-2011-02.pdf
+Based on the method from [Schnell2011]_.
:param record: plaintext record tuple. E.g. (index, name, dob, gender)
:param comparators: A list of comparators. They provide a 'tokenize' function to turn string into
appropriate tokens.
-:param schema: Schema
+:param schema: The Linkage Schema describing how to encode plaintext identifiers.
:param keys: Keys for the hash functions as a tuple of lists of bytes.
:return: 3-tuple:
4 changes: 2 additions & 2 deletions clkhash/validate_data.py
@@ -12,8 +12,8 @@
class EntryError(ValueError):
""" An entry is invalid.
"""
-row_index = None # type: Optional[int]
-field_spec = None # type: Optional[FieldSpec]
+row_index: Optional[int] = None
+field_spec: Optional[FieldSpec] = None
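
This hunk swaps `# type:` comments for variable annotations. One practical difference is that annotations are recorded on the class and can be inspected at runtime, as the sketch below suggests. `EntryErrorSketch` is a stand-in written for this example, and `str` replaces `FieldSpec` so the snippet runs without clkhash installed.

```python
from typing import Optional, get_type_hints


class EntryErrorSketch(ValueError):
    """Stand-in for the class in the diff; FieldSpec is replaced by str here."""
    row_index: Optional[int] = None
    field_spec: Optional[str] = None


# Annotations are stored on the class, unlike `# type:` comments.
hints = get_type_hints(EntryErrorSketch)

# The class-level default stays None; instances can override it.
err = EntryErrorSketch("bad entry")
err.row_index = 3
```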


class FormatError(ValueError):
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "clkhash"
-version = "0.17.1"
+version = "0.18.0"
description = "Encoding utility to create Cryptographic Linkage Keys"
license = "Apache"
authors = ["Brian Thorne", "Wilko Henecka", "Guillaume Smith"]