With the ff1 mask, it is possible to mask data by encryption while preserving its original format.
FF1 is a format-preserving encryption (FPE) block cipher algorithm recommended by NIST [1].
As of March 2021, FF1 is the only suitable FPE algorithm [2].
The motivation for an FPE mask in PIMO is to meet the requirement of re-identification of the original data (pseudonymisation) as defined by the GDPR in Europe [3].
Consider the following stream of objects to mask:
{"siret": "01234567891234"}
{"siret": "12345678912340"}
{"siret": "23456789123401"}
The siret column is always a 14-digit string. It can be masked by FPE with the following configuration.
version: "1"
masking:
  - selector:
      jsonpath: "siret"
    mask:
      # use of the FF1 mask
      ff1:
        # radix 10 specifies that only the 10 digits are used in the output format
        radix: 10
        # name of the environment variable containing the base64-encoded secret key (note: key length must be 128, 192, or 256 bits)
        keyFromEnv: "FF1_ENCRYPTION_KEY"
Here is the result of applying the above configuration.
NOTE
All command lines are listed in demo.sh.
$ # we first need to set the secret key to use with the proper variable name and encoding
$ export FF1_ENCRYPTION_KEY=$(echo -n "secret12secret12" | base64)
$ cat data.jsonl | pimo
{"siret":"96415668837614"}
{"siret":"94015424363597"}
{"siret":"31043158804356"}
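The demo key above is a readable 16-byte string, which is convenient for reproducing the output. For real data, a random key of one of the accepted lengths is probably a better choice. The following is a minimal sketch, assuming a POSIX shell, `/dev/urandom`, and a `base64` utility that accepts `-d` for decoding; the helper names `ff1_new_key` and `ff1_key_bits` are illustrative and not part of PIMO.

```shell
# Hypothetical helpers, not part of PIMO.

# Print a base64-encoded key made of N random bytes (default 32).
# Use 16, 24 or 32 bytes to get a 128, 192 or 256 bit key.
ff1_new_key() {
    head -c "${1:-32}" /dev/urandom | base64 | tr -d '\n'
}

# Print the size, in bits, of the base64-encoded key given as first argument.
ff1_key_bits() {
    bytes=$(printf '%s' "$1" | base64 -d | wc -c)
    echo $((bytes * 8))
}

FF1_ENCRYPTION_KEY=$(ff1_new_key 32)
export FF1_ENCRYPTION_KEY
ff1_key_bits "$FF1_ENCRYPTION_KEY"   # prints 256
```

Checking the decoded size before running the mask helps catch a key that was truncated or encoded twice.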
FF1 uses a fixed domain definition (the list of all characters allowed in an encrypted output string):
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
The radix parameter determines which part of the domain definition is actually used. For example, a radix of 10 produces values containing only digits (the first 10 characters of the full domain definition).
Therefore, the value of radix must be at least 2 and at most 62: values of 0 and 1 are invalid.
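The relation between the radix and the characters it selects can be sketched in a few lines of shell. This only illustrates the rule stated above and is not PIMO's implementation; `FF1_DOMAIN` and `ff1_alphabet` are hypothetical names.

```shell
# Hypothetical helper, not part of PIMO: print the characters selected by a radix.
FF1_DOMAIN='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

ff1_alphabet() {
    radix=$1
    # Reject values below 2 or above 62, following the rule stated above.
    if [ "$radix" -lt 2 ] || [ "$radix" -gt 62 ]; then
        echo "invalid radix: $radix" >&2
        return 1
    fi
    # Keep the first $radix characters of the full domain definition.
    printf '%s\n' "$FF1_DOMAIN" | cut -c "1-$radix"
}

ff1_alphabet 10   # 0123456789
ff1_alphabet 16   # 0123456789abcdef
ff1_alphabet 36   # 0123456789abcdefghijklmnopqrstuvwxyz
```

With this ordering, a radix of 16 selects the lowercase hexadecimal digits, and uppercase letters only become available above a radix of 36.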
To re-identify the original data, use the same mask as in the encryption example, but enable the decrypt option:
version: "1"
masking:
  - selector:
      jsonpath: "siret"
    mask:
      # use of the FF1 mask
      ff1:
        # important: use the same radix parameter as for encryption (values will be decrypted incorrectly if 11 is used for example)
        radix: 10
        # use the same secret key as for encryption
        keyFromEnv: "FF1_ENCRYPTION_KEY"
        # activate decryption
        decrypt: true
Here is the result of applying the above configuration on the encrypted stream.
$ # the same key is re-used for encryption and decryption
$ export FF1_ENCRYPTION_KEY=$(echo -n "secret12secret12" | base64)
$ # the encrypted stream is generated by the same command line as before : cat data.jsonl | pimo
$ # the decryption is done by the second part : pimo -c masking-decrypt.yml
$ cat data.jsonl | pimo | pimo -c masking-decrypt.yml
{"siret":"01234567891234"}
{"siret":"12345678912340"}
{"siret":"23456789123401"}
The tweak is an optional parameter that reduces the attack surface by using a value that varies from record to record. It can be seen as an extension of the secret key that changes on each record but does not necessarily have to be kept secret.
Note that re-identification (decryption) requires every tweak, and each tweak must be matched to exactly the same record as in the encryption step.
Note also that with a random tweak on each record, collisions can occur in the output stream (this is not the case when no tweak is used or when the tweak is constant). Such collisions in the masked data are not a problem for re-identifying the original data, however, since each record is decrypted with its own tweak.
The tweaks can already be present in the data, or can be generated by PIMO as in the following example:
version: "1"
masking:
  # add a tweakfield on each record of the jsonl stream
  - selector:
      jsonpath: "tweakfield"
    mask:
      add: ""
  # give the tweakfield an 8-character random value
  - selector:
      jsonpath: "tweakfield"
    mask:
      regex: "[a-zA-Z0-9]{8}"
  - selector:
      jsonpath: "siret"
    mask:
      ff1:
        radix: 10
        keyFromEnv: "FF1_ENCRYPTION_KEY"
        # FF1 will use the value of the tweakfield column as a tweak parameter
        tweakField: "tweakfield"
Here is the result of applying the above configuration to the original stream.
$ # the same key is re-used for encryption and decryption
$ export FF1_ENCRYPTION_KEY=$(echo -n "secret12secret12" | base64)
$ cat data.jsonl | pimo -c masking-tweak.yml
{"siret":"19309267052199","tweakfield":"gY6SpkUA"}
{"siret":"84107001872814","tweakfield":"l3rIYUkm"}
{"siret":"26786954568342","tweakfield":"P9k0XCRk"}
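To re-identify records encrypted with a tweak, the decryption step presumably needs both the `decrypt` option and the same `tweakField`, with the tweak column still present in the stream. The sketch below assumes that the two options can be combined in a single `ff1` mask (the examples here show each one separately); the function name and the file name `masking-tweak-decrypt.yml` are illustrative.

```shell
# Hypothetical helper: print a decryption configuration that reuses the tweak column.
# Assumption: `decrypt` and `tweakField` can be combined in one ff1 mask.
ff1_tweak_decrypt_config() {
    cat <<'EOF'
version: "1"
masking:
  - selector:
      jsonpath: "siret"
    mask:
      ff1:
        radix: 10
        keyFromEnv: "FF1_ENCRYPTION_KEY"
        # same tweak column as the one used for encryption
        tweakField: "tweakfield"
        # activate decryption
        decrypt: true
EOF
}

# Possible usage, with pimo installed and the same key exported:
#   ff1_tweak_decrypt_config > masking-tweak-decrypt.yml
#   cat data.jsonl | pimo -c masking-tweak.yml | pimo -c masking-tweak-decrypt.yml
```

Because the tweak travels with each record, the pairing between tweak and record required for decryption is preserved without any extra bookkeeping.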
To be considered secure, the domain size of the cipher must be at least 1 000 000 [4].
The domain size is given by the following formula:
$ds = \text{radix}^{\text{len}}$
With:
- radix: the radix chosen by configuration in the mask definition
- len: the (minimum) length of the data to encrypt
Applied to the current examples, the domain size is:
$ds = 10^{14} = 100\,000\,000\,000\,000$
which is far above the required minimum.
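The same rule can be turned around to find the shortest value that a given radix can safely encrypt. A minimal sketch, using only integer arithmetic in a POSIX shell; `ff1_min_len` is a hypothetical helper, not a PIMO command.

```shell
# Hypothetical helper, not part of PIMO: smallest length such that
# radix^len reaches the minimum domain size of 1 000 000.
ff1_min_len() {
    radix=$1
    # A radix below 2 can never reach the minimum domain size.
    [ "$radix" -ge 2 ] || return 1
    len=0
    ds=1
    while [ "$ds" -lt 1000000 ]; do
        ds=$((ds * radix))
        len=$((len + 1))
    done
    echo "$len"
}

ff1_min_len 10   # 6  (10^6 = 1 000 000)
ff1_min_len 36   # 4  (36^4 = 1 679 616)
ff1_min_len 62   # 4  (62^4 = 14 776 336)
```

With a radix of 10, any value of 6 digits or more meets the minimum, so the 14-digit siret column is well within the safe range.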
[1] “Recommendation for Block Cipher Modes of Operation: Methods for Format-Preserving Encryption”, Morris Dworkin, NIST, March 2016.
[2] “Recent Cryptanalysis of FF3”, NIST website, 12 April 2017.
[3] “Pseudonymization”, Wikipedia, Wikimedia Foundation, 26 January 2021.
[4] “Methods for Format-Preserving Encryption: NIST Requests Public Comments on Draft Special Publication 800-38G Revision 1”, NIST website, 28 February 2019.