[ML] Use hashing for categorical data #2199

valeriy42 · 2022-02-07T12:56:05Z

The model inference definition can potentially reveal personally identifiable information (PII) used in the categorical encoding maps. This is usually not a problem, since the access permissions required to review model definitions are the same as those required to review the training datasets in which the PII occurs.

However, there is no reason to store the original categorical strings in the model. For the learning algorithm, it is sufficient to use a distinct representation of each category produced by a cryptographic hash function.

Note that the encodings need to be unique only within the same feature, which reduces the requirements placed on the hash function.
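To illustrate the idea, here is a minimal Python sketch (not the ml-cpp implementation) that replaces the category strings of a single feature with truncated SHA-256 digests and checks that they stay unique within that feature. The function names `hash_category` and `encode_feature` and the 8-byte truncation length are assumptions made for this example only.

```python
import hashlib


def hash_category(value: str, num_bytes: int = 8) -> str:
    """Return the SHA-256 digest of `value` as hex, truncated to `num_bytes` bytes."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return digest[:num_bytes].hex()


def encode_feature(categories: list[str], num_bytes: int = 8) -> dict[str, str]:
    """Map each distinct category of ONE feature to its hashed representation.

    Raises ValueError if two distinct categories collide after truncation,
    because encodings must be unique within a feature.
    """
    encoding: dict[str, str] = {}
    seen_hashes: set[str] = set()
    for category in categories:
        if category in encoding:
            continue  # already encoded this category
        hashed = hash_category(category, num_bytes)
        if hashed in seen_hashes:
            # Report only the hash so the error itself does not leak PII.
            raise ValueError(f"hash collision within feature: {hashed}")
        seen_hashes.add(hashed)
        encoding[category] = hashed
    return encoding
```

With such a scheme, the model definition would only need to keep the hashed values. At inference time an incoming category would be hashed the same way before the lookup. Because uniqueness is only required per feature, a truncated digest may be enough, which is why the sketch checks for collisions explicitly.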

valeriy42 · 2022-02-07T14:21:39Z

Both Google and Facebook use cryptographic hash functions (CHF) to encode PII. Google suggests using SHA-256 with a salt, while Facebook does not explicitly mention which CHF algorithm it uses.
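A minimal Python sketch of the "SHA-256 plus salt" approach follows. It assumes the salt is simply prepended to the UTF-8 encoded value. The exact construction recommended by Google is not reproduced here, and the names `make_salt` and `salted_sha256` are hypothetical.

```python
import hashlib
import secrets


def make_salt(num_bytes: int = 16) -> bytes:
    """Generate a random salt (assumption: 16 bytes is sufficient here)."""
    return secrets.token_bytes(num_bytes)


def salted_sha256(value: str, salt: bytes) -> str:
    """Return the hex SHA-256 digest of salt || value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()
```

The same salt has to be available at training and at inference time, so that equal category strings always map to equal digests.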

valeriy42 · 2022-02-07T15:35:26Z

One could use a header-only C++ library for SHA-256 generation.

valeriy42 added :ml/DataFrameAnalysis >feature labels Feb 7, 2022
