[ML] Use hashing for categorical data #2199

Open
valeriy42 opened this issue Feb 7, 2022 · 2 comments

valeriy42 commented Feb 7, 2022

A model's inference definition can reveal personally identifiable information (PII), because its categorical encoding maps store the original category strings. This is usually not a problem, since the permissions required to view a model definition are the same as those required to view the training dataset in which the PII occurred.

However, there is no reason to store the original categorical strings in the model. For the learning algorithm, it is sufficient to represent each category by a distinct value, such as the output of a cryptographic hash function.

Note that the encodings need to be unique only within the same feature, which relaxes the requirements on the hash function.
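
To make the idea concrete, here is a minimal Python sketch of an encoding map keyed by hashes instead of raw strings. It is an illustration only, not the project's implementation: the function names, the 8-byte truncation, and the `default` fallback are all assumptions. Truncation is only tolerable because uniqueness is needed per feature, and the sketch checks for collisions while building the map.

```python
import hashlib

DIGEST_BYTES = 8  # assumed truncation length; not specified in the issue


def hash_category(feature: str, category: str) -> str:
    """Return a truncated SHA-256 hex digest of a category, scoped to its feature."""
    # The feature name and a NUL separator scope the hash to one feature.
    payload = feature.encode("utf-8") + b"\x00" + category.encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[: 2 * DIGEST_BYTES]


def build_hashed_encoding(feature: str, encoding: dict) -> dict:
    """Replace the category strings in an encoding map with their hashes."""
    hashed = {}
    for category, value in encoding.items():
        key = hash_category(feature, category)
        if key in hashed:
            # Uniqueness is only required within this one feature.
            raise ValueError(f"hash collision within feature {feature!r}")
        hashed[key] = value
    return hashed


def encode(feature: str, category: str, hashed_encoding: dict, default: float = 0.0) -> float:
    """Look up the encoded value of a raw category at inference time."""
    return hashed_encoding.get(hash_category(feature, category), default)


if __name__ == "__main__":
    raw = {"alice@example.com": 0.25, "bob@example.com": 0.75}
    hashed = build_hashed_encoding("email", raw)
    print(hashed)  # keys are hex digests, not e-mail addresses
    print(encode("email", "alice@example.com", hashed))
```

At inference time the incoming string is hashed the same way before the lookup, so the stored model never needs the original category strings. Note that this sketch is unsalted.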

valeriy42 commented

Both Google and Facebook use cryptographic hash functions (CHFs) to encode PII. Google suggests using SHA-256 with a salt, while Facebook does not explicitly name the CHF algorithm.
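
As a rough illustration of SHA-256 with a salt, the Python sketch below prepends a random salt to the value before hashing. The helper names and the 16-byte salt length are assumptions made for this example, and it does not reproduce either company's exact procedure. How the salt would be stored or shared so that inference can reproduce the hashes is not addressed in this thread.

```python
import hashlib
import secrets


def make_salt(num_bytes: int = 16) -> bytes:
    """Generate a random salt (the 16-byte length is an assumption)."""
    return secrets.token_bytes(num_bytes)


def salted_sha256(salt: bytes, value: str) -> str:
    """Return the SHA-256 hex digest of the salt followed by the UTF-8 value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    salt = make_salt()
    # The same salt and value always give the same digest, so lookups stay consistent.
    print(salted_sha256(salt, "alice@example.com"))
```

With a salt, the same category string maps to different digests in models that use different salts, which makes precomputed lookup tables of common values less useful.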

valeriy42 commented

One could use a header-only C++ library to compute the SHA-256 hashes.
