Differentiating between privacy-sensitive and anonymous data #49

Open
blootsvoets opened this issue Apr 10, 2017 · 6 comments

Comments

@blootsvoets
Collaborator

We could consider collecting privacy-sensitive data but encrypting it before sending it to the server. The decryption keys could be provided to researchers who are allowed to access those data (for example, using PEP). Non-sensitive data could still be sent in plain text, as we do now, so that the Kafka streams can aggregate it properly. With the PEP mechanism, the non-sensitive data could also be encrypted, with the Kafka streams getting a key for only those data.

@blootsvoets
Collaborator Author

blootsvoets commented Apr 10, 2017

I read through the PEP paper; it is based on a new encryption algorithm. Implementing it would be a complete infrastructure effort. The issue remains: everything that does not have to be shown in the dashboard or retrieved through the API could be encrypted, which would mainly cover the privacy-sensitive data. In all cases, we would have to trust the researchers to handle their private keys with care, though. Another note: Tresorit, for example, has a nice key exchange algorithm as well. It does not encrypt the data itself with those keys; instead, it encrypts the encryption keys for the data. That makes the data encryption less heavy, and no re-encryption of the data is needed if keys change. In any case, we'd have to properly follow their protocol (or another well-documented protocol) to avoid the usual pitfalls of encryption.
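To make the "encrypt the encryption keys" idea concrete, here is a minimal sketch of the general pattern, often called envelope encryption or key wrapping. It is not Tresorit's actual protocol. Everything in it is an assumption for illustration: the function names `seal`, `open_envelope` and `rewrap` are hypothetical, the researcher key is treated as a symmetric secret (in practice it would more likely be a public/private key pair), and `toy_cipher` is a stand-in stream cipher built from the standard library that must not be used for real data. A vetted authenticated cipher would take its place.

```python
import hashlib
import hmac
import secrets


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with an HMAC-SHA256 keystream; the same call also decrypts.

    Placeholder only (unauthenticated, home-made): swap in a vetted AEAD cipher.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        block = nonce + counter.to_bytes(8, "big")
        stream += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(data: bytes, researcher_key: bytes) -> dict:
    """Encrypt data under a fresh random data key, then wrap that data key."""
    data_key = secrets.token_bytes(32)
    data_nonce = secrets.token_bytes(16)
    wrap_nonce = secrets.token_bytes(16)
    return {
        "data_nonce": data_nonce,
        "ciphertext": toy_cipher(data_key, data_nonce, data),
        "wrap_nonce": wrap_nonce,
        "wrapped_key": toy_cipher(researcher_key, wrap_nonce, data_key),
    }


def open_envelope(envelope: dict, researcher_key: bytes) -> bytes:
    """Unwrap the data key with the researcher key, then decrypt the data."""
    data_key = toy_cipher(researcher_key, envelope["wrap_nonce"], envelope["wrapped_key"])
    return toy_cipher(data_key, envelope["data_nonce"], envelope["ciphertext"])


def rewrap(envelope: dict, old_key: bytes, new_key: bytes) -> dict:
    """Change the researcher key: only the small wrapped data key is redone."""
    data_key = toy_cipher(old_key, envelope["wrap_nonce"], envelope["wrapped_key"])
    wrap_nonce = secrets.token_bytes(16)
    return {
        **envelope,
        "wrap_nonce": wrap_nonce,
        "wrapped_key": toy_cipher(new_key, wrap_nonce, data_key),
    }
```

In this sketch `rewrap` never touches `ciphertext`, which is the property mentioned above: when a researcher key changes, only the 32-byte data key is re-encrypted, not the stored measurements.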

@fnobilia
Contributor

If we encrypt data using a key that is unknown to the Platform, we cannot run any analysis on that data. What kind of data would you encrypt with this method?

@blootsvoets
Collaborator Author

blootsvoets commented Apr 10, 2017

Exactly, that's the point. For example, absolute locations, IP addresses and unprocessed voice data are privacy-sensitive, but we could choose to store them in encrypted form. The platform would not be able to read or process them; it would just provide them as-is in the full data extracted from HDFS. Less sensitive data, such as battery levels, we'd send unencrypted so our platform can process it. We could decide on a per-stream basis whether we want the data encrypted or unencrypted. Also, we could choose to always leave the record keys unencrypted (anonymous patient ID) and encrypt only the values.
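A rough sketch of that per-stream choice, under stated assumptions: the topic names in `ENCRYPTED_STREAMS`, the `Record` type and the `prepare` function are all hypothetical and not taken from the platform's code, and the actual cipher is passed in as a callable so that the routing logic stays independent of whichever encryption scheme ends up being chosen.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical topic names; the real stream names are not given in this thread.
ENCRYPTED_STREAMS = {"phone_location", "phone_ip_address", "phone_voice_raw"}


@dataclass(frozen=True)
class Record:
    topic: str
    key: bytes    # anonymous patient ID, always left readable
    value: bytes  # measurement payload


def prepare(record: Record, encrypt: Callable[[bytes], bytes]) -> Record:
    """Encrypt only the value of sensitive streams; pass other records through."""
    if record.topic in ENCRYPTED_STREAMS:
        return Record(record.topic, record.key, encrypt(record.value))
    return record
```

Because the record key is never encrypted, records from an encrypted stream could still be grouped per patient, while their values would stay opaque to the platform until a researcher decrypts them.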

@blootsvoets
Collaborator Author

Another alternative is to do the data processing on another "trusted" host, where we would provide the decryption key as well. Right now, I don't think we have the budget + motivation to have this additional infrastructure cost though.

@fnobilia
Contributor

The vast majority of collected variables are privacy-sensitive (HR, Acc, etc.). We can absolutely design something to provide this functionality as well, but we should bring WP8 into the discussion or wait for a clear need/requirement.

@blootsvoets
Collaborator Author

As long as the HR and Acc are not coupled to a specific person, I'd consider them anonymised data, which would be fine to process if we don't know the identity. However, something like absolute location can be used to find someone's home and from that their identity. The same goes for voice recognition and IP addresses.

2 participants