Provenance Privacy
We assume a decentralized setting, where each host collects and stores provenance metadata that describes its activity. Responses to remote provenance queries may contain privacy-sensitive information. SPADE provides three types of primitives to aid in preserving privacy: sanitization, encryption, and differential privacy.
Privacy preservation through sanitization performs an irreversible transformation of query response graphs. Annotations on vertices and edges can be elided based on the sanitization level specified and the annotation-specific schemes configured.
The Sanitization transformer uses three levels of sanitization: low, medium, and high. To use the transformer, run the following command in SPADE's control client:
add transformer Sanitization sanitizationLevel={low,medium,high}
The sanitizationLevel defines the extent of sanitization performed. The configuration file spade.transformer.Sanitization.config defines the annotation-specific operations. Here is a sample file:
low
cwd,fsgid,fsuid,sgid,suid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime]
medium
command line,uid,gid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime],size
high
name,euid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime],operation
Each section of the file starts with a line containing a sanitizationLevel (low, medium, or high). Subsequent lines can contain comma-separated lists of annotations. Each annotation can optionally be followed by the name of a function (in the Sanitization transformer) in square brackets: <annotation_name>[sanitizationHandler]. If present, this function is used to provide custom handling of the associated value.
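The file layout described above can be read with a few lines of code. The following is a minimal Python sketch, not SPADE's own parser: the function name parse_level_config and the returned dictionary shape are assumptions made for illustration.

```python
import re

LEVELS = ("low", "medium", "high")
# An entry is an annotation name, optionally followed by "[handlerName]".
ENTRY = re.compile(r"^(?P<name>[^\[\]]+?)(?:\[(?P<handler>\w+)\])?$")


def parse_level_config(text):
    """Return {level: {annotation: handler_or_None}} (hypothetical helper)."""
    config = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line in LEVELS:
            # A level name opens a new section.
            current = line
            config[current] = {}
            continue
        if current is None:
            raise ValueError(f"annotation list before any level: {line!r}")
        for entry in line.split(","):
            match = ENTRY.match(entry.strip())
            if match is None:
                raise ValueError(f"malformed entry: {entry!r}")
            config[current][match["name"].strip()] = match["handler"]
    return config


SAMPLE = """low
cwd,fsgid,fsuid,sgid,suid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime]
medium
command line,uid,gid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime],size
high
name,euid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime],operation
"""

config = parse_level_config(SAMPLE)
print(config["low"]["remote address"])
print(config["medium"]["size"])
```

Annotations without a handler map to None, which a caller could treat as "elide the value entirely".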
Illustrative strategies for sanitizing composite annotations are described below.
Privacy preservation through encryption performs reversible transformations of the response graphs. Data is encrypted using ciphertext-policy attribute-based encryption (CP-ABE). We assume an administrator has issued each host a set of credentials, corresponding to their attributes. An encryption policy is framed as a Boolean expression over these attributes. Hosts with attributes that satisfy the expression are able to decrypt elements encrypted with the policy.
Each host is assumed to have received a subset of low, medium, and high attributes. The ABE transformer encrypts the value associated with each annotation using a CP-ABE policy. The spade.transformer.ABE.config configuration specifies the mapping between a policy and the annotations handled by it. When a host receives a response with elements that have been encrypted, the SPADE instance will transparently perform decryption (assuming it has sufficient attributes).
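To make the notion of a policy as a Boolean expression over attributes concrete, here is a minimal Python sketch that checks whether a host's attribute set satisfies a policy. It models only the access decision and performs no cryptography; in CP-ABE the check is enforced by the ciphertext itself. The nested-tuple policy representation and the function name satisfies are assumptions made for illustration, not OpenABE's or SPADE's API.

```python
def satisfies(policy, attributes):
    """Evaluate a policy against a set of attribute names.

    A policy is either an attribute name (str) or a tuple of the form
    ("and" | "or", sub_policy, sub_policy, ...). Hypothetical representation.
    """
    if isinstance(policy, str):
        return policy in attributes
    op, *operands = policy
    results = [satisfies(p, attributes) for p in operands]
    if op == "and":
        return all(results)
    if op == "or":
        return any(results)
    raise ValueError(f"unknown operator: {op!r}")


# A host holding only "low" cannot satisfy a policy that needs medium or high.
policy = ("or", "medium", "high")
print(satisfies(policy, {"low"}))
print(satisfies(policy, {"low", "high"}))
```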
In the sample configuration file below, 'keysDirectory' contains the master public key for encryption and the credentials (i.e., secret keys) used for decryption. (This is described further below.) Subsequent lines can contain comma-separated lists of annotations. Each annotation can optionally be followed by the name of a function (in the ABE transformer) in square brackets: <annotation_name>[encryptionHandler]. If present, this function is used to provide custom handling of the associated value.
keysDirectory=cfg/keys/attributes
low
cwd,fsgid,fsuid,sgid,suid,remote address[EncryptedIPAddress],path[EncryptedPath],time[EncryptedTime]
medium
command line,uid,gid,remote address[EncryptedIPAddress],path[EncryptedPath],time[EncryptedTime],size
high
name,euid,remote address[EncryptedIPAddress],path[EncryptedPath],time[EncryptedTime],operation
SPADE uses OpenABE. The toolkit must be downloaded and installed before the following steps can be completed.
- The administrator should generate the OpenABE master key pair for the Ciphertext-Policy ABE algorithm.
- The administrator should generate a set of credentials for each host's set of attributes. (The attributes correspond to the level of encryption in the scheme described above.)
- The administrator should send the master public key and the host's credentials to each host.
Detailed steps are available in the OpenABE documentation.
To use the transformer to encrypt a host's query responses, run this in the SPADE control client on the host:
add transformer ABE
Below are sample strategies for sanitizing / encrypting composite annotations.
remote address (xxx.xxx.xxx.xxx)
- low: the second octet is sanitized / encrypted.
- medium: the third octet is sanitized / encrypted.
- high: the fourth octet is sanitized / encrypted.
path (w/x/y/z/...)
- low: path after the first level is sanitized / encrypted.
- medium: path after the second level is sanitized / encrypted.
- high: path after the third level is sanitized / encrypted.
time (yyyy-MM-dd HH:mm:ss.SSS)
- low: day is sanitized / encrypted.
- medium: hour is sanitized / encrypted.
- high: minute, second and millisecond are sanitized / encrypted.
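The sanitization variant of the strategies above can be sketched in Python as follows. This is an illustration under stated assumptions, not SPADE's implementation: the function names, the "<sanitized>" placeholder, and the literal reading that each level masks only the component named for it are all assumptions. The encryption variant would substitute ciphertext for the placeholder.

```python
PLACEHOLDER = "<sanitized>"  # hypothetical marker, not SPADE's actual output


def sanitize_ip(address, level):
    """Mask one octet of a dotted-quad address: low=2nd, medium=3rd, high=4th."""
    octets = address.split(".")
    if len(octets) != 4:
        raise ValueError(f"expected xxx.xxx.xxx.xxx, got {address!r}")
    index = {"low": 1, "medium": 2, "high": 3}[level]
    octets[index] = PLACEHOLDER
    return ".".join(octets)


def sanitize_path(path, level):
    """Keep the first 1, 2, or 3 path levels and mask everything after them."""
    keep = {"low": 1, "medium": 2, "high": 3}[level]
    parts = path.strip("/").split("/")
    if len(parts) <= keep:
        return path  # nothing lies beyond the retained levels
    prefix = "/" if path.startswith("/") else ""
    return prefix + "/".join(parts[:keep] + [PLACEHOLDER])


def sanitize_time(timestamp, level):
    """Mask fields of a 'yyyy-MM-dd HH:mm:ss.SSS' string according to level."""
    date, clock = timestamp.split(" ")
    year, month, day = date.split("-")
    hour, minute, seconds = clock.split(":")  # seconds carries the milliseconds
    if level == "low":
        day = PLACEHOLDER
    elif level == "medium":
        hour = PLACEHOLDER
    elif level == "high":
        minute, seconds = PLACEHOLDER, PLACEHOLDER
    else:
        raise ValueError(f"unknown level: {level!r}")
    return f"{year}-{month}-{day} {hour}:{minute}:{seconds}"


print(sanitize_ip("10.20.30.40", "low"))
print(sanitize_path("/usr/lib/python3/os.py", "medium"))
print(sanitize_time("2021-03-04 05:06:07.089", "high"))
```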
Differential privacy is a mechanism for sharing abstracted query responses from a database without disclosing information about individual records. With differential privacy, aggregate database information is returned with the addition of statistical noise. The aim is to provide useful but privacy-preserving information to the querier. Foundational work on ε-differential privacy provides a mathematical definition of the mechanism.
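For reference, the standard definition of ε-differential privacy can be written as below. The symbols are notation introduced here rather than taken from SPADE: M is the randomized query mechanism, D and D' are any two databases that differ in a single record, and S is any set of possible outputs.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
\quad \text{for all neighboring } D, D' \text{ and all } S \subseteq \mathrm{Range}(\mathcal{M})
```

A smaller ε forces the two output distributions closer together, which means stronger privacy at the cost of noisier answers.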
The QuickGrail query surface in SPADE allows users to send four types of aggregate queries. These are: (i) mean, (ii) standard deviation, (iii) histogram, and (iv) distribution queries. The histogram query shows the count of each unique value of a specified annotation key. The distribution query is similar but instead automatically determines the range of values associated with the specified key, creates a specified number of sub-ranges (partitions), and reports the counts in each sub-range (partition).
To send an aggregate query in SPADE's query client, use the stat command as follows:
stat <vertex | edge> <annotation name> <aggregate type> [<additional arguments>] <graph variable>
The possible values for <aggregate type> are mean, std, histogram, and distribution. For example, the mean file size may be of interest. Given a graph variable $files, assume each file vertex in it has a filesize annotation. This query can then be used:
stat vertex filesize mean $files
As another example, the number of processes owned by each user may be of interest. Given a variable $processes, this query can be used:
stat vertex owner histogram $processes
A distribution is an abstraction of a histogram where the values are grouped together into a specified number of partitions. Assume the time annotation on edges reports how long an operation took. Given an $operations variable, a distribution with 5 partitions can be computed with:
stat edge time distribution 5 $operations
Each partition in this example covers one fifth of the range of time annotation values. Specifically, the range from the minimum to the maximum value of the time annotation is split into 5 equal sub-ranges. The count for a partition is the number of edges whose time annotation has a value in the corresponding sub-range. Because the sub-ranges have equal width rather than equal counts, partitions can hold very different numbers of edges.
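The difference between the histogram and distribution aggregates can be illustrated with a short Python sketch. It follows the description above (counts per unique value, and counts per equal-width sub-range), but the function names and the handling of edge cases such as the maximum value are assumptions, not SPADE's code.

```python
from collections import Counter


def histogram(values):
    """Count how many times each unique annotation value occurs."""
    return dict(Counter(values))


def distribution(values, partitions):
    """Count values in `partitions` equal-width sub-ranges of [min, max]."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    counts = [0] * partitions
    if not values:
        return counts
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / partitions
    for value in values:
        if width == 0:
            index = 0  # all values identical: everything falls in one partition
        else:
            # The maximum value is assigned to the last partition.
            index = min(int((value - lowest) / width), partitions - 1)
        counts[index] += 1
    return counts


print(histogram(["alice", "bob", "alice"]))
print(distribution([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5))
print(distribution([1, 1, 1, 1, 100], 5))
```

The last call shows a skewed input: four of the five values land in the first partition, since partitions are defined by value range and not by rank.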
SPADE enables the result of the above aggregate queries to be made differentially private. The implementation uses Google's open-source differential-privacy library.
Differential privacy for aggregate queries is controlled through SPADE's configuration. To enable it, set the epsilon value to the desired level of privacy in cfg/spade.core.AbstractAnalyzer.config. To disable it, set epsilon to -1.
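To show how epsilon influences a result, the following Python sketch adds Laplace noise to a count, a common way to obtain ε-differential privacy for counting queries. It is an illustration only: SPADE relies on Google's differential-privacy library, which takes precautions around noise generation that this naive sampler does not. The function name noisy_count and its parameters are hypothetical; only the convention that -1 disables noise comes from the configuration described above.

```python
import random


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Return true_count plus Laplace noise of scale sensitivity / epsilon.

    epsilon == -1 disables noise, mirroring the configuration convention.
    """
    if epsilon == -1:
        return float(true_count)
    if epsilon <= 0:
        raise ValueError("epsilon must be positive, or -1 to disable noise")
    scale = sensitivity / epsilon
    # The difference of two independent exponential samples with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


rng = random.Random(0)  # seeded only to make the demonstration repeatable
print(noisy_count(42, -1))
print(noisy_count(42, 0.5, rng=rng))
```

Smaller epsilon values give a larger noise scale, so individual answers deviate more from the true count while the average over many answers stays close to it.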
This material is based upon work supported by the National Science Foundation under Grants OCI-0722068, IIS-1116414, and ACI-1547467. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.