Provenance Privacy

We assume a decentralized setting, where each host collects and stores provenance metadata that describes its activity. Responses to remote provenance queries may contain privacy-sensitive information. SPADE provides three types of primitives to aid in preserving privacy: sanitization, encryption, and differential privacy.

Sanitization

Privacy preservation through sanitization performs an irreversible transformation of query response graphs. Annotations on vertices and edges can be elided based on the sanitization level specified and the annotation-specific schemes configured.

The Sanitization transformer supports three levels of sanitization: low, medium, and high. To use the transformer, run the following command in SPADE's control client:

add transformer Sanitization sanitizationLevel={low,medium,high}

The sanitizationLevel defines the extent of sanitization performed. The configuration file spade.transformer.Sanitization.config defines the annotation-specific operations. Here is a sample file:

low
cwd,fsgid,fsuid,sgid,suid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime]

medium
command line,uid,gid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime],size

high
name,euid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime],operation

Each block in the file begins with a line that names a sanitizationLevel. The lines that follow contain comma-separated lists of the annotations handled at that level. Each annotation can optionally be followed by the name of a function (in the Sanitization transformer) in square brackets: <annotation_name>[sanitizationHandler]. If present, this function is used to provide custom handling of the associated value.
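The file layout described above can be read with a few lines of code. The following is a minimal sketch, not SPADE's own parser: the function name parse_sanitization_config and the returned dictionary shape are assumptions made for illustration.

```python
import re

# Hypothetical helper (not part of SPADE): parses text laid out like the
# sample file into {level: [(annotation, handler_or_None), ...]}.
LEVELS = {"low", "medium", "high"}
# An entry is an annotation name, optionally followed by [handlerName].
ENTRY = re.compile(r"^(?P<name>[^\[\]]+?)(?:\[(?P<handler>\w+)\])?$")


def parse_sanitization_config(text):
    config = {}
    level = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines separate the blocks
        if line in LEVELS:
            level = line
            config.setdefault(level, [])
            continue
        if level is None:
            raise ValueError(f"annotation list before any level: {line!r}")
        for item in line.split(","):
            match = ENTRY.match(item.strip())
            if match is None:
                raise ValueError(f"malformed entry: {item!r}")
            config[level].append(
                (match.group("name").strip(), match.group("handler"))
            )
    return config


sample = """low
cwd,fsgid,remote address[sanitizeIpAddress]

high
name,time[sanitizeTime]
"""
parsed = parse_sanitization_config(sample)
```

Annotations without a bracketed handler are paired with None, which a caller could treat as "elide the value with no custom handling".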

Illustrative strategies for sanitizing composite annotations are described below.

Encryption

Privacy preservation through encryption performs reversible transformations of the response graphs. Data is encrypted using ciphertext-policy attribute-based encryption (CP-ABE). We assume an administrator has issued each host a set of credentials, corresponding to their attributes. An encryption policy is framed as a Boolean expression over these attributes. Hosts with attributes that satisfy the expression are able to decrypt elements encrypted with the policy.

Each host is assumed to have received a subset of low, medium, and high attributes. The ABE transformer encrypts the value associated with each annotation using a CP-ABE policy. The spade.transformer.ABE.config configuration specifies the mapping between a policy and the annotations handled by it. When a host receives a response with elements that have been encrypted, the SPADE instance will transparently perform decryption (assuming it has sufficient attributes).
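To make the notion of "attributes that satisfy the expression" concrete, the sketch below evaluates a Boolean policy against a host's attribute set. It performs no cryptography and does not use OpenABE. The nested-tuple policy representation and the function name satisfies are assumptions made for illustration only.

```python
# Hypothetical policy representation: either an attribute name (str) or a
# tuple ("and" | "or", sub_policy, sub_policy, ...). This only illustrates
# policy satisfaction; real CP-ABE enforces it cryptographically.
def satisfies(policy, attributes):
    if isinstance(policy, str):
        return policy in attributes
    operator, *operands = policy
    results = (satisfies(operand, attributes) for operand in operands)
    if operator == "and":
        return all(results)
    if operator == "or":
        return any(results)
    raise ValueError(f"unknown operator: {operator!r}")


# Example policy: "high OR (low AND medium)".
policy = ("or", "high", ("and", "low", "medium"))
host_a = {"low", "medium"}  # attributes issued to a hypothetical host A
host_b = {"low"}            # attributes issued to a hypothetical host B

can_a_decrypt = satisfies(policy, host_a)
can_b_decrypt = satisfies(policy, host_b)
```

Under this example policy, host A could decrypt and host B could not.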

In the sample configuration file below, 'keysDirectory' contains the master public key for encryption and the credentials (i.e. secret keys) used for decryption. (This is described further below.) Subsequent lines can contain comma-separated lists of annotations. Each annotation can optionally be followed by the name of a function (in the ABE transformer) in square brackets: <annotation_name>[encryptionHandler]. If present, this function is used to provide custom handling of the associated value.

keysDirectory=cfg/keys/attributes

low
cwd,fsgid,fsuid,sgid,suid,remote address[EncryptedIPAddress],path[EncryptedPath],time[EncryptedTime]

medium
command line,uid,gid,remote address[EncryptedIPAddress],path[EncryptedPath],time[EncryptedTime],size

high
name,euid,remote address[EncryptedIPAddress],path[EncryptedPath],time[EncryptedTime],operation

OpenABE

SPADE uses OpenABE. The toolkit must be downloaded and installed before the following steps can be completed.

The administrator should generate the OpenABE master key pair for the Ciphertext-Policy ABE algorithm.
The administrator should generate a set of credentials for each host's set of attributes. (The attributes correspond to the levels of encryption in the scheme described above.)
The administrator should send each host the master public key and that host's credentials.

Detailed steps are available in the OpenABE documentation.

To use the transformer to encrypt a host's query responses, run this in the SPADE control client on the host:

add transformer ABE

Strategies

Below are sample strategies for sanitizing / encrypting composite annotations.

remote address (xxx.xxx.xxx.xxx)

low, the second octet is sanitized / encrypted.

medium, the third octet is sanitized / encrypted.

high, the fourth octet is sanitized / encrypted.

path (w/x/y/z/...)

low, path after first level is sanitized / encrypted.

medium, path after the second level is sanitized / encrypted.

high, path after the third level is sanitized / encrypted.

time (yyyy-MM-dd HH:mm:ss.SSS)

low, day is sanitized / encrypted.

medium, hour is sanitized / encrypted.

high, minute, second and millisecond are sanitized / encrypted.
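The remote address and path strategies above can be sketched as follows. This is one literal reading of the lists, not the transformer's actual code: it assumes each level acts only on the component named for it, and the placeholder MASK and both function names are hypothetical.

```python
# Zero-based index of the octet each level acts on (second, third, fourth).
IP_OCTET_INDEX = {"low": 1, "medium": 2, "high": 3}
# Number of leading path levels kept at each level; the rest is dropped.
PATH_LEVELS_KEPT = {"low": 1, "medium": 2, "high": 3}
# Hypothetical placeholder; the value SPADE substitutes is not stated above.
MASK = "x"


def sanitize_ip_address(address, level):
    octets = address.split(".")
    if len(octets) != 4:
        raise ValueError(f"not a dotted-quad address: {address!r}")
    octets[IP_OCTET_INDEX[level]] = MASK
    return ".".join(octets)


def sanitize_path(path, level):
    # Handles the w/x/y/z form shown above; a leading "/" is not treated
    # specially in this sketch.
    parts = path.split("/")
    kept = PATH_LEVELS_KEPT[level]
    if len(parts) <= kept:
        return path
    return "/".join(parts[:kept])


masked_address = sanitize_ip_address("192.168.10.25", "low")
truncated_path = sanitize_path("w/x/y/z", "medium")
```

For encryption, the masked component would be replaced by its ciphertext rather than a placeholder, so that a host with sufficient attributes can recover it.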

Differential Privacy

Differential privacy is a mechanism for sharing abstracted query responses from a database without disclosing information about individual records. With differential privacy, aggregate database information is returned with the addition of statistical noise. The aim is to provide useful but privacy-preserving information to the querier. Foundational work on ε-differential privacy provides a mathematical definition of the mechanism.
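For reference, the standard definition can be stated as follows. A randomized mechanism M is ε-differentially private if, for every pair of databases D and D' that differ in a single record and for every set S of possible outputs:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
```

A smaller ε forces the two output distributions closer together, which gives stronger privacy at the cost of noisier answers.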

Aggregate queries

The QuickGrail query surface in SPADE allows users to send four types of aggregate queries. These are: (i) mean, (ii) standard deviation, (iii) histogram, and (iv) distribution queries. The histogram query shows the count of each unique value of a specified annotation key. The distribution query is similar but instead automatically determines the range of values associated with the specified key, creates a specified number of sub-ranges (partitions), and reports the counts in each sub-range (partition).

To send an aggregate query in SPADE's query client, use the stat command as follows:

stat <vertex | edge> <annotation name> <aggregate type> [<additional arguments>] <graph variable>

The possible values for <aggregate type> are mean, std, histogram, and distribution. For example, the mean file size may be of interest. Given a graph variable $files, assume each file vertex in it has a filesize annotation. This query can then be used:

stat vertex filesize mean $files

As another example, the number of processes owned by each user may be of interest. Given a variable $processes, this query can be used:

stat vertex owner histogram $processes

A distribution is an abstraction of a histogram where the values are grouped together into a specified number of partitions. Assume the time annotation on edges reports how long an operation took. Given an $operations variable, a distribution with 5 partitions can be computed with:

stat edge time distribution 5 $operations

Each partition in this example covers one fifth of the range of time annotation values. Specifically, the range from the minimum to the maximum value of the time annotation is split into 5 equal sub-ranges. The count for a partition is the number of edges whose time annotation has a value in the corresponding sub-range. Because the sub-ranges have equal width, the counts in the partitions need not be equal.
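The difference between the histogram and distribution aggregates can be illustrated with a short sketch. It is not SPADE's implementation: the function names are hypothetical, and placing the maximum value in the last partition is an assumption about boundary handling.

```python
from collections import Counter


def histogram(values):
    # Count of each unique value.
    return dict(Counter(values))


def distribution(values, partitions):
    # Split [min, max] into equal-width sub-ranges and count the values in each.
    if partitions < 1 or not values:
        raise ValueError("need at least one partition and one value")
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / partitions
    counts = [0] * partitions
    for value in values:
        if width == 0:
            index = 0  # all values identical: everything falls in the first partition
        else:
            # Assumption: the maximum value is placed in the last partition.
            index = min(int((value - lowest) / width), partitions - 1)
        counts[index] += 1
    return counts


owners = ["alice", "bob", "alice"]
durations = [1, 1, 2, 3, 10]

owner_counts = histogram(owners)
duration_counts = distribution(durations, 5)
```

With the sample durations, most values land in the first partition and the middle partitions are empty, which shows that equal-width partitions do not hold equal numbers of elements.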

Adding response noise

SPADE enables the result of the above aggregate queries to be made differentially private. The implementation uses Google's open-source differential-privacy library.

Differential privacy for aggregate queries is controlled by the epsilon value in cfg/spade.core.AbstractAnalyzer.config. To enable differential privacy, set epsilon to the desired level of privacy. To disable it, set epsilon to -1.
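The effect of epsilon on a response can be sketched with the Laplace mechanism, a standard way to make a count ε-differentially private. This sketch does not call Google's library, and the text above does not state which mechanism SPADE applies, so the choice of Laplace noise, the sensitivity default, and the function name noisy_count are all assumptions.

```python
import random


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=random):
    # Mirrors the configuration convention above: -1 disables noise.
    if epsilon == -1:
        return float(true_count)
    if epsilon <= 0:
        raise ValueError("epsilon must be positive, or -1 to disable noise")
    # Laplace mechanism: the noise scale is sensitivity / epsilon.
    scale = sensitivity / epsilon
    # The difference of two independent exponential samples with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


exact = noisy_count(42, -1)
private = noisy_count(42, 0.5, rng=random.Random(7))
```

A smaller epsilon produces a larger noise scale, so individual responses deviate more from the true count while the average over many responses stays close to it.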

Background

For more information on differential privacy and its implementation, see:

This material is based upon work supported by the National Science Foundation under Grants OCI-0722068, IIS-1116414, and ACI-1547467. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
