Home
Data is continuously transformed by computation. Understanding the origins of a piece of data can help in a variety of circumstances. For example, the data's history can facilitate fault analysis, help decide how much the data should be trusted, or aid in profiling applications.
SPADE provides functionality to track and analyze the provenance of data that arises from multiple sources, distributed over the wide area, and at varied levels of abstraction.
SPADE provides a cross-platform distributed data provenance collection, filtration, storage, and querying service. It includes support for collecting provenance from the Linux, macOS, and Windows operating systems. SPADE uses the auditing functionality of each operating system, which remains stable across various releases, to transparently record the provenance of all data. Installation can be performed with a pre-built package or from source code.
SPADE automates the generation and collection of data provenance at the operating system level. It provides a broad view of activity across all the computers it is installed on in a distributed system. SPADE does this without requiring applications or the operating systems to be modified. It reports information about the name, owner, group, parent, host, creation time, command line, and environment variables of each process. It also reports the name, path, host, size, and modification time of files read or written during a computation. All this information can be collected with a few simple commands.
SPADE supports the use of variables, constraints, lineage, path, and set operators when searching local provenance records. It also supports graph and relational (SQL) queries over local provenance. Provenance collected by SPADE can also be inspected with third-party tools, such as Neoclipse and SQL Workbench. Finally, the SPADE query tool can transparently resolve lineage queries that span multiple hosts in a distributed system.
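To make the idea of a lineage query concrete, the following is a minimal sketch in Python of a toy provenance graph and an ancestor traversal. It is not SPADE's query language or API: the `ProvenanceGraph` class, its method names, the vertex identifiers, and the relation labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


class ProvenanceGraph:
    """Toy in-memory provenance graph (hypothetical, not SPADE's data model)."""

    def __init__(self):
        self.vertices = {}               # vertex id -> annotations
        self.parents = defaultdict(set)  # child id -> {(parent id, relation)}

    def add_vertex(self, vid, kind, **annotations):
        self.vertices[vid] = {"type": kind, **annotations}

    def add_edge(self, child, parent, relation):
        # An edge points from an item to something it depends on.
        self.parents[child].add((parent, relation))

    def lineage(self, start, max_depth):
        """Return the ids of all ancestors of `start` within `max_depth` hops."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            vid, depth = queue.popleft()
            if depth == max_depth:
                continue  # do not expand beyond the requested depth
            for parent, _relation in self.parents[vid]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, depth + 1))
        seen.discard(start)
        return seen


# A process reads one file and writes another.
g = ProvenanceGraph()
g.add_vertex("f_in", "Artifact", path="/tmp/input.csv")
g.add_vertex("p_sort", "Process", name="sort", pid="101")
g.add_vertex("f_out", "Artifact", path="/tmp/sorted.csv")
g.add_edge("p_sort", "f_in", "Used")
g.add_edge("f_out", "p_sort", "WasGeneratedBy")

one_hop = g.lineage("f_out", max_depth=1)
two_hops = g.lineage("f_out", max_depth=2)
```

Here `one_hop` contains only the process that wrote the output file, while `two_hops` also reaches the input file that the process read. A query spanning multiple hosts would follow the same kind of traversal across graphs held on different machines.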
SPADE is designed to be extensible in multiple ways. A reporter can be implemented to collect provenance activity about a new domain of interest. A new filter can be written to perform novel transformations on provenance events. A new storage system can be added to record provenance in a different format. A new sketch can be designed to optimize the distributed querying. A new transformer can be used to dynamically rewrite query responses.
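The extension points above can be pictured as stages in a pipeline: reporters produce provenance events, filters transform or discard them, and storage records what remains. The Python sketch below illustrates only that flow. It does not use SPADE's actual interfaces; the function names, the event dictionaries, and the host name are hypothetical.

```python
from typing import Callable, Iterable, List, Optional

Event = dict
Filter = Callable[[Event], Optional[Event]]


def drop_temporary_files(event: Event) -> Optional[Event]:
    """Hypothetical filter: discard events about files under /tmp."""
    if event.get("path", "").startswith("/tmp/"):
        return None
    return event


def tag_host(host: str) -> Filter:
    """Hypothetical filter factory: annotate every event with a host name."""
    def apply(event: Event) -> Optional[Event]:
        return {**event, "host": host}
    return apply


def run_pipeline(events: Iterable[Event], filters: List[Filter],
                 storage: List[Event]) -> None:
    """Pass each reported event through the filters, then store survivors."""
    for event in events:
        for f in filters:
            event = f(event)
            if event is None:
                break  # a filter dropped the event
        else:
            storage.append(event)


# Events a reporter might emit (illustrative values only).
reported = [
    {"type": "Artifact", "path": "/tmp/scratch.dat"},
    {"type": "Artifact", "path": "/home/alice/report.txt"},
    {"type": "Process", "name": "sort"},
]
stored: List[Event] = []
run_pipeline(reported, [drop_temporary_files, tag_host("node-1")], stored)
```

After the run, `stored` holds the two events that were not under `/tmp`, each annotated with the host name. Swapping in a different filter or storage target changes one stage without touching the others, which is the point of the modular design described above.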
Please use the links in the sidebar on the right to learn how to use SPADE to collect, filter, store, and query your provenance records.
This material is based upon work supported by the National Science Foundation under Grants OCI-0722068, IIS-1116414, and ACI-1547467. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
- Setting up SPADE
- Storing provenance
- Collecting provenance
  - Across the operating system
  - Limiting collection to a part of the filesystem
  - From an external application
  - With compile-time instrumentation
  - Using the reporting API
  - Of transactions in the Bitcoin blockchain
- Filtering provenance
- Viewing provenance
- Querying SPADE
  - Illustrative example
  - Transforming query responses
  - Protecting query responses
- Miscellaneous