Path Transparency Observatory API Specification

Path Transparency Observatory API specification, version 3

The API consists of three applications: raw data access and upload, observation access, and observation query. The interface to each application is made up of certain resources accessed in a RESTful way; these resources are specified below.

Access Control and Permissions

All applications use API key based access control. An API key is associated with a set of permissions, which scope the operations permitted on the observatory database and the access to raw data for specific campaigns. In this document, the permission required for a given method on a given resource is noted.

API keys are given in the HTTP Authorization request header, which must consist of the string APIKEY followed by whitespace and the API key as a string.
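For example, a request made with the API key abadc0de (the example key used in the usage sections below) carries the following request header:

Authorization: APIKEY abadc0de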

Raw Data Access and Upload

The raw data access and upload API (resources under /raw) allows the upload of raw data files to the PTO, and the later retrieval of those files. Each file is associated with a campaign -- a group of files related to a single measurement campaign from a single data source. Campaigns and files have associated metadata which is used by the PTO and analysis modules to save metadata about the raw data; this may also be used by users of the PTO to store information about raw files and campaigns.

The resources and methods available thereon are summarized in the table below.

| Method | Resource | Permission | Description |
|--------|----------|------------|-------------|
| GET | /raw | raw_metadata | Retrieve URLs for campaigns as JSON |
| GET | /raw/<c> | raw_metadata | Retrieve metadata for campaign c as JSON |
| PUT | /raw/<c> | write_raw:<c> | Write metadata for campaign c as JSON |
| GET | /raw/<c>/<f> | raw_metadata | Retrieve metadata for file f in c as JSON |
| PUT | /raw/<c>/<f> | write_raw:<c> | Write metadata for file f in c as JSON |
| GET | /raw/<c>/<f>/data | read_raw:<c> | Retrieve content for file f in c (by convention) |
| PUT | /raw/<c>/<f>/data | write_raw:<c> | Write content for file f in c (by convention) |
| DELETE | /raw/<c>/<f> | write_raw:<c> | Delete a file and its metadata |
| DELETE | /raw/<c> | write_raw:<c> | Delete a campaign and all its files |

Metadata

Associated with each file and each campaign in the raw data store is a metadata object. This metadata object is a set of key-value pairs, presented by the API as a JSON object. Certain keys in this object are reserved for use by the system, or are generated by the system. Other metadata keys can be freely used by the creator of a raw data file to communicate metadata information with a future analysis process that will be run on the raw data, or for future retrieval of the file. Metadata key behavior is determined by the metadata key name:

  • All metadata keys whose names begin with a single underscore _ are reserved for system use; they may be written to, but have a special meaning to the PTO itself.
  • All metadata keys whose names begin with __ are virtual, and generated by the system; they may not be written to.
  • All other metadata key names are free for use by users and analysis modules.

Files inherit metadata from their containing campaign. If a file's metadata and its containing campaign's metadata contain the same key, the value associated with the file overrides the value inherited from the campaign for that file.
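As a hypothetical illustration of this inheritance, consider a campaign with the metadata

{
    "_owner": "you@example.com",
    "_file_type": "obs-bz2"
}

and a file within that campaign with the metadata

{
    "_file_type": "obs"
}

The effective metadata for the file then has _file_type obs (the file's own value overriding the campaign's) and _owner you@example.com (inherited from the campaign).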

The following reserved and virtual metadata keys are presently supported:

| Key | Description |
|-----|-------------|
| _file_type | PTO filetype. See Filetypes, below. |
| _owner | Identity (via email) of the user or organization owning the file/campaign |
| _time_start | Timestamp of the first observation in the raw data file, in ISO8601 format |
| _time_end | Timestamp of the last observation in the raw data file, in ISO8601 format |
| _deprecated | If present, timestamp at which the file or campaign was marked deprecated |
| __data | URL of the resource containing file data |
| __data_size | Size of the file in bytes. 0 if the data file has not been uploaded. |

Though the data resource is by convention accessible by appending /data to the path of the metadata resource, the system may at any time place data at another path; therefore, clients should always use the path given in the __data metadata key when uploading or retrieving data.

Filetypes

Every raw data file has a filetype, given in the _file_type key, which the PTO uses to determine how to handle files internally, and which analysis modules use to determine how to read and whether they are interested in raw data files. Each filetype is associated with a MIME type, and the Content-Type header on data uploads via PUT must match the filetype associated with the file.

Often, all the files within a campaign will share the same filetype. In this case, filetype information is set in campaign metadata, not in individual file metadata.

While filetypes are extensible, the filetypes supported by the PTO as installed are listed below:

| Filetype | MIME type | Description |
|----------|-----------|-------------|
| obs-bz2 | application/bzip2 | Compressed observations in OSF |
| obs | application/vnd.mami.ndjson | Uncompressed observations in OSF |

Raw data API usage

We use curl to illustrate the usage of the PTO raw API. We assume the API is rooted at https://pto.example.com/, and that the API key abadc0de holds the permissions raw_metadata, read_raw:test, and write_raw:test. (In these examples, the output of curl is pretty-printed via python3 -m json.tool; this step is not shown in the command lines.)

Creating and listing campaigns

To list the campaigns for which raw data is stored, simply fetch the /raw resource.

$ curl -H "Authorization: APIKEY abadc0de" https://pto.example.com/raw
{
    "campaigns": []
}

This PTO instance is empty: no campaigns are stored here. To create the test campaign, which we are preauthorized to do, simply upload the campaign's metadata at the campaign's path:

$ cat test_campaign.json
{
    "_owner": "you@example.com",
    "_file_type": "test"
}

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/json" \
       -X PUT https://pto.example.com/raw/test \
       --data-binary @test_campaign.json
{
    "_file_type":"test",
    "_owner":"you@example.com"
}

The reply echoes back the metadata uploaded. A campaign's metadata can be changed by simply uploading new metadata.

We can verify that our campaign has been created by listing campaigns again:

$ curl -H "Authorization: APIKEY abadc0de" https://pto.example.com/raw
{
    "campaigns": [
        "http://pto.example.com/raw/test"
    ]
}

Uploading Raw Data

Once a campaign has been created, uploading raw data to it is a two-step process: creating a new file by uploading its metadata, then uploading the data associated with the file.

For the purposes of this example, we'll upload a single test data file containing some JSON formatted data. First the metadata:

$ cat test_metadata.json
{
    "_time_start": "2018-04-25T10:15:35Z",
    "_time_end":   "2018-04-25T10:20:48Z",
    "purpose":     "demonstrate file upload"
}

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/json" \
       -X PUT https://pto.example.com/raw/test/test001.json \
       --data-binary @test_metadata.json
{
    "__data": "https://pto.example.com/raw/test/test001.json/data",
    "_file_type": "test",
    "_owner": "you@example.com",
    "_time_end": "2018-04-25T10:20:48Z",
    "_time_start": "2018-04-25T10:15:35Z",
    "purpose": "demonstrate file upload"
}

Here the uploaded metadata, including keys inherited from the campaign, is echoed back from the server, along with a link to which data can be uploaded (in the __data key).

Now that the metadata is created, we can upload the data file to the given URL:

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/json" \
       -X PUT https://pto.example.com/raw/test/test001.json/data \
       --data-binary @test_data.json
{
    "__data": "https://pto.example.com/raw/test/test001.json/data",
    "__data_size": 37,
    "_file_type": "test",
    "_owner": "you@example.com",
    "_time_end": "2018-04-25T10:20:48Z",
    "_time_start": "2018-04-25T10:15:35Z",
    "purpose": "demonstrate file upload"
}

This echoes back the metadata for the file. Note here the new __data_size key, which gives the size of the data file in bytes.

Downloading Raw Data

While the current PTO implementation by convention always generates data URLs from metadata URLs by appending /data to the path, this is not guaranteed to always be the case, so it is important to check the __data key in the metadata for the file before downloading. Here we assign the data URL to a shell variable, then download from that URL:

$ DATAURL=`curl -s -H "Authorization: APIKEY abadc0de" \
                https://pto.example.com/raw/test/test001.json | \
           python3 -c 'import sys, json; print(json.load(sys.stdin)["__data"])'`
$ curl -H "Authorization: APIKEY abadc0de" $DATAURL > downloaded_file.json

Changing Metadata and Data

Metadata can be changed by uploading a new metadata object.

Once a file has been uploaded, its data can no longer be changed.

Files can currently not be deleted via the API, though a file can be fully deleted by removing the file and metadata file from the filesystem backing the raw data store. To mark a file (or a campaign) as no longer valid, use the _deprecated system metadata tag.
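For example, to deprecate the test001.json file uploaded above, one could re-upload its metadata with a _deprecated timestamp added. This is only a sketch: it assumes, as the statement above suggests, that a metadata PUT replaces the stored metadata object wholesale, and it uses an arbitrary deprecation timestamp.

$ cat deprecate_metadata.json
{
    "_time_start": "2018-04-25T10:15:35Z",
    "_time_end":   "2018-04-25T10:20:48Z",
    "purpose":     "demonstrate file upload",
    "_deprecated": "2018-06-07T09:00:00Z"
}

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/json" \
       -X PUT https://pto.example.com/raw/test/test001.json \
       --data-binary @deprecate_metadata.json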

Observation Access

The observation access API (resources under /obs) allows access to PTO observations. An observation is a tuple consisting of the following elements:

  • a time interval during which the observation is valid, consisting of a start time and an end time;
  • a path on which the observation is made, consisting of a sequence of path elements;
  • a condition observed on this path; and
  • an optional value associated with the condition.

More about the PTO's information model is given here.

Observations are grouped into observation sets. An observation set is a set of observations resulting from a single run of an analyzer on some input data (see Data Analysis, below). All observations in an observation set share the same metadata and provenance. Provenance provides information about the source data that was analyzed (in terms of raw data files and/or other observation sets stored in the PTO) and the analysis that was performed.

The resources and methods available thereon are summarized in the table below.

| Method | Resource | Permission | Description |
|--------|----------|------------|-------------|
| GET | /obs | read_obs | Retrieve URLs for observation sets as JSON |
| GET | /obs/by_metadata | read_obs | Retrieve URLs for observation sets by metadata |
| GET | /obs/conditions | read_obs | List conditions in the observation database |
| POST | /obs/create | write_obs | Create a new observation set |
| GET | /obs/<o> | read_obs | Retrieve metadata and provenance for o as JSON |
| PUT | /obs/<o> | write_obs | Update metadata and provenance for o as JSON |
| GET | /obs/<o>/data | read_obs_data | Retrieve obset file for o as NDJSON (by convention) |
| PUT | /obs/<o>/data | write_obs | Upload obset file for o as NDJSON (by convention) |

Metadata and Provenance

As with raw data files, observation sets have associated metadata; as with raw data files, arbitrary metadata can be set on observation sets by the analyses that create them. The same rules for reserved and virtual metadata names apply for observation sets as for raw data. The metadata for an observation set also contains information about the observation set's provenance.

Provenance is tracked by three metadata keys. _sources is an array of URLs referencing the raw data file or the observation set(s) from which the observations in an observation set are derived. _analyzer is a URL referring to the analyzer that created the observation set. _campaign is a URL referring to the campaign from which the original raw data was derived, and is only present if the observation set is derived only from observation sets / raw data in the same campaign.

The following reserved and virtual metadata keys are presently supported:

| Key | Description |
|-----|-------------|
| _sources | Array of PTO URLs of raw data sources and observation sets |
| _analyzer | URL of analyzer metadata |
| _conditions | Array of conditions declared in the observation set |
| _deprecated | If present, timestamp at which an observation set was marked deprecated |
| __obs_count | Count of observations in the observation set |
| __time_start | Timestamp of the first observation start time in the set |
| __time_end | Timestamp of the last observation end time in the set |
| __data | URL of the resource containing observation set data |

Querying Observation Sets by Metadata

The /obs/by_metadata resource lists links to observation sets based on the presence or value of arbitrary metadata keys, or on the values of the well-known source, analyzer, and condition metadata. The following query parameters are supported:

| Parameter | Description |
|-----------|-------------|
| k | Observation sets containing the given metadata key (with v, having a specific value) |
| v | Value to query for (use with k) |
| source | Observation sets derived from a source URL starting with the given prefix |
| analyzer | Observation sets derived from an analyzer whose metadata URL starts with the given prefix |
| condition | Observation sets declaring the given condition |

When multiple parameters are given, the intersection of observation sets fulfilling all parameters is returned.
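For example, a hypothetical request for all observation sets declaring the pto.test.ok condition and derived from the test campaign could look as follows; the response lists links to the matching observation sets:

$ curl -G -H "Authorization: APIKEY abadc0de" \
       https://pto.example.com/obs/by_metadata \
       --data-urlencode "condition=pto.test.ok" \
       --data-urlencode "source=https://pto.example.com/raw/test"

(-G makes curl append the --data-urlencode parameters to the URL as a query string rather than sending them in the request body.)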

Analyzer Metadata

Observation sets refer to the analyzer that created them via the _analyzer metadata key. This either contains a URL pointing at an analyzer metadata object, or a URL pointing at a source code repository (including a tag referring to a specific revision or commit), which itself contains analyzer metadata in a file named __pto_analyzer_metadata.json in its root directory.

Analyzer metadata is stored as a JSON object, which uses the following keys (some of which apply only to certain analyzer types and forms, as described below):

| Key | Description |
|-----|-------------|
| _repository | URL of source code repository, if not implicit |
| _owner | Identity (via email) of user or organization owning the analyzer |
| _file_types | File types consumable by a raw analyzer, as an array |
| _invocation | Command to run in repository root to invoke the analyzer, if local |
| _platform | Platform identifier; see interface description |

As with raw and observation metadata, all keys not beginning with _ are freeform, and may be used to store other information about the analyzer.

An analyzer can be one of two types. A raw analyzer (or normalizer) reads in raw data of one or more filetypes, and produces observations. A derived analyzer reads in observations from one or more sets, and produces observations. Raw analyzer metadata must contain a _file_types key, a JSON array listing PTO filetypes it can consume; derived analyzer metadata must not contain this key.

Analyzers can take one of two forms. A local analyzer is designed to run within the PTO, with direct access to raw data files or observations. It reads raw data or observation files on standard input, and produces observation files and observation set metadata on standard output. Local analyzer metadata must contain an _invocation key, which is a command to run in the repository root to invoke the analyzer. More on the interface for local analyzers is given here.

A local analyzer runtime is not yet available in the PTO; local analyzers can be run manually with the command-line tools described here.

A client analyzer is designed to use the PTO API to retrieve raw data and observation sets and upload its results. As it cannot be automatically invoked, client analyzer repositories should contain human-readable documentation for invocation. Client analyzer metadata must not contain an _invocation key.
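As a sketch, the __pto_analyzer_metadata.json for a hypothetical local raw analyzer (normalizer) consuming raw files of the test filetype might look like this; the invocation command and the freeform description key are placeholders:

{
    "_owner": "analyzers@example.com",
    "_file_types": ["test"],
    "_invocation": "python3 normalize.py",
    "description": "example normalizer for test raw data"
}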

See the analyzer interface description for more.

Observation API usage

As above, we use curl to illustrate the usage of the PTO observation API. We assume the API is rooted at https://pto.example.com/, and that the API key abadc0de holds the permissions read_obs and write_obs.

Uploading an observation set

First, create observation set metadata, and upload it to get a new observation set ID. The observation set metadata must contain the full list of conditions contained in the observation set, a link to analyzer metadata (see the analyzer interface), and a list of links to raw data sources from which the observations were derived.

$ cat obs_metadata.json
{
  "_conditions": ["pto.test.ok","pto.test.not_ok"],
  "_analyzer":   "https://gitlab.example.com/analyzers/test_analyzer/raw/master/analyzer_meta.json"
  "_sources":    ["https://pto.example.com/raw/test"]
}

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/json" \
       -X POST https://pto.example.com/obs/create \
       --data-binary @obs_metadata.json
{
  "__data": "http://localhost:8383/obs/1/data",
  "__link": "http://localhost:8383/obs/1",
  "__modified": "2018-06-07T08:29:26Z",
  "__created": "2018-06-07T08:29:26Z",
  "_conditions": ["pto.test.ok","pto.test.not_ok"],
  "_analyzer":   "https://gitlab.example.com/analyzers/test_analyzer/raw/master/analyzer_meta.json"
  "_sources":    ["https://pto.example.com/raw/test"]
}

This returns the metadata, including new system metadata. The __link key here is a permanent link to the created observation set, including its ID, and the __data key is a path to which observation set data can be uploaded. Let's do that.

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/vnd.mami.ndjson" \
       -X PUT https://pto.example.com/obs/1/data \
       --data-binary @obs_data.ndjson
{
  "__data": "http://localhost:8383/obs/1/data",
  "__link": "http://localhost:8383/obs/1",
  "__modified": "2018-06-07T08:31:14Z",
  "__created": "2018-06-07T08:31:14Z",
  "__obs_count": 2,
  "_conditions": ["pto.test.ok","pto.test.not_ok"],
  "_analyzer":   "https://gitlab.example.com/analyzers/test_analyzer/raw/master/analyzer_meta.json"
  "_sources":    ["https://pto.example.com/raw/test"]
}

As when uploading a raw data file, the metadata is echoed back; the new __obs_count metadata key shows the number of observations that have been stored.
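Retrieving the uploaded observations works as for raw data: fetch the URL given in the __data key of the observation set metadata, which requires the read_obs_data permission. A minimal sketch, assuming the observation set created above and the conventional /obs/<o>/data data path:

$ curl -H "Authorization: APIKEY abadc0de" \
       https://pto.example.com/obs/1/data > downloaded_obs.ndjson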

Observation Query

The observation query API (resources under /query) allows the submission of queries to the PTO, to retrieve observations and observation aggregates meeting certain criteria from the PTO's observation database.

| Method | Resource | Permission | Description |
|--------|----------|------------|-------------|
| POST or GET | /query/submit | submit_query_obs or submit_query_group | Submit a query |
| GET | /query | read_query | List currently cached and pending queries |
| GET | /query/<q> | read_query | Get query metadata, including ETA for pending queries |
| GET | /query/<q>/result | read_query | Get query results (by convention) |
| PUT | /query/<q> | update_query | Update query metadata |

Queries can be submitted by POSTing to the /query/submit resource (or via GET, with the same parameters in the URL query string). The query itself is defined by the parameters in the application/x-www-form-urlencoded content. The parameters are summarized below:

| Parameter | Semantics | Multiple? | Meaning |
|-----------|-----------|-----------|---------|
| time_start | temporal | no | Select observations starting at or after the given start time |
| time_end | temporal | no | Select observations ending at or before the given end time |
| set | select | yes | Select observations in the given observation set ID |
| on_path | select | yes | Select observations with the given element in the path |
| source | select | yes | Select observations with the given element at the start of the path |
| target | select | yes | Select observations with the given element at the end of the path |
| condition | select | yes | Select observations with the given condition, with wildcards |
| feature | select | yes | Select observations with the given condition feature |
| aspect | select | yes | Select observations with the given condition aspect |
| group | group | yes | Group observations and return counts by group |
| intersect_condition | set | yes | Group observations by path, select paths by set intersection on conditions |
| option | options | yes | Specify a query option |

All parameters with temporal semantics must be present, and are used to bound the query in time. Parameters with select semantics may be given to filter observations. If multiple instances of a select parameter are given, any of the values will match; however, an observation must match at least one of the values for each distinct parameter given (i.e., the query language supports AND-of-OR semantics). Parameters with group or set semantics, as well as the option parameter, may modify the type of query and the format of its results; see the Results section below.
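For example, a hypothetical selection query for all pto.test.ok observations within a one-day window could be submitted as follows (assuming an API key holding the submit_query_obs permission); the response contains the query metadata described under Metadata below, including the query's __state and a __link to poll:

$ curl -H "Authorization: APIKEY abadc0de" \
       -X POST https://pto.example.com/query/submit \
       --data-urlencode "time_start=2018-04-25T00:00:00Z" \
       --data-urlencode "time_end=2018-04-26T00:00:00Z" \
       --data-urlencode "condition=pto.test.ok"

(--data-urlencode sends the parameters as application/x-www-form-urlencoded content, URL-encoding the values.)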

Query Options

The option parameter is used to modify the behavior of queries. Multiple options may be present. The following options are presently supported:

| Option | Value | Behavior |
|--------|-------|----------|
| sets_only | | Return links to observation sets containing observations answering the query, instead of observation data directly |
| count_targets | | Group queries should count distinct targets, not distinct observations |

Metadata

When a query is submitted, it goes into the query cache. The query cache holds the query metadata until the query has been scheduled to run. Once it has run, the query metadata will be updated to point to the result and to the observation sets covered by the query.

| Key | Description |
|-----|-------------|
| __encoded | URL-encoded parameters from which the query was generated |
| __state | Query state; see below |
| __link | URL pointing to canonical query metadata, when available |
| __result | URL of the resource containing the complete result, when available |
| __sources | Array of PTO URLs of observation sets covered by the query, when available |
| _ext_ref | External reference for a permanence request; see below |

A query can have one of the following states:

| State | Meaning |
|-------|---------|
| submitted | Submitted, but not yet running |
| pending | Running and awaiting results |
| failed | Abnormally ended without returning results |
| complete | Results are available |
| permanent | Results are available and cached results will be stored permanently |
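A client typically polls the query metadata resource (the __link returned at submission) until __state becomes complete, and then fetches the URL given in __result. A minimal sketch, with <q> standing in for the query identifier taken from __link:

$ curl -H "Authorization: APIKEY abadc0de" "https://pto.example.com/query/<q>"
# ... repeat until the returned metadata contains "__state": "complete" ...
$ curl -H "Authorization: APIKEY abadc0de" "https://pto.example.com/query/<q>/result"

(The second URL follows the /query/<q>/result convention; the URL given in the __result metadata key should be preferred when available.)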

Results

The type of the query determines the format of the results, as below:

Observation Selection Queries

A query created without any group_by or intersect_condition parameters and without the sets_only option is a selection query. The observations returned by this query are those within the interval between the time_start and time_end parameters which match the selection parameters.

[EDITOR'S NOTE: matching rules go here, describe condition wildcards.]

The result of a selection query is a JSON object, the fields of which are as follows:

| Key | Value |
|-----|-------|
| prev | Link to previous page (see Pagination) |
| next | Link to next page (see Pagination) |
| obs | JSON array containing observations in OSF format |

Observation Set Selection Queries

A query created without any group_by or intersect_condition parameters and with the sets_only option is an observation set selection query. Instead of individual observations, the result lists links to the observation sets containing observations within the interval between the time_start and time_end parameters which match the selection parameters. The result is a JSON object, the fields of which are as follows:

| Key | Value |
|-----|-------|
| prev | Link to previous page (see Pagination) |
| next | Link to next page (see Pagination) |
| sets | JSON array containing links to observation sets containing observations answering the query |

Condition Set Intersection Queries

NOTE: Condition set intersection queries are not yet supported by the PTO. This section documents future functionality.

A query created with one or more intersect_condition parameters is a set intersection query. First, observations matching the selection parameters are selected. Then the observations are grouped by path, and only those paths that fall within every set defined by the intersect_condition parameters are selected.

An intersect_condition parameter is either a full condition name (i.e., without wildcards), or a full condition name prefixed with the character ! (URL-encoded %21). In the former case, the set contains all paths for which there is at least one observation with the specified condition. In the latter case, the set is negated, i.e. it contains all paths for which there are no observations with the specified condition. For example, a query with two intersect_condition parameters with values foo.bar and !foo.baz will select paths where there is at least one foo.bar observation and no foo.baz observations.
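Since condition set intersection queries are not yet supported, the following is only a sketch of how the example above might be submitted; note that curl's --data-urlencode takes care of encoding the ! prefix as %21, and that an API key holding the appropriate submit_query permission is assumed:

$ curl -H "Authorization: APIKEY abadc0de" \
       -X POST https://pto.example.com/query/submit \
       --data-urlencode "time_start=2018-04-25T00:00:00Z" \
       --data-urlencode "time_end=2018-04-26T00:00:00Z" \
       --data-urlencode "intersect_condition=foo.bar" \
       --data-urlencode "intersect_condition=!foo.baz"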

The result of a set intersection query is a JSON object, the fields of which are as follows:

| Key | Value |
|-----|-------|
| prev | Link to previous page (see Pagination) |
| next | Link to next page (see Pagination) |
| paths | JSON array containing paths as strings |

Aggregation Queries

A query created with one or more group_by parameters is an aggregation query. The result contains, for each group, the count of observations selected by the select parameters. The following group_by parameters are available:

| Value | Meaning |
|-------|---------|
| year | Count by year of time_start |
| month | Count by year/month of time_start |
| day | Count by year/month/day of time_start |
| hour | Count by year/month/day/hour of time_start |
| week | Count by year/week (weeks starting Monday) of time_start |
| week_day | Count by day of week of time_start (7 groups) |
| day_hour | Count by hour of day of time_start (24 groups) |
| condition | Count by condition |
| feature | Count by feature (first component of condition) |
| aspect | Count by aspect (all but last component of condition) |
| value | Count by condition value |
| source | Count by first element in path |
| target | Count by last element in path |

The result of an aggregation query is a JSON object, the fields of which are as follows:

| Key | Value |
|-----|-------|
| prev | Link to previous page (see Pagination) |
| next | Link to next page (see Pagination) |
| groups | List of JSON arrays, each giving the group value(s) with the count in the final position |
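For instance, the groups array for a hypothetical aggregation query grouped by condition might look like the following (the conditions are those used in the examples above; the counts are purely illustrative):

{
    "groups": [
        ["pto.test.not_ok", 23],
        ["pto.test.ok", 4711]
    ]
}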

Pagination

[EDITOR'S NOTE: review me]

API resources which return lists of URLs or results in arrays in JSON objects support pagination. By default, if more than 20 items would be returned in the list on which a resource is centered, the results are paginated: the top-level JSON object will gain a next key if the result is not the last page, and a prev key if the result is not the first page. Pagination is controlled using the following GET parameters:

| Parameter | Meaning |
|-----------|---------|
| page | Page number, beginning with 0. Defaults to 0. |
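For example, fetching the second page of observation set links from the /obs resource (assuming the default page size of 20 items described above) is a matter of adding the page parameter:

$ curl -H "Authorization: APIKEY abadc0de" "https://pto.example.com/obs?page=1"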

Pagination is applied to the following elements on the following resources:

| Resource | Element paginated | Pagination default |
|----------|-------------------|--------------------|
| /raw/<c> | files | tbd |
| /obs | sets | tbd |
| /obs/by_metadata | sets | tbd |
| /query | queries | tbd |
| /query/<q>/result | results | tbd |