Path Transparency Observatory API Specification

Path Transparency Observatory API specification, version 3

The API consists of three applications: raw data access and upload, observation access, and observation query. The interface to each application is made up of certain resources accessed in a RESTful way; these resources are specified below.

Access Control and Permissions

All applications use API key based access control. An API key is associated with a set of permissions, which scope the operations permitted on the observatory database and the access to raw data for specific campaigns. In this document, the permission required for a given method on a given resource is noted.

API keys are given in the HTTP Authorization request header, which must consist of the string APIKEY followed by whitespace and the API key as a string.
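For example, a request made with the API key abadc0de (the example key used in the usage sections below) carries the following request header:

Authorization: APIKEY abadc0de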

Raw Data Access and Upload

The raw data access and upload API (resources under /raw) allows the upload of raw data files to the PTO, and the later retrieval of those files. Each file is associated with a campaign -- a group of files related to a single measurement campaign from a single data source. Campaigns and files have associated metadata which is used by the PTO and analysis modules to save metadata about the raw data; this may also be used by users of the PTO to store information about raw files and campaigns.

The resources and methods available thereon are summarized in the table below.

| Method | Resource | Permission | Description |
|--------|----------|------------|-------------|
| GET | /raw | raw_metadata | Retrieve URLs for campaigns as JSON |
| GET | /raw/<c> | raw_metadata | Retrieve metadata for campaign c as JSON |
| PUT | /raw/<c> | write_raw:<c> | Write metadata for campaign c as JSON |
| GET | /raw/<c>/<f> | raw_metadata | Retrieve metadata for file f in c as JSON |
| PUT | /raw/<c>/<f> | write_raw:<c> | Write metadata for file f in c as JSON |
| GET | /raw/<c>/<f>/data | read_raw:<c> | Retrieve content for file f in c (by convention) |
| PUT | /raw/<c>/<f>/data | write_raw:<c> | Write content for file f in c (by convention) |
| DELETE | /raw/<c>/<f> | write_raw:<c> | Delete a file and its metadata |
| DELETE | /raw/<c> | write_raw:<c> | Delete a campaign and all its files |

Metadata

Associated with each file and each campaign in the raw data store is a metadata object. This metadata object is a set of key-value pairs, presented by the API as a JSON object. Certain keys in this object are reserved for use by the system, or are generated by the system. Other metadata keys can be freely used by the creator of a raw data file to communicate metadata information with a future analysis process that will be run on the raw data, or for future retrieval of the file. Metadata key behavior is determined by the metadata key name:

  • All metadata keys whose names begin with a single underscore _ are reserved for system use; they may be written to, but have a special meaning to the PTO itself.
  • All metadata keys whose names begin with __ are virtual, and generated by the system; they may not be written to.
  • All other metadata key names are free for use by users and analysis modules.

Files inherit metadata from their containing campaign. If a file's metadata and its containing campaign's metadata contain the same key, the value associated with the file overrides the value inherited from the campaign for that file.
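As a hypothetical illustration of this inheritance, consider a campaign with the metadata

{
    "_owner": "you@example.com",
    "_file_type": "obs-bz2"
}

and a file within that campaign with the metadata

{
    "_file_type": "obs"
}

The effective metadata for the file then has _file_type obs (the file's own value overriding the campaign's) and _owner you@example.com (inherited from the campaign).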

The following reserved and virtual metadata keys are presently supported:

| Key | Description |
|-----|-------------|
| _file_type | PTO filetype. See Filetypes, below. |
| _owner | Identity (via email) of the user or organization owning the file/campaign |
| _time_start | Timestamp of the first observation in the raw data file, in ISO8601 format |
| _time_end | Timestamp of the last observation in the raw data file, in ISO8601 format |
| _deprecated | If present, timestamp at which the file or campaign was marked deprecated |
| __data | URL of the resource containing file data |
| __data_size | Size of the file in bytes. 0 if the data file has not been uploaded. |

Though the data resource is by convention accessible by appending /data to the path of the metadata resource, the system may at any time place data at another path; therefore, clients should always use the path given in the __data metadata key when uploading or retrieving data.

Filetypes

Every raw data file has a filetype, given in the _file_type key, which the PTO uses to determine how to handle files internally, and which analysis modules use to determine how to read and whether they are interested in raw data files. Each filetype is associated with a MIME type, and the Content-Type header on data uploads via PUT must match the filetype associated with the file.

Often, all the files within a campaign will share the same filetype. In this case, filetype information is set in campaign metadata, not in individual file metadata.

While filetypes are extensible, the filetypes supported by the PTO as installed are listed below:

| Filetype | MIME type | Description |
|----------|-----------|-------------|
| obs-bz2 | application/bzip2 | Compressed observations in OSF |
| obs | application/vnd.mami.ndjson | Uncompressed observations in OSF |

Raw data API usage

We use curl to illustrate the usage of the PTO raw API. We assume the API is rooted at https://pto.example.com/, and that the API key abadc0de holds the permissions raw_metadata, read_raw:test, and write_raw:test. (In these examples, the output of curl is pretty-printed via python3 -m json.tool; this step is not shown in the command lines.)

Creating and listing campaigns

To list the campaigns for which raw data is stored, simply fetch the /raw resource.

$ curl -H "Authorization: APIKEY abadc0de" https://pto.example.com/raw
{
    "campaigns": []
}

This PTO instance is empty: no campaigns are stored here. To create the test campaign, which we are preauthorized to do, simply upload the campaign's metadata at the campaign's path:

$ cat test_campaign.json
{
    "_owner": "you@example.com",
    "_file_type": "test"
}

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/json" \
       -X PUT https://pto.example.com/raw/test \
       --data-binary @test_campaign.json
{
    "_file_type":"test",
    "_owner":"you@example.com"
}

The reply echoes back the metadata uploaded. A campaign's metadata can be changed by simply uploading new metadata.

We can verify that our campaign has been created by listing campaigns again:

$ curl -H "Authorization: APIKEY abadc0de" https://pto.example.com/raw
{
    "campaigns": [
        "http://pto.example.com/raw/test"
    ]
}

Uploading Raw Data

Once a campaign has been created, uploading raw data to it is a two-step process: creating a new file by uploading its metadata, then uploading the data associated with the file.

For the purposes of this example, we'll upload a single test data file containing some JSON formatted data. First the metadata:

$ cat test_metadata.json
{
    "_time_start": "2018-04-25T10:15:35Z",
    "_time_end":   "2018-04-25T10:20:48Z",
    "purpose":     "demonstrate file upload"
}

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/json" \
       -X PUT https://pto.example.com/raw/test/test001.json \
       --data-binary @test_metadata.json
{
    "__data": "https://pto.example.com/raw/test/test001.json/data",
    "_file_type": "test",
    "_owner": "you@example.com",
    "_time_end": "2018-04-25T10:20:48Z",
    "_time_start": "2018-04-25T10:15:35Z",
    "purpose": "demonstrate file upload"
}

Here the uploaded metadata, including keys inherited from the campaign, is echoed back from the server, along with a link to which data can be uploaded (in the __data key).

Now that the metadata is created, we can upload the data file to the given URL:

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/json" \
       -X PUT https://pto.example.com/raw/test/test001.json/data \
       --data-binary @test_data.json
{
    "__data": "https://pto.example.com/raw/test/test001.json/data",
    "__data_size": 37,
    "_file_type": "test",
    "_owner": "you@example.com",
    "_time_end": "2018-04-25T10:20:48Z",
    "_time_start": "2018-04-25T10:15:35Z",
    "purpose": "demonstrate file upload"
}

This echoes back the metadata for the file. Note here the new __data_size key, which gives the size of the data file in bytes.

Downloading Raw Data

While the current PTO implementation by convention always generates data URLs from metadata URLs by appending /data to the path, this is not guaranteed to always be the case, so it is important to check the __data key in the metadata for the file before downloading. Here we assign the data URL to a shell variable, then download from that URL:

$ DATAURL=`curl -s -H "Authorization: APIKEY abadc0de" \
                https://pto.example.com/raw/test/test001.json | \
           python3 -c 'import sys, json; print(json.load(sys.stdin)["__data"])'`
$ curl -H "Authorization: APIKEY abadc0de" $DATAURL > downloaded_file.json

Changing Metadata and Data

Metadata can be changed by uploading a new metadata object.

Once a file has been uploaded, its data can no longer be changed.

Files can currently not be deleted via the API, though a file can be fully deleted by removing the file and metadata file from the filesystem backing the raw data store. To mark a file (or a campaign) as no longer valid, use the _deprecated system metadata tag.
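For example, to deprecate the test001.json file uploaded above, one could re-upload its metadata with a _deprecated timestamp added. This is only a sketch: it assumes, as the statement above suggests, that a metadata PUT replaces the stored metadata object wholesale, and it uses an arbitrary deprecation timestamp.

$ cat deprecate_metadata.json
{
    "_time_start": "2018-04-25T10:15:35Z",
    "_time_end":   "2018-04-25T10:20:48Z",
    "purpose":     "demonstrate file upload",
    "_deprecated": "2018-06-07T09:00:00Z"
}

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/json" \
       -X PUT https://pto.example.com/raw/test/test001.json \
       --data-binary @deprecate_metadata.json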

Observation Access

The observation access API (resources under /obs) allows access to PTO observations. An observation is a tuple consisting of the following elements:

  • a time interval during which the observation is valid, consisting of a start time and an end time;
  • a path on which the observation is made, consisting of a sequence of path elements;
  • a condition observed on this path; and
  • an optional value associated with the condition.

More about the PTO's information model is given here.

Observations are grouped into observation sets. An observation set is a set of observations resulting from a single run of an analyzer on some input data (see Data Analysis, below). All observations in an observation set share the same metadata and provenance. Provenance provides information about the source data that was analyzed (in terms of raw data files and/or other observation sets stored in the PTO) and the analysis that was performed.

The resources and methods available thereon are summarized in the table below.

| Method | Resource | Permission | Description |
|--------|----------|------------|-------------|
| GET | /obs | read_obs | Retrieve URLs for observation sets as JSON |
| GET | /obs/by_metadata | read_obs | Retrieve URLs for observation sets by metadata |
| GET | /obs/conditions | read_obs | List conditions in the observation database |
| POST | /obs/create | write_obs | Create a new observation set |
| GET | /obs/<o> | read_obs | Retrieve metadata and provenance for o as JSON |
| PUT | /obs/<o> | write_obs | Update metadata and provenance for o as JSON |
| GET | /obs/<o>/data | read_obs_data | Retrieve obset file for o as NDJSON (by convention) |
| PUT | /obs/<o>/data | write_obs | Upload obset file for o as NDJSON (by convention) |

Metadata and Provenance

As with raw data files, observation sets have associated metadata; as with raw data files, arbitrary metadata can be set on observation sets by the analyses that create them. The same rules for reserved and virtual metadata names apply for observation sets as for raw data. The metadata for an observation set also contains information about the observation set's provenance.

Provenance is tracked by three metadata keys. _sources is an array of URLs referencing the raw data file or the observation set(s) from which the observations in an observation set are derived. _analyzer is a URL referring to the analyzer that created the observation set. _campaign is a URL referring to the campaign from which the original raw data was derived, and is only present if the observation set is derived only from observation sets / raw data in the same campaign.

The following reserved and virtual metadata keys are presently supported:

| Key | Description |
|-----|-------------|
| _sources | Array of PTO URLs of raw data sources and observation sets |
| _analyzer | URL of analyzer metadata |
| _conditions | Array of conditions declared in the observation set |
| _deprecated | If present, timestamp at which an observation set was marked deprecated |
| __obs_count | Count of observations in the observation set |
| __time_start | Timestamp of the first observation start time in the set |
| __time_end | Timestamp of the last observation end time in the set |
| __data | URL of the resource containing observation set data |

Querying Observation Sets by Metadata

The /obs/by_metadata resource lists links to observation sets based on the presence or value of arbitrary metadata keys, or on the values of the well-known source, analyzer, and condition metadata. The following query parameters are supported:

| Parameter | Description |
|-----------|-------------|
| k | Observation sets containing the given metadata key (with v, having a specific value) |
| v | Value to query for (use with k) |
| source | Observation sets derived from a source URL starting with the given prefix |
| analyzer | Observation sets derived from an analyzer whose metadata URL starts with the given prefix |
| condition | Observation sets declaring the given condition |

When multiple parameters are given, the intersection of observation sets fulfilling all parameters is returned.
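For example, a hypothetical request for all observation sets declaring the pto.test.ok condition and derived from the test campaign could look as follows; the response lists links to the matching observation sets:

$ curl -G -H "Authorization: APIKEY abadc0de" \
       https://pto.example.com/obs/by_metadata \
       --data-urlencode "condition=pto.test.ok" \
       --data-urlencode "source=https://pto.example.com/raw/test"

(-G makes curl append the --data-urlencode parameters to the URL as a query string rather than sending them in the request body.)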

Analyzer Metadata

Observation sets refer to the analyzer that created them via the _analyzer metadata key. This either contains a URL pointing at an analyzer metadata object, or a URL pointing at a source code repository (including a tag referring to a specific revision or commit), which itself contains analyzer metadata in a file named __pto_analyzer_metadata.json in its root directory.

Analyzer metadata is stored as a JSON object, which uses the following keys (some of which apply only to certain analyzer types and forms, as described below):

| Key | Description |
|-----|-------------|
| _repository | URL of source code repository, if not implicit |
| _owner | Identity (via email) of user or organization owning the analyzer |
| _file_types | File types consumable by a raw analyzer, as an array |
| _invocation | Command to run in repository root to invoke the analyzer, if local |
| _platform | Platform identifier; see interface description |

As with raw and observation metadata, all keys not beginning with _ are freeform, and may be used to store other information about the analyzer.

An analyzer can be one of two types. A raw analyzer (or normalizer) reads in raw data of one or more filetypes, and produces observations. A derived analyzer reads in observations from one or more sets, and produces observations. Raw analyzer metadata must contain a _file_types key, a JSON array listing PTO filetypes it can consume; derived analyzer metadata must not contain this key.

Analyzers can take one of two forms. A local analyzer is designed to run within the PTO, with direct access to raw data files or observations. It reads raw data or observation files on standard input, and produces observation files and observation set metadata on standard output. Local analyzer metadata must contain an _invocation key, which is a command to run in the repository root to invoke the analyzer. More on the interface for local analyzers is given here.

A local analyzer runtime is not yet available in the PTO; local analyzers can be run manually with the command-line tools described here.

A client analyzer is designed to use the PTO API to retrieve raw data and observation sets and upload its results. As it cannot be automatically invoked, client analyzer repositories should contain human-readable documentation for invocation. Client analyzer metadata must not contain an _invocation key.
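As a sketch, the __pto_analyzer_metadata.json for a hypothetical local raw analyzer (normalizer) consuming raw files of the test filetype might look like this; the invocation command and the freeform description key are placeholders:

{
    "_owner": "analyzers@example.com",
    "_file_types": ["test"],
    "_invocation": "python3 normalize.py",
    "description": "example normalizer for test raw data"
}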

See the analyzer interface description for more.

Observation API usage

As above, we use curl to illustrate the usage of the PTO observation API. We assume the API is rooted at https://pto.example.com/, and that the API key abadc0de holds the permissions read_obs and write_obs.

Uploading an observation set

First, create observation set metadata, and upload it to get a new observation set ID. The observation set metadata must contain the full list of conditions contained in the observation set, a link to analyzer metadata (see the analyzer interface), and a list of links to raw data sources from which the observations were derived.

$ cat obs_metadata.json
{
  "_conditions": ["pto.test.ok","pto.test.not_ok"],
  "_analyzer":   "https://gitlab.example.com/analyzers/test_analyzer/raw/master/analyzer_meta.json"
  "_sources":    ["https://pto.example.com/raw/test"]
}

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/json" \
       -X POST https://pto.example.com/obs/create \
       --data-binary @obs_metadata.json
{
  "__data": "http://localhost:8383/obs/1/data",
  "__link": "http://localhost:8383/obs/1",
  "__modified": "2018-06-07T08:29:26Z",
  "__created": "2018-06-07T08:29:26Z",
  "_conditions": ["pto.test.ok","pto.test.not_ok"],
  "_analyzer":   "https://gitlab.example.com/analyzers/test_analyzer/raw/master/analyzer_meta.json"
  "_sources":    ["https://pto.example.com/raw/test"]
}

This returns the metadata, including new system metadata. The __link key here is a permanent link to the created observation set, including its ID, and the __data key is a path to which observation set data can be uploaded. Let's do that.

$ curl -H "Authorization: APIKEY abadc0de" \
       -H "Content-Type: application/vnd.mami.ndjson" \
       -X PUT https://pto.example.com/obs/1/data \
       --data-binary @obs_data.ndjson
{
  "__data": "http://localhost:8383/obs/1/data",
  "__link": "http://localhost:8383/obs/1",
  "__modified": "2018-06-07T08:31:14Z",
  "__created": "2018-06-07T08:31:14Z",
  "__obs_count": 2,
  "_conditions": ["pto.test.ok","pto.test.not_ok"],
  "_analyzer":   "https://gitlab.example.com/analyzers/test_analyzer/raw/master/analyzer_meta.json"
  "_sources":    ["https://pto.example.com/raw/test"]
}

As when uploading a raw data file, the metadata is echoed back; the new __obs_count metadata key shows the number of observations that have been stored.
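Retrieving the uploaded observations works as for raw data: fetch the URL given in the __data key of the observation set metadata, which requires the read_obs_data permission. A minimal sketch, assuming the observation set created above and the conventional /obs/<o>/data data path:

$ curl -H "Authorization: APIKEY abadc0de" \
       https://pto.example.com/obs/1/data > downloaded_obs.ndjson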

Observation Query

The observation query API (resources under /query) allows the submission of queries to the PTO, to retrieve observations and observation aggregates meeting certain criteria from the PTO's observation database.

| Method | Resource | Permission | Description |
|--------|----------|------------|-------------|
| POST or GET | /query/submit | submit_query_obs or submit_query_group | Submit a query |
| GET | /query | read_query | List currently cached and pending queries |
| GET | /query/<q> | read_query | Get query metadata, including ETA for pending queries |
| GET | /query/<q>/result | read_query | Get query results (by convention) |
| PUT | /query/<q> | update_query | Update query metadata |

Queries can be submitted by POSTing to the /query/submit resource (or via GET, with the same parameters in the URL query string). The query itself is defined by the parameters in the application/x-www-form-urlencoded content. The parameters are summarized below:

| Parameter | Semantics | Multiple? | Meaning |
|-----------|-----------|-----------|---------|
| time_start | temporal | no | Select observations starting at or after the given start time |
| time_end | temporal | no | Select observations ending at or before the given end time |
| set | select | yes | Select observations in the given observation set ID |
| on_path | select | yes | Select observations with the given element in the path |
| source | select | yes | Select observations with the given element at the start of the path |
| target | select | yes | Select observations with the given element at the end of the path |
| condition | select | yes | Select observations with the given condition, with wildcards |
| feature | select | yes | Select observations with the given condition feature |
| aspect | select | yes | Select observations with the given condition aspect |
| group | group | yes | Group observations and return counts by group |
| intersect_condition | set | yes | Group observations by path, select paths by set intersection on conditions |
| option | options | yes | Specify a query option |

All parameters with temporal semantics must be present, and are used to bound the query in time. Parameters with select semantics may be given to filter observations. If multiple instances of a select parameter are given, any of the values will match; however, an observation must match at least one of the values for each distinct parameter given (i.e., the query language supports AND-of-OR semantics). Parameters with group or set semantics, as well as the option parameter, may modify the type of query and the format of its results; see the Results section below.
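For example, a hypothetical selection query for all pto.test.ok observations within a one-day window could be submitted as follows (assuming an API key holding the submit_query_obs permission); the response contains the query metadata described under Metadata below, including the query's __state and a __link to poll:

$ curl -H "Authorization: APIKEY abadc0de" \
       -X POST https://pto.example.com/query/submit \
       --data-urlencode "time_start=2018-04-25T00:00:00Z" \
       --data-urlencode "time_end=2018-04-26T00:00:00Z" \
       --data-urlencode "condition=pto.test.ok"

(--data-urlencode sends the parameters as application/x-www-form-urlencoded content, URL-encoding the values.)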

Query Options

The option parameter is used to modify the behavior of queries. Multiple options may be present. The following options are presently supported:

| Option | Value | Behavior |
|--------|-------|----------|
| sets_only | | Return links to observation sets containing observations answering the query, instead of observation data directly |
| count_targets | | Group queries should count distinct targets, not distinct observations |

Metadata

When a query is submitted, it goes into the query cache. The query cache holds the query metadata until the query has been scheduled to run. Once it has run, the query metadata will be updated to point to the result and to the observation sets covered by the query.

| Key | Description |
|-----|-------------|
| __encoded | URL-encoded parameters from which the query was generated |
| __state | Query state; see below |
| __link | URL pointing to canonical query metadata, when available |
| __result | URL of the resource containing the complete result, when available |
| __sources | Array of PTO URLs of observation sets covered by the query, when available |
| _ext_ref | External reference for a permanence request; see below |

A query can have one of the following states:

| State | Meaning |
|-------|---------|
| submitted | Submitted, but not yet running |
| pending | Running and awaiting results |
| failed | Abnormally ended without returning results |
| complete | Results are available |
| permanent | Results are available and cached results will be stored permanently |
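A client typically polls the query metadata resource (the __link returned at submission) until __state becomes complete, and then fetches the URL given in __result. A minimal sketch, with <q> standing in for the query identifier taken from __link:

$ curl -H "Authorization: APIKEY abadc0de" "https://pto.example.com/query/<q>"
# ... repeat until the returned metadata contains "__state": "complete" ...
$ curl -H "Authorization: APIKEY abadc0de" "https://pto.example.com/query/<q>/result"

(The second URL follows the /query/<q>/result convention; the URL given in the __result metadata key should be preferred when available.)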

Results

The type of the query determines the format of the results, as below:

Observation Selection Queries

A query created without any group_by or intersect_condition parameters and without the sets_only option is a selection query. The observations returned by this query are those within the interval between the time_start and time_end parameters which match the selection parameters.

[EDITOR'S NOTE: matching rules go here, describe condition wildcards.]

The result of a selection query is a JSON object, the fields of which are as follows:

| Key | Value |
|-----|-------|
| prev | Link to previous page (see Pagination) |
| next | Link to next page (see Pagination) |
| obs | JSON array containing observations in OSF format |

Observation Set Selection Queries

A query created without any group_by or intersect_condition parameters and with the sets_only option is an observation set selection query. Instead of individual observations, the result lists links to the observation sets containing observations within the interval between the time_start and time_end parameters which match the selection parameters. The result is a JSON object, the fields of which are as follows:

| Key | Value |
|-----|-------|
| prev | Link to previous page (see Pagination) |
| next | Link to next page (see Pagination) |
| sets | JSON array containing links to observation sets containing observations answering the query |

Condition Set Intersection Queries

NOTE: Condition set intersection queries are not yet supported by the PTO. This section documents future functionality.

A query created with one or more intersect_condition parameters is a set intersection query. First, observations matching the selection parameters are selected. Then the observations are grouped by path, and only those paths that fall within every set defined by the intersect_condition parameters are selected.

An intersect_condition parameter is either a full condition name (i.e., without wildcards), or a full condition name prefixed with the character ! (URL-encoded %21). In the former case, the set contains all paths for which there is at least one observation with the specified condition. In the latter case, the set is negated, i.e. it contains all paths for which there are no observations with the specified condition. For example, a query with two intersect_condition parameters with values foo.bar and !foo.baz will select paths where there is at least one foo.bar observation and no foo.baz observations.
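Since condition set intersection queries are not yet supported, the following is only a sketch of how the example above might be submitted; note that curl's --data-urlencode takes care of encoding the ! prefix as %21, and that an API key holding the appropriate submit_query permission is assumed:

$ curl -H "Authorization: APIKEY abadc0de" \
       -X POST https://pto.example.com/query/submit \
       --data-urlencode "time_start=2018-04-25T00:00:00Z" \
       --data-urlencode "time_end=2018-04-26T00:00:00Z" \
       --data-urlencode "intersect_condition=foo.bar" \
       --data-urlencode "intersect_condition=!foo.baz"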

The result of a set intersection query is a JSON object, the fields of which are as follows:

| Key | Value |
|-----|-------|
| prev | Link to previous page (see Pagination) |
| next | Link to next page (see Pagination) |
| paths | JSON array containing paths as strings |

Aggregation Queries

A query created with one or more group_by parameters is an aggregation query. The result contains, for each group, the count of observations selected by the select parameters. The following group_by parameters are available:

| Value | Meaning |
|-------|---------|
| year | Count by year of time_start |
| month | Count by year/month of time_start |
| day | Count by year/month/day of time_start |
| hour | Count by year/month/day/hour of time_start |
| week | Count by year/week (weeks starting Monday) of time_start |
| week_day | Count by day of week of time_start (7 groups) |
| day_hour | Count by hour of day of time_start (24 groups) |
| condition | Count by condition |
| feature | Count by feature (first component of condition) |
| aspect | Count by aspect (all but last component of condition) |
| value | Count by condition value |
| source | Count by first element in path |
| target | Count by last element in path |

The result of an aggregation query is a JSON object, the fields of which are as follows:

| Key | Value |
|-----|-------|
| prev | Link to previous page (see Pagination) |
| next | Link to next page (see Pagination) |
| groups | List of JSON arrays, each giving the group value(s) with the count in the final position |
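For instance, the groups array for a hypothetical aggregation query grouped by condition might look like the following (the conditions are those used in the examples above; the counts are purely illustrative):

{
    "groups": [
        ["pto.test.not_ok", 23],
        ["pto.test.ok", 4711]
    ]
}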

Pagination

[EDITOR'S NOTE: review me]

API resources which return lists of URLs or results in arrays in JSON objects support pagination. By default, if more than 20 items would be returned in the list on which a resource is centered, the results are paginated: the top-level JSON object will gain a next key if the result is not the last page, and a prev key if the result is not the first page. Pagination is controlled using the following GET parameters:

| Parameter | Meaning |
|-----------|---------|
| page | Page number, beginning with 0. Defaults to 0. |
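For example, fetching the second page of observation set links from the /obs resource (assuming the default page size of 20 items described above) is a matter of adding the page parameter:

$ curl -H "Authorization: APIKEY abadc0de" "https://pto.example.com/obs?page=1"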

Pagination is applied to the following elements on the following resources:

| Resource | Element paginated | Pagination default |
|----------|-------------------|--------------------|
| /raw/<c> | files | tbd |
| /obs | sets | tbd |
| /obs/by_metadata | sets | tbd |
| /query | queries | tbd |
| /query/<q>/result | results | tbd |