Path Transparency Observatory API specification, version 3
The API consists of three applications: raw data access and upload, observation access, and observation query. The interface to each application is made up of certain resources accessed in a RESTful way; these resources are specified below.
All applications use API key based access control. An API key is associated with a set of permissions, scoped either to permitted operations on the observatory database or to access to raw data for specific campaigns. The permission required for a given method on a given resource is noted throughout this document.
API keys are given in the HTTP Authorization
request header, which must
consist of the string APIKEY
followed by whitespace and the API key as a
string.
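As a minimal sketch of the header format described above, a client might build the header as follows. The helper name auth_header is hypothetical and not part of the PTO; the key abadc0de is the example key used later in this document.

```python
def auth_header(api_key: str) -> dict:
    """Build the Authorization header: the string APIKEY, whitespace, then the key."""
    if not api_key:
        raise ValueError("API key must be a non-empty string")
    return {"Authorization": f"APIKEY {api_key}"}


if __name__ == "__main__":
    print(auth_header("abadc0de"))
```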
The raw data access and upload API (resources under /raw
) allows the upload of
raw data files to the PTO, and the later retrieval of those files. Each file is
associated with a campaign -- a group of files related to a single measurement
campaign from a single data source. Campaigns and files have associated
metadata which is used by the PTO and analysis modules to save metadata about
the raw data; this may also be used by users of the PTO to store information
about raw files and campaigns.
The resources and methods available thereon are summarized in the table below.
Method | Resource | Permission | Description |
---|---|---|---|
GET | /raw | raw_metadata | Retrieve URLs for campaigns as JSON |
GET | /raw/<c> | raw_metadata | Retrieve metadata for campaign c as JSON |
PUT | /raw/<c> | write_raw:<c> | Write metadata for campaign c as JSON |
GET | /raw/<c>/<f> | raw_metadata | Retrieve metadata for file f in c as JSON |
PUT | /raw/<c>/<f> | write_raw:<c> | Write metadata for file f in c as JSON |
GET | /raw/<c>/<f>/data | read_raw:<c> | Retrieve content for file f in c (by convention) |
PUT | /raw/<c>/<f>/data | write_raw:<c> | Write content for file f in c (by convention) |
DELETE | /raw/<c>/<f> | write_raw:<c> | Delete a file and its metadata |
DELETE | /raw/<c> | write_raw:<c> | Delete a campaign and all its files |
Associated with each file and each campaign in the raw data store is a metadata object. This metadata object is a set of key-value pairs, presented by the API as a JSON object. Certain keys in this object are reserved for use by the system, or are generated by the system. Other metadata keys can be freely used by the creator of a raw data file to communicate metadata information with a future analysis process that will be run on the raw data, or for future retrieval of the file. Metadata key behavior is determined by the metadata key name:
- All metadata keys whose names begin with an underscore (_) are reserved for system use; they may be written to, but have a special meaning to the PTO itself.
- All metadata keys whose names begin with a double underscore (__) are virtual, and generated by the system; they may not be written to.
- All other metadata key names are free for use by users and analysis modules.
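The naming rules above can be sketched as a small classifier. The function names classify_key and writable are illustrative only; the PTO does not expose such helpers.

```python
def classify_key(name: str) -> str:
    """Return 'virtual', 'reserved', or 'free' for a metadata key name."""
    # Check the longer prefix first: every virtual key also starts with "_".
    if name.startswith("__"):
        return "virtual"
    if name.startswith("_"):
        return "reserved"
    return "free"


def writable(name: str) -> bool:
    """Virtual keys are generated by the system and may not be written to."""
    return classify_key(name) != "virtual"


if __name__ == "__main__":
    for key in ("__data", "_owner", "purpose"):
        print(key, classify_key(key), writable(key))
```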
Files inherit metadata from their containing campaign. If a file's metadata and its containing campaign's metadata both have a value for the same key, the value associated with the file overrides the value inherited from the campaign for that file.
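The inheritance rule amounts to a dictionary merge in which file values win. The following is a sketch of that rule only, not the PTO's implementation; effective_metadata is a hypothetical name.

```python
def effective_metadata(campaign: dict, file: dict) -> dict:
    """Merge campaign and file metadata; file values override inherited ones."""
    merged = dict(campaign)   # start from the inherited campaign metadata
    merged.update(file)       # keys set on the file take precedence
    return merged


if __name__ == "__main__":
    campaign = {"_owner": "you@example.com", "_file_type": "test"}
    file = {"_file_type": "obs", "purpose": "demonstrate file upload"}
    print(effective_metadata(campaign, file))
```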
The following reserved and virtual metadata keys are presently supported:
Key | Description |
---|---|
_file_type | PTO filetype. See Filetypes, below. |
_owner | Identity (via email) of user or organization owning the file/campaign |
_time_start | Timestamp of first observation in the raw data file, in ISO8601 format |
_time_end | Timestamp of last observation in the raw data file, in ISO8601 format |
_deprecated | If present, timestamp at which the file or campaign was marked deprecated |
__data | URL of the resource containing file data. |
__data_size | Size of the file in bytes. 0 if the data file has not been uploaded. |
Though the data resource is by convention accessible by appending /data
to the
path of the metadata resource, the system may at any time place data at another
path; therefore, clients should only upload data to the path given in the
__data
metadata key.
Every raw data file has a filetype, given in the _file_type
key, which the
PTO uses to determine how to handle files internally, and which analysis modules
use to determine how to read and whether they are interested in raw data files.
Each filetype is associated with a MIME type, and the Content-Type
header on
data uploads via PUT must match the filetype associated with the file.
Often, all the files within a campaign will share the same filetype. In this case, filetype information is set in campaign metadata, not in individual file metadata.
While filetypes are extensible, the filetypes supported by the PTO as installed are listed below:
Filetype | MIME type | Description |
---|---|---|
obs-bz2 | application/bzip2 | Compressed observations in OSF |
obs | application/vnd.mami.ndjson | Uncompressed observations in OSF |
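A client can work out the Content-Type for an upload from the filetype, taking the campaign's filetype when the file does not set its own. This sketch hard-codes the two filetypes listed above; the names FILETYPE_MIME and upload_content_type are hypothetical, and a real installation may define further filetypes.

```python
# Filetype-to-MIME mapping copied from the table above.
FILETYPE_MIME = {
    "obs-bz2": "application/bzip2",
    "obs": "application/vnd.mami.ndjson",
}


def upload_content_type(file_meta: dict, campaign_meta: dict) -> str:
    """Return the MIME type to send when uploading data for a file."""
    # A file-level _file_type overrides the one inherited from the campaign.
    filetype = file_meta.get("_file_type", campaign_meta.get("_file_type"))
    if filetype is None:
        raise ValueError("no _file_type in file or campaign metadata")
    if filetype not in FILETYPE_MIME:
        raise ValueError(f"unknown filetype {filetype!r}")
    return FILETYPE_MIME[filetype]


if __name__ == "__main__":
    print(upload_content_type({}, {"_file_type": "obs"}))
```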
We use curl to illustrate the usage of the PTO raw
API. We assume the API is rooted at https://pto.example.com/
, and that the
API key abadc0de
holds the permissions raw_metadata
, read_raw:test
, and
write_raw:test
. (In these examples, the output of curl is prettyprinted via
python3 -m json.tool
, not shown)
To list the campaigns for which raw data is stored, simply fetch the /raw
resource.
$ curl -H "Authorization: APIKEY abadc0de" https://pto.example.com/raw
{
"campaigns": []
}
This PTO instance is empty: no campaigns are stored here. To create the test
campaign, which we are preauthorized to do, simply upload the campaign's
metadata at the campaign's path:
$ cat test_campaign.json
{
"_owner": "you@example.com",
"_file_type": "test"
}
$ curl -H "Authorization: APIKEY abadc0de" \
-H "Content-Type: application/json" \
-X PUT https://pto.example.com/raw/test \
--data-binary @test_campaign.json
{
"_file_type":"test",
"_owner":"you@example.com"
}
The reply echoes back the metadata uploaded. A campaign's metadata can be changed by simply uploading new metadata.
We can verify that our campaign has been created by listing campaigns again:
$ curl -H "Authorization: APIKEY abadc0de" https://pto.example.com/raw
{
"campaigns": [
"https://pto.example.com/raw/test"
]
}
Once a campaign has been created, uploading raw data to it is a two-step process: creating a new file by uploading its metadata, then uploading the data associated with the file.
For the purposes of this example, we'll upload a single test data file containing some JSON formatted data. First the metadata:
$ cat test_metadata.json
{
"_time_start": "2018-04-25T10:15:35Z",
"_time_end": "2018-04-25T10:20:48Z",
"purpose": "demonstrate file upload"
}
$ curl -H "Authorization: APIKEY abadc0de" \
-H "Content-Type: application/json" \
-X PUT https://pto.example.com/raw/test/test001.json \
--data-binary @test_metadata.json
{
"__data": "https://pto.example.com/raw/test/test001.json/data",
"_file_type": "test",
"_owner": "you@example.com",
"_time_end": "2018-04-25T10:20:48Z",
"_time_start": "2018-04-25T10:15:35Z",
"purpose": "demonstrate file upload"
}
Here the uploaded metadata, including keys inherited from the campaign, is
echoed back from the server, along with a link to which data can be uploaded
(in the __data
key).
Now that the metadata is created, we can upload the data file to the given URL:
$ curl -H "Authorization: APIKEY abadc0de" \
-H "Content-Type: application/json" \
-X PUT https://pto.example.com/raw/test/test001.json/data \
--data-binary @test_data.json
{
"__data": "https://pto.example.com/raw/test/test001.json/data",
"__data_size": 37,
"_file_type": "test",
"_owner": "you@example.com",
"_time_end": "2018-04-25T10:20:48Z",
"_time_start": "2018-04-25T10:15:35Z",
"purpose": "demonstrate file upload"
}
This echoes back the metadata for the file. Note here the new __data_size
key, which gives the size of the data file in bytes.
While the current PTO implementation by convention always generates data URLs
from metadata URLs by appending /data
to the path, this is not guaranteed to
always be the case, so it's important to check the __data
key in the
metadata for the file before downloading. Here we assign this to a shell
variable, then download from that URL:
$ DATAURL=$(curl -s -H "Authorization: APIKEY abadc0de" \
https://pto.example.com/raw/test/test001.json | \
python3 -c 'import sys, json; print(json.load(sys.stdin)["__data"])')
$ curl -H "Authorization: APIKEY abadc0de" "$DATAURL" > downloaded_file.json
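The same lookup can be sketched in Python using only the standard library. The function below builds the download request from the __data key but does not send it; data_request is a hypothetical helper name, and the URL and key are the example values used in this document.

```python
import urllib.request


def data_request(file_meta: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for a file's data."""
    # Always take the URL from __data instead of appending /data ourselves.
    if "__data" not in file_meta:
        raise ValueError("file metadata has no __data key")
    return urllib.request.Request(
        file_meta["__data"],
        headers={"Authorization": f"APIKEY {api_key}"},
        method="GET",
    )


if __name__ == "__main__":
    meta = {"__data": "https://pto.example.com/raw/test/test001.json/data"}
    req = data_request(meta, "abadc0de")
    print(req.get_method(), req.full_url)
```

Passing the returned request to urllib.request.urlopen would perform the download.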
Metadata can be changed by uploading a new metadata object.
Once a file has been uploaded, its data can no longer be changed.
Files can currently not be deleted via the API, though a file can be fully
deleted by removing the file and metadata file from the filesystem backing the
raw data store. To mark a file (or a campaign) as no longer valid, use the
_deprecated
system metadata tag.
The observation access API (resources under /obs
) allows access to PTO
observations. An observation is a tuple consisting of the following elements:
- a time interval during which the observation is valid, consisting of a start time and an end time;
- a path on which the observation is made, consisting of a sequence of path elements;
- a condition observed on this path; and
- an optional value associated with the condition.
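For illustration, the tuple described above could be modeled in client code as follows. This is only a sketch of the information model: the class and field names are assumptions, path elements are assumed to be strings, and the layout is not the OSF serialization used by the PTO.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Observation:
    time_start: datetime          # start of the validity interval
    time_end: datetime            # end of the validity interval
    path: Tuple[str, ...]         # sequence of path elements
    condition: str                # condition observed on the path
    value: Optional[str] = None   # optional value for the condition

    def __post_init__(self):
        if self.time_end < self.time_start:
            raise ValueError("interval ends before it starts")


if __name__ == "__main__":
    obs = Observation(
        datetime(2018, 4, 25, 10, 15, 35),
        datetime(2018, 4, 25, 10, 20, 48),
        ("192.0.2.1", "*", "198.51.100.7"),
        "pto.test.ok",
    )
    print(obs)
```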
More about the PTO's information model is given here
Observations are grouped into observation sets. An observation set is a set of observations resulting from a single run of an analyzer on some input data (see Data Analysis, below). All observations in an observation set share the same metadata and provenance. Provenance provides information about the source data that was analyzed (in terms of raw data files and/or other observation sets stored in the PTO) and the analysis that was performed.
The resources and methods available thereon are summarized in the table below.
Method | Resource | Permission | Description |
---|---|---|---|
GET | /obs | read_obs | Retrieve URLs for observation sets as JSON |
GET | /obs/by_metadata | read_obs | Retrieve URLs for observation sets by metadata |
GET | /obs/conditions | read_obs | List conditions in observation database |
POST | /obs/create | write_obs | Create new observation set |
GET | /obs/<o> | read_obs | Retrieve metadata and provenance for o as JSON |
PUT | /obs/<o> | write_obs | Update metadata and provenance for o as JSON |
GET | /obs/<o>/data | read_obs_data | Retrieve obset file for o as NDJSON (by convention) |
PUT | /obs/<o>/data | write_obs | Upload obset file for o as NDJSON (by convention) |
As with raw data files, observation sets have associated metadata; as with raw data files, arbitrary metadata can be set on observation sets by the analyses that create them. The same rules for reserved and virtual metadata names apply for observation sets as for raw data. The metadata for an observation set also contains information about the observation set's provenance.
Provenance is tracked by three metadata keys. _sources
is an array of URLs
referencing the raw data file or the observation set(s) from which the
observations in an observation set are derived. _analyzer
is a URL referring
to the analyzer that created the observation set. _campaign
is a URL referring
to the campaign from which the original raw data was derived, and is only
present if the observation set is derived only from observation sets / raw data
in the same campaign.
The following reserved and virtual metadata keys are presently supported:
Key | Description |
---|---|
_sources | Array of PTO URLs of raw data sources and observation sets |
_analyzer | URL of analyzer metadata |
_conditions | Array of conditions declared in the observation set |
_deprecated | If present, timestamp at which an observation set was marked deprecated |
__obs_count | Count of observations in the observation set |
__time_start | Timestamp of first observation start time in set |
__time_end | Timestamp of last observation end time in set |
__data | URL of the resource containing observation set data |
The /obs/by_metadata
resource lists links to observation sets selected by the
presence or value of a metadata key, or by the set's sources, analyzer, or declared conditions.
The following query parameters are supported:
Key | Description |
---|---|
k | Obsets containing the given metadata key (with v, having a specific value) |
v | Value to query (use with k) |
source | Obsets derived from a source URL starting with a given prefix |
analyzer | Obsets derived from an analyzer whose metadata URL starts with a given prefix |
condition | Obsets declaring a given condition |
When multiple parameters are given, the intersection of observation sets fulfilling all parameters is returned.
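A query URL for this resource can be assembled with the standard library. The sketch below assumes the example root https://pto.example.com/ used elsewhere in this document; by_metadata_url is a hypothetical helper, and the check that v is only given together with k is an interpretation of the table above.

```python
from urllib.parse import urlencode

ALLOWED = {"k", "v", "source", "analyzer", "condition"}


def by_metadata_url(base: str, **params: str) -> str:
    """Build a /obs/by_metadata URL from the supported query parameters."""
    unknown = set(params) - ALLOWED
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    if "v" in params and "k" not in params:
        raise ValueError("v is only meaningful together with k")
    return f"{base.rstrip('/')}/obs/by_metadata?{urlencode(params)}"


if __name__ == "__main__":
    print(by_metadata_url("https://pto.example.com/",
                          k="purpose", condition="pto.test.ok"))
```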
Observation sets record how they were created via the _analyzer
metadata key.
This contains either a URL pointing at an analyzer metadata object, or a URL
pointing at a source code repository, including a tag referring to a specific
revision or commit, which itself contains analyzer metadata in a file named
__pto_analyzer_metadata.json
in its root directory.
Analyzer metadata is stored as a JSON object, which uses the following reserved keys (which of them are required depends on the type and form of the analyzer, as described below):
Key | Description |
---|---|
_repository | URL of source code repository, if not implicit |
_owner | Identity (via email) of user or organization owning the analyzer |
_file_types | File types consumable by raw analyzer, as array |
_invocation | Command to run in repository root to invoke the analyzer, if local |
_platform | Platform identifier; see interface description |
As with raw and observation metadata, all keys not beginning with _
are
freeform, and may be used to store other information about the analyzer.
An analyzer can be one of two types. A raw analyzer (or normalizer) reads
in raw data of one or more filetypes, and produces observations. A derived
analyzer reads in observations from one or more sets, and produces
observations. Raw analyzer metadata must contain a _file_types
key, a JSON
array listing PTO filetypes it can consume; derived analyzer metadata must not
contain this key.
Analyzers can take one of two forms. A local analyzer is designed to run
within the PTO, with direct access to raw data files or observations. It reads
raw data or observation files on standard input, and produces
observation files and observation set metadata on standard output. Local
analyzer metadata must contain an _invocation
key, which is a command to run
in the repository root to invoke the analyzer. More on the interface for local
analyzers is given here.
A local analyzer runtime is not yet available in the PTO; local analyzers can be run manually with the command-line tools described here.
A client analyzer is designed to use the PTO API to retrieve raw data and
observation sets and upload its results. As it cannot be automatically
invoked, client analyzer repositories should contain human-readable
documentation for invocation. Client analyzer metadata must not contain an
_invocation
key.
See the analyzer interface description for more.
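The two distinctions above (raw versus derived, local versus client) follow from the presence of two metadata keys, which can be sketched as below. analyzer_kind is a hypothetical helper name, and the metadata values in the usage lines are made up for illustration.

```python
def analyzer_kind(meta: dict) -> tuple:
    """Classify analyzer metadata as (raw|derived, local|client)."""
    # Raw analyzers must declare _file_types; derived analyzers must not.
    input_kind = "raw" if "_file_types" in meta else "derived"
    # Local analyzers must declare _invocation; client analyzers must not.
    form = "local" if "_invocation" in meta else "client"
    return input_kind, form


if __name__ == "__main__":
    print(analyzer_kind({"_file_types": ["obs"], "_invocation": "./run"}))
    print(analyzer_kind({"_owner": "you@example.com"}))
```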
As above, we use curl to illustrate the usage of the
PTO observation API. We assume the API is rooted at
https://pto.example.com/
, and that the API key abadc0de
holds the
permissions read_obs
and write_obs
.
First, create observation set metadata, and upload it to get a new observation set ID. The observation set metadata must contain the full list of conditions contained in the observation set, a link to analyzer metadata (see the analyzer interface), and a list of links to raw data sources from which the observations were derived:
$ cat obs_metadata.json
{
"_conditions": ["pto.test.ok","pto.test.not_ok"],
"_analyzer": "https://gitlab.example.com/analyzers/test_analyzer/raw/master/analyzer_meta.json",
"_sources": ["https://pto.example.com/raw/test"]
}
$ curl -H "Authorization: APIKEY abadc0de" \
-H "Content-Type: application/json" \
-X POST https://pto.example.com/obs/create \
--data-binary @obs_metadata.json
{
"__data": "https://pto.example.com/obs/1/data",
"__link": "https://pto.example.com/obs/1",
"__modified": "2018-06-07T08:29:26Z",
"__created": "2018-06-07T08:29:26Z",
"_conditions": ["pto.test.ok","pto.test.not_ok"],
"_analyzer": "https://gitlab.example.com/analyzers/test_analyzer/raw/master/analyzer_meta.json",
"_sources": ["https://pto.example.com/raw/test"]
}
This returns the metadata, including new system metadata. The __link
key here is a permanent link to the created observation set, including its ID, and the __data
key is a path to which observation set data can be uploaded. Let's do that.
$ curl -H "Authorization: APIKEY abadc0de" \
-H "Content-Type: application/vnd.mami.ndjson" \
-X PUT https://pto.example.com/obs/1/data \
--data-binary @obs_data.ndjson
{
"__data": "https://pto.example.com/obs/1/data",
"__link": "https://pto.example.com/obs/1",
"__modified": "2018-06-07T08:31:14Z",
"__created": "2018-06-07T08:31:14Z",
"__obs_count": 2,
"_conditions": ["pto.test.ok","pto.test.not_ok"],
"_analyzer": "https://gitlab.example.com/analyzers/test_analyzer/raw/master/analyzer_meta.json",
"_sources": ["https://pto.example.com/raw/test"]
}
As with the __data_size key after a raw data upload, the new __obs_count
metadata key shows
the number of observations that have been stored.
The observation query API (resources under /query
) allows the submission of
queries to the PTO, to retrieve observations and observation aggregates
meeting certain criteria from the PTO's observation database.
Method | Resource | Permission | Description |
---|---|---|---|
POST or GET | /query/submit | submit_query_obs or submit_query_group | Submit a query |
GET | /query | read_query | List currently cached and pending queries |
GET | /query/<q> | read_query | Get query metadata, including ETA for pending queries |
GET | /query/<q>/result | read_query | Get query results (by convention) |
PUT | /query/<q> | update_query | Update query metadata |
Queries can be submitted by POSTing to the /query/submit resource. The query
itself is defined by the parameters in the POSTed
application/x-www-form-urlencoded
content. The parameters are summarized
below:
Parameter | Semantics | Multiple? | Meaning |
---|---|---|---|
time_start | temporal | no | Select observations starting at or after the given start time |
time_end | temporal | no | Select observations ending at or before the given end time |
set | select | yes | Select observations within the observation set with the given ID |
on_path | select | yes | Select observations with the given element in the path |
source | select | yes | Select observations with the given element at the start of the path |
target | select | yes | Select observations with the given element at the end of the path |
condition | select | yes | Select observations with the given condition, with wildcards |
feature | select | yes | Select observations with the given condition feature |
aspect | select | yes | Select observations with the given condition aspect |
group | group | yes | Group observations and return counts by group |
intersect_condition | set | yes | Group observations by path, select paths by set intersection on conditions |
option | options | yes | Specify a query option |
All parameters with temporal semantics must be present, and are used to bound the query in time. Parameters with select semantics may be given to filter observations. If multiple instances of a select parameter are given, an observation matches that parameter if it matches any of the values; however, an observation must match at least one of the values for each distinct parameter given (i.e., the query language supports AND-of-OR semantics). Parameters with group or set semantics, as well as the option parameter, may modify the type of query and the format of its results; see the Results section below.
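The AND-of-OR rule can be sketched as a predicate over one observation. This is an illustration of the matching logic only: it assumes exact string equality (no condition wildcards), and the function name and the shape of obs_fields are hypothetical.

```python
def matches(obs_fields: dict, select: dict) -> bool:
    """AND across distinct select parameters, OR across each one's values.

    obs_fields maps a select parameter name to the set of values the
    observation exhibits; select maps a parameter name to the list of
    values given in the query.
    """
    return all(
        any(value in obs_fields.get(name, set()) for value in accepted)
        for name, accepted in select.items()
    )


if __name__ == "__main__":
    obs = {"condition": {"pto.test.ok"},
           "on_path": {"192.0.2.1", "198.51.100.7"}}
    query = {"condition": ["pto.test.ok", "pto.test.not_ok"],
             "on_path": ["192.0.2.1"]}
    print(matches(obs, query))
```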
The option
parameter is used to modify the behavior of queries. Multiple options may be present. The following options are presently supported:
Option Value | Behavior |
---|---|
sets_only | Return links to observation sets containing observations answering the query, instead of observation data directly |
count_targets | Group queries should count distinct targets, not distinct observations |
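Because select and option parameters may repeat, a query body is easiest to build from a list of key-value pairs. The sketch below produces application/x-www-form-urlencoded content for /query/submit; encode_query is a hypothetical helper, and the timestamps are example values.

```python
from urllib.parse import parse_qs, urlencode


def encode_query(time_start, time_end, select=None, options=()):
    """Encode a query as form content, repeating keys for multiple values."""
    pairs = [("time_start", time_start), ("time_end", time_end)]
    for name, values in (select or {}).items():
        pairs.extend((name, value) for value in values)
    pairs.extend(("option", option) for option in options)
    return urlencode(pairs)


if __name__ == "__main__":
    body = encode_query(
        "2018-01-01T00:00:00Z", "2018-12-31T23:59:59Z",
        select={"condition": ["pto.test.ok", "pto.test.not_ok"]},
        options=["sets_only"],
    )
    print(body)
    print(parse_qs(body))   # round-trip to check the repeated keys
```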
When a query is submitted, it goes into the query cache. The query cache holds the query metadata until the query has been scheduled to run. Once it has run, the query metadata will be updated to point to the result and the observation sets the query answers. Query metadata uses the following reserved and virtual keys:
Key | Description |
---|---|
__encoded | URL-encoded parameters from which the query was generated |
__state | Query state; see below |
__link | URL pointing to canonical query metadata, when available |
__result | URL of the resource containing complete result, when available |
__sources | Array of PTO URLs of observation sets covered by the query, when available |
_ext_ref | External reference for a permanence request; see below |
A query can have one of the following states:
State | Meaning |
---|---|
submitted | Submitted, but not yet running |
pending | Running and awaiting results |
failed | Abnormally ended without returning results |
complete | Results are available |
permanent | Results are available and cached results will be stored permanently |
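A client polling query metadata can decide what to do next from the __state key. The sketch below maps each state in the table to a client action; the action names and the function next_step are assumptions made for illustration.

```python
RESULT_STATES = {"complete", "permanent"}   # results can be fetched
WAITING_STATES = {"submitted", "pending"}   # poll the metadata again later


def next_step(query_meta: dict) -> str:
    """Return 'fetch', 'wait', or 'give_up' based on the query state."""
    state = query_meta["__state"]
    if state in RESULT_STATES:
        return "fetch"      # follow the __result URL in the metadata
    if state in WAITING_STATES:
        return "wait"
    if state == "failed":
        return "give_up"
    raise ValueError(f"unknown query state {state!r}")


if __name__ == "__main__":
    print(next_step({"__state": "pending"}))
```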
The type of the query determines the format of the results, as below:
A query created without any group
or intersect_condition
parameters and
without the sets_only
option is a selection query. The observations returned
by this query are those within the interval between the time_start
and
time_end
parameters which match the selection parameters.
[EDITOR'S NOTE: matching rules go here, describe condition wildcards.]
The result of a selection query is a JSON object, the fields of which are as follows:
Key | Value |
---|---|
prev | Link to previous page (see Pagination) |
next | Link to next page (see Pagination) |
obs | JSON array containing observations in OSF format |
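A query created without any group
or intersect_condition
parameters and
with the sets_only
option is a selection query that returns observation sets rather than observations. It selects the observations
within the interval between the time_start
and
time_end
parameters which match the selection parameters, and returns links to the observation sets containing them. The result is a JSON object, the fields of which are as follows: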
Key | Value |
---|---|
prev | Link to previous page (see Pagination) |
next | Link to next page (see Pagination) |
sets | JSON array containing links to observation sets containing observations answering the query |
NOTE: Condition set intersection queries are not yet supported by the PTO. This section documents future functionality.
A query created with one or more intersect_condition
parameters is a set
intersection query. First, observations matching the selection parameters are
selected. Then, the observations are grouped by paths, and only paths within
each set listed in intersect_condition
parameters are selected.
An intersect_condition
parameter is either a full condition name (i.e.,
without wildcards), or a full condition name prefixed with the character !
(urlencoded %21
). In the former case, the set is of all paths for which there
is at least one observation with the specified condition. In the latter case,
the set is negated, i.e. the set of all paths for which there are no
observations with the specified condition. For example, a query with two
intersect_condition
parameters with values foo.bar
and !foo.baz
will
select paths where there is at least one foo.bar
observation and no foo.baz
observations.
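Since this functionality is not yet available, the following is only a sketch of the semantics described above, not of any PTO implementation. It assumes observations have already been filtered by the selection parameters and are given as (path, condition) pairs; negated sets are taken over the paths that appear in those selected observations. The function name is hypothetical.

```python
def intersect_paths(observations, intersect_conditions):
    """Return the sorted paths satisfying every intersect_condition value."""
    # Group the conditions seen on each path.
    conditions_by_path = {}
    for path, condition in observations:
        conditions_by_path.setdefault(path, set()).add(condition)

    selected = []
    for path, seen in conditions_by_path.items():
        keep = True
        for spec in intersect_conditions:
            if spec.startswith("!"):
                keep = spec[1:] not in seen   # negated: condition never seen
            else:
                keep = spec in seen           # plain: condition seen at least once
            if not keep:
                break
        if keep:
            selected.append(path)
    return sorted(selected)


if __name__ == "__main__":
    obs = [("a * b", "foo.bar"), ("a * c", "foo.bar"),
           ("a * c", "foo.baz"), ("a * d", "foo.baz")]
    print(intersect_paths(obs, ["foo.bar", "!foo.baz"]))
```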
The result of a set intersection query is a JSON object, the fields of which are as follows:
Key | Value |
---|---|
prev | Link to previous page (see Pagination) |
next | Link to next page (see Pagination) |
paths | JSON array containing paths as strings |
A query created with one or more group
parameters is an aggregation query.
The results give the count of observations selected by the select
parameters for each group. The following values of the group
parameter are
available:
Value | Meaning |
---|---|
year | Count by year of time_start |
month | Count by year/month of time_start |
day | Count by year/month/day of time_start |
hour | Count by year/month/day/hour of time_start |
week | Count by year/week (starting Monday) of time_start |
week_day | Count by day of week of time_start (7 groups) |
day_hour | Count by hour of day of time_start (24 groups) |
condition | Count by condition |
feature | Count by feature (first component of condition) |
aspect | Count by aspect (all but last component of condition) |
value | Count by condition value |
source | Count by first element in path |
target | Count by last element in path |
The result of an aggregation query is a JSON object, the fields of which are as follows:
Key | Value |
---|---|
prev | Link to previous page (see Pagination) |
next | Link to next page (see Pagination) |
groups | List of JSON arrays containing count in final position, by group(s) |
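To illustrate the shape of the groups element, the sketch below counts observations by condition and by month and emits one array per group with the count in the final position. The representation of group keys (for example "2018-04" for a month) is an assumption made for this example and may differ from what the PTO returns; all function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def by_condition(obs):
    return obs["condition"]


def by_month(obs):
    return obs["time_start"].strftime("%Y-%m")   # year/month of time_start


def aggregate(observations, group_fns):
    """Count observations per combination of group keys."""
    counts = Counter(tuple(fn(o) for fn in group_fns) for o in observations)
    # One array per group: the group keys, then the count in final position.
    return [list(key) + [n] for key, n in sorted(counts.items())]


if __name__ == "__main__":
    obs = [
        {"time_start": datetime(2018, 4, 25, 10, 15), "condition": "pto.test.ok"},
        {"time_start": datetime(2018, 4, 26, 9, 0), "condition": "pto.test.ok"},
        {"time_start": datetime(2018, 4, 26, 9, 5), "condition": "pto.test.not_ok"},
    ]
    print(aggregate(obs, [by_condition, by_month]))
```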
[EDITOR'S NOTE: review me]
API resources which return lists of URLs or results in arrays in JSON objects
support pagination. By default, if more than 20 items will be returned in the
list around which a resource is centered, the results will be paginated: the
top-level JSON object will gain a next
key if the result is not the last page,
and a prev
key if the result is not the first page. Pagination is controlled
using the following GET parameters:
Parameter | Meaning |
---|---|
page | Page number, beginning with 0. Defaults to 0 |
Pagination is applied to the following elements on the following resources:
Resource | Element paginated | Pagination default |
---|---|---|
/raw/<c> | files | tbd |
/obs | sets | tbd |
/obs/by_metadata | sets | tbd |
/query | queries | tbd |
/query/<q>/result | results | tbd |
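A client can walk a paginated resource by following the next key until it is absent. The sketch below takes the HTTP fetch as a caller-supplied function so that it stays independent of any particular client library; iter_items and the fetch callable are hypothetical, and the page contents in the usage lines are made up.

```python
def iter_items(fetch, first_url, element):
    """Yield every item of a paginated element, following next links.

    fetch is any callable that takes a URL and returns the decoded
    top-level JSON object for that page.
    """
    url = first_url
    while url is not None:
        page = fetch(url)
        yield from page.get(element, [])
        url = page.get("next")   # absent on the last page


if __name__ == "__main__":
    # Two fake pages standing in for server responses.
    pages = {
        "page0": {"sets": ["obs/1", "obs/2"], "next": "page1"},
        "page1": {"sets": ["obs/3"], "prev": "page0"},
    }
    print(list(iter_items(pages.__getitem__, "page0", "sets")))
```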