Merge branch 'master' into develop
briandoconnor committed May 20, 2017
2 parents cfcc899 + e0a59a6 commit 0ef2085
Showing 11 changed files with 822 additions and 22 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
env
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "task-execution-schemas"]
path = task-execution-schemas
url = https://github.com/ga4gh/task-execution-schemas.git
9 changes: 9 additions & 0 deletions CWLFile
@@ -0,0 +1,9 @@
includes:
- class: Directory
location: proto
- class: Directory
location: task-execution-schemas/proto
proto:
class: File
location: proto/workflow_execution.proto
cwl:tool: protoc.cwl
16 changes: 16 additions & 0 deletions Definitions.md
@@ -0,0 +1,16 @@
# Definition of Terms

## TaskInstance
A *TaskInstance* represents a single command-line invocation by an underlying execution platform such as SGE, LSF, a process on a Unix machine, the Google Genomics Pipeline API, etc. In cases where a command line would not be an appropriate abstraction, a *TaskInstance* is the most atomic unit of work for that platform.

## Task
A *task* is an abstraction of a *TaskInstance*. A *task* provides metadata to describe provenance and execution but does not need to be directly runnable. Examples of such metadata include a Docker image and descriptions of inputs and outputs.

## Workflow
A *workflow* is a composition of one or more *tasks* with information on dependencies and order. A *workflow* will have rich enough metadata to be directly runnable.

## Tool
A *tool* describes a single useful unit of work, ranging from a workflow with a single *task* to one with an arbitrary number of *tasks*. The term *tool* is less a technical term than a concept for clients and end users, and is more generic than other terms in this document. It is recommended that *tool* not be used in any official GA4GH documentation; however, the term is so ubiquitous that we recognize others will be using it. Therefore we want to call special attention to how variable its meaning can be.

As an example, imagine a multi-*task* *workflow* for processing sequencing data through variant calling. The author of this *workflow* might view each individual *task* as a *tool* while the end user of that variant calling *workflow* would likely view the entire thing as the *tool*.

70 changes: 48 additions & 22 deletions README.md
@@ -1,31 +1,29 @@
![ga4gh logo](http://genomicsandhealth.org/files/logo_ga.png)

Schemas for the Workflow Execution API
======================================
Schemas for the Workflow Execution Service (WES) API
====================================================

This is used by the Data Working Group - Containers and Workflows Task Team

Test [swagger editor](http://editor.swagger.io/#/?import=https://raw.githubusercontent.com/ga4gh/workflow-execution-schemas/develop/src/main/resources/swagger/ga4gh-tool-discovery.yaml)
<img src="swagger_editor.png" width="48">[View in Swagger](http://editor.swagger.io/#/?import=https://raw.githubusercontent.com/ga4gh/workflow-execution-schemas/develop/swagger/proto/workflow_execution.swagger.json)

The [Global Alliance for Genomics and Health](http://genomicsandhealth.org/) is an international
coalition, formed to enable the sharing of genomic and clinical data.

Containers and Workflows Task Team
----------------------------------

The [Data Working Group](http://ga4gh.org/#/) concentrates on data representation, storage,
and analysis, including working with platform development partners and
industry leaders to develop standards that will facilitate
interoperability.

Containers and Workflows Task Team
----------------------------------
interoperability. The Containers & Workflows working group is an informal, multi-vendor working group focused on standards for exchanging Docker-based tools and CWL/WDL workflows, for executing Docker-based tools and workflows on clouds, and for abstract access to cloud object stores.

What is WES?
============
The Containers & Workflows working group is an informal, multi-vendor working group born out of the BOSC 2014 codefest, consisting of various organizations and individuals that have an interest in the portability of data analysis workflows. Our goal is to create specifications that enable data scientists to describe analysis tools and workflows that are powerful, easy to use, and portable, and that support reproducibility for a variety of problem areas, including data-intensive science like bioinformatics, physics, and astronomy, and business analytics such as log analysis, data mining, and ETL.

From within this group, two approaches have emerged, resulting in the production of two distinct but complementary specifications: the Common Workflow Language, or CWL, and the Workflow Description Language, or WDL. The CWL approach emphasizes execution features and machine-readability, and serves a core target audience of software and platform developers. The WDL approach, on the other hand, emphasizes scripting and human-readability, and serves a core target audience of research scientists.

Together, these two specifications cover a very wide spectrum of analysis use cases. Work is underway to ensure interoperability through conversion and related utilities.

What is this?
------------
-------------

Currently, this is the home of the Workflow Execution API proposal. The Workflow Execution API is a minimal common API describing how a user can submit workflow requests to workflow execution systems in a standardized way.
Workflow execution engines (SevenBridges, FireCloud, etc.) can support this API so users can make workflow requests
@@ -43,36 +41,64 @@ providers, just an example of how that would work if they did.
Key features of the current API proposal:

* ability to request a workflow run using CWL or WDL (and maybe future formats)
* ability to parameterize that workflow using a JSON schema that's simple and used in common between CWL and WDL
* ability to parameterize that workflow using a JSON schema (ideally a future version would be in common between CWL and WDL)
* ability to get information about running workflows: status, errors, output file locations, etc.
* ability to search for workflows by arbitrary key/values
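
As a rough sketch of the request shape (the endpoint path and field names come from `proto/workflow_execution.proto`; the host, input locations, and parameter values here are hypothetical):

    WES=https://wes.example.org/ga4gh/wes/v1   # hypothetical deployment

    # Submit a run; the JSON body mirrors the workflow_request message
    curl -X POST "${WES}/workflows" \
      -H 'Content-Type: application/json' \
      -d '{
            "workflow_descriptor": "<inline CWL or WDL document>",
            "workflow_params": "{\"input_bam\": \"s3://my-bucket/reads.bam\"}",
            "workflow_type": "CWL",
            "workflow_type_version": "v1.0",
            "key_values": {"project": "demo"}
          }'
    # On success the response is a workflow_run_id, e.g. {"workflow_id": "<id>"}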

Outstanding questions:

* JSON parameterization format, see work by Peter, is that checked in?
* a common JSON parameterization format, see work by Peter, is that checked in?
* standardizing terms, job, workflow, steps, tools, etc
* reference implementation, Peter pointed out https://github.com/common-workflow-language/cwltool-service/tree/ga4gh-wes
* validation service for testing WES implementations' conformance to the spec

How to view
------------

See the swagger editor to view our [schema in progress](http://editor.swagger.io/#/?import=https://raw.githubusercontent.com/ga4gh/tool-registry-schemas/develop/src/main/resources/swagger/ga4gh-tool-discovery.yaml).
See the swagger editor to view our [schema in progress](http://editor.swagger.io/#/?import=https://raw.githubusercontent.com/ga4gh/workflow-execution-schemas/develop/src/main/resources/swagger/ga4gh-tool-discovery.yaml).

If the current schema fails to validate, visit [debugging](http://online.swagger.io/validator/debug?url=https://raw.githubusercontent.com/ga4gh/workflow-execution-schemas/develop/src/main/resources/swagger/ga4gh-tool-discovery.yaml)

If the current schema fails to validate, visit [debugging](http://online.swagger.io/validator/debug?url=https://raw.githubusercontent.com/ga4gh/tool-registry-schemas/develop/src/main/resources/swagger/ga4gh-tool-discovery.yaml)
Building Documents
------------------

Make sure you have Docker installed for your platform, as well as `cwltool`. The following sets up `cwltool` and its pinned dependencies in a virtualenv:

virtualenv env
source env/bin/activate
pip install setuptools==28.8.0
pip install cwl-runner cwltool==1.0.20161114152756 schema-salad==1.18.20161005190847 avro==1.8.1

Make sure you have the [submodule](http://stackoverflow.com/questions/3939055/submodules-files-are-not-checked-out) checked out:

git submodule update --init --recursive

You can generate the [Swagger](http://swagger.io/) YAML from the Protocol Buffers:

cwltool CWLFile

The output is written to `workflow_execution.swagger.json`, which can be loaded in the [Swagger editor](http://swagger.io/swagger-editor/). Use the GitHub raw feature to generate a URL you can load.

When you're happy with the changes, check this file in:

mv workflow_execution.swagger.json swagger/proto/

And commit your changes.

How to contribute changes
-------------------------

Take cues for now from the [ga4gh/schemas](https://github.com/ga4gh/schemas/blob/master/CONTRIBUTING.rst) document.

We like [HubFlow](https://datasift.github.io/gitflow/), and we use pull requests to suggest changes.

License
-------

See the [LICENSE]

[]: http://genomicsandhealth.org/files/logo_ga.png
[Global Alliance for Genomics and Health]: http://genomicsandhealth.org/
[INSTALL.md]: INSTALL.md
[CONTRIBUTING.md]: CONTRIBUTING.md
[LICENSE]: LICENSE
[Google Forum]: https://groups.google.com/forum/#!forum/ga4gh-dwg-containers-workflows
More Information
----------------

* [Global Alliance for Genomics and Health](http://genomicsandhealth.org)
* [Google Forum](https://groups.google.com/forum/#!forum/ga4gh-dwg-containers-workflows)
214 changes: 214 additions & 0 deletions proto/workflow_execution.proto
@@ -0,0 +1,214 @@

syntax = "proto3";

package ga4gh_wes_;

// Import HTTP RESTful annotations
import "google/api/annotations.proto";
import "google/protobuf/struct.proto";
import "task_execution.proto";

//Enum for states
enum state {
Unknown = 0;
Queued = 1;
Running = 2;
Paused = 3;
Complete = 4;
Error = 5;
SystemError = 6;
Canceled = 7;
Initializing = 8;
}

//Enum for parameter types
enum parameter_types {
Directory = 0;
File = 1;
Parameter = 2;
}

//Log and other info
message log {
//The task or workflow name
string name = 1;
//The command line that was run
repeated string cmd = 2;
//When the command was executed
string startTime = 3;
//When the command completed
string endTime = 4;
//Sample of stdout (not guaranteed to be entire log)
string stdout = 5;
//Sample of stderr (not guaranteed to be entire log)
string stderr = 6;
//Exit code of the program
int32 exitCode = 7;
}

//Parameters for workflows or tasks. These are either plain parameters, files, or directories; the latter two are staged to an object store or something similar for handing back to the caller.
message parameter {
//REQUIRED
//name of the parameter
string name = 1;
//OPTIONAL
//Value
string value = 2;
//REQUIRED
//location in long-term storage; a URL specific to the implementing
//system. For example s3://my-object-store/file1 or gs://my-bucket/file2 or
//file:///path/to/my/file
string location = 3;
//REQUIRED
//Type of data: "Parameter", "File", or "Directory".
//If used for an output, all the files in the directory
//will be copied to the storage location.
parameter_types type = 4;
}

//Return envelope for workflow listing
message workflow_list_response {
repeated workflow_desc workflows = 1;
string next_page_token = 2;
}

//Request a listing of workflows tracked by the server
message workflow_list_request {
//OPTIONAL
//Number of workflows to return at once. Defaults to 256, and max is 2048.
uint32 page_size = 1;
//OPTIONAL
//Token to use to indicate where to start getting results. If unspecified, returns the first page of results.
string page_token = 2;
//OPTIONAL
//For each key, if the key's value is an empty string, then match workflows that are tagged with this key regardless of value
string key_value_search = 3;
}

//Small description of workflows, returned by server during listing
message workflow_desc {
//REQUIRED
string workflow_id = 1;
//REQUIRED
state state = 2;
}

//workflow request object
message workflow_request {
//REQUIRED
//the workflow CWL or WDL document
string workflow_descriptor = 1;
//REQUIRED
//the workflow parameterization document (typically a JSON file); includes all parameterizations for the workflow, including input and output file locations
string workflow_params = 2;
//REQUIRED
//the workflow descriptor type; currently this must be "CWL" or "WDL" (or another alternative supported by this WES instance)
string workflow_type = 3;
//REQUIRED
//the workflow descriptor type version, must be one supported by this WES instance
string workflow_type_version = 4;
//OPTIONAL
//a key-value map of arbitrary metadata outside the scope of the workflow_params but useful to track with this workflow request
map<string,string> key_values = 5;
}

//Blank request message for service request
message service_info_request {}

//available workflow types supported by this WES
message workflow_type_version {
//an array of one or more version strings
repeated string workflow_type_version = 1;
}


//Information about the service and its capabilities
message service_info {
//A map whose keys are the workflow format type names (currently only CWL and WDL are used, although a service may support others) and whose values are workflow_type_version objects, each simply containing an array of one or more version strings
map<string, workflow_type_version> workflow_type_versions = 1;
//The version(s) of the WES schema supported by this service
repeated string supported_wes_versions = 2;
//The filesystem protocols supported by this service; currently these may include common protocols such as 'http', 'https', 'sftp', 's3', 'gs', 'file', or 'synapse', as well as others supported by this service.
repeated string supported_filesystem_protocols = 3;
//The engine(s) used by this WES service; the key is the engine name (e.g. Cromwell) and the value is its version
map<string,string> engine_versions = 4;
//The system statistics; the key is a workflow state and the value is the count of workflows in that state. See the state enum for the possible keys.
map<string,uint32> system_state_counts = 5;
//a key-value map of arbitrary, extended metadata outside the scope of the above but useful to report back
map<string,string> key_values = 6;
}

message workflow_run_id {
//workflow ID
string workflow_id = 1;
}

message workflow_status {
//workflow ID
string workflow_id = 1;
//state
state state = 2;
}

message workflow_log {
//workflow ID
string workflow_id = 1;
//the original request object
workflow_request request = 2;
//state
state state = 3;
//the logs, and other key info like timing and exit code, for the overall run of this workflow
log workflow_log = 4;
//the logs, and other key info like timing and exit code, for each step in the workflow
repeated log task_logs = 5;
//the outputs
repeated parameter outputs = 6;
}

//Web service to get, create, list and delete Workflows
service WorkflowExecutionService {

//Get information about the Workflow Execution Service. May include information related (but not limited) to the workflow descriptor formats and versions supported, the WES API versions supported, and general service availability.
rpc GetServiceInfo(service_info_request) returns (service_info) {
option (google.api.http) = {
get: "/ga4gh/wes/v1/service-info"
};
}

//Run a workflow. This endpoint allows you to create a new workflow request and retrieve its tracking ID to monitor its progress. An important assumption in this endpoint is that the workflow_params JSON will include parameterizations along with input and output files. The latter two may be on S3, Google object storage, local filesystems, etc. This specification makes no distinction. However, it is assumed that the submitter is using URLs that this system both understands and can access. For Amazon S3, this could be accomplished by giving the credentials associated with a WES service access to a particular bucket. The details are important for a production system and user on-boarding but are outside the scope of this spec.
rpc RunWorkflow(workflow_request) returns (workflow_run_id) {
option (google.api.http) = {
post: "/ga4gh/wes/v1/workflows"
body: "*"
};
}

//Get quick status info about a running workflow
rpc GetWorkflowStatus(workflow_run_id) returns (workflow_status) {
option (google.api.http) = {
get: "/ga4gh/wes/v1/workflows/{workflow_id}/status"
};
}

//Get detailed info about a running workflow
rpc GetWorkflowLog(workflow_run_id) returns (workflow_log) {
option (google.api.http) = {
get: "/ga4gh/wes/v1/workflows/{workflow_id}"
};
}

//Cancel a running workflow
rpc CancelJob(workflow_run_id) returns (workflow_run_id) {
option (google.api.http) = {
delete: "/ga4gh/wes/v1/workflows/{workflow_id}"
};
}

//List the workflows. This endpoint lists the workflows in order of oldest to newest. There is no guarantee of live updates as the user traverses the pages; the behavior should be decided (and documented) by each implementation.
rpc ListWorkflows(workflow_list_request) returns (workflow_list_response) {
option (google.api.http) = {
get: "/ga4gh/wes/v1/workflows"
};
}

}
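
A minimal sketch of how the google.api.http annotations above map onto plain HTTP calls (the host and workflow ID are hypothetical; with grpc-gateway, non-body request fields such as page_size and page_token become query parameters):

    WES=https://wes.example.org/ga4gh/wes/v1     # hypothetical deployment

    curl "${WES}/service-info"                   # GetServiceInfo    -> service_info
    curl "${WES}/workflows?page_size=10"         # ListWorkflows     -> workflow_list_response
    curl "${WES}/workflows/abc123/status"        # GetWorkflowStatus -> workflow_status
    curl "${WES}/workflows/abc123"               # GetWorkflowLog    -> workflow_log
    curl -X DELETE "${WES}/workflows/abc123"     # CancelJob         -> workflow_run_id
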
31 changes: 31 additions & 0 deletions protoc.Dockerfile
@@ -0,0 +1,31 @@
FROM debian:8
ENV DEBIAN_FRONTEND noninteractive

RUN apt-get clean && \
apt-get update && \
apt-get -yq --no-install-recommends -o Acquire::Retries=6 install \
curl unzip ca-certificates git && \
apt-get clean

ENV PROTOC_VER=3.2.0rc2

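# Fetch the prebuilt protoc release binary and make its bundled includes world-readable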
RUN cd /usr/local && \
curl -OL https://github.com/google/protobuf/releases/download/v${PROTOC_VER}/protoc-${PROTOC_VER}-linux-x86_64.zip && \
unzip protoc-${PROTOC_VER}-linux-x86_64.zip && \
chmod o+r -R include/google && \
chmod o+x $(find include/google -type d) && \
rm protoc-${PROTOC_VER}-linux-x86_64.zip

ENV GOVERSION=1.7.5

# Install golang binary
RUN cd /usr/local && \
curl -O http://storage.googleapis.com/golang/go${GOVERSION}.linux-amd64.tar.gz && \
tar -xzf go${GOVERSION}.linux-amd64.tar.gz && \
rm go${GOVERSION}.linux-amd64.tar.gz

ENV PATH ${PATH}:/usr/local/go/bin
ENV GOPATH=/usr/local

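# Install the protoc plugins for Swagger (via grpc-gateway) and Go code generation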
RUN go get -u github.com/grpc-ecosystem/grpc-gateway/protoc-gen-swagger
RUN go get -u github.com/golang/protobuf/protoc-gen-go
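
This image presumably backs the `protoc.cwl` tool referenced from `CWLFile`; as a sketch (the image tag is arbitrary), it can be built and smoke-tested directly:

    docker build -t wes-protoc -f protoc.Dockerfile .
    # Verify protoc is on the image's PATH
    docker run --rm wes-protoc protoc --version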
