Data Access Credentials #227
base: develop
Conversation
1. Modified the `workflow_params` description to indicate it's required unless `workflow_unified_params` is provided
2. Added the `workflow_unified_params` field with the hybrid structure we discussed
3. Enhanced file metadata support with optional fields for size, checksum, secondary files, format, and modification time
4. Added comprehensive validation constraints for different parameter types
5. Validated the schema - no OpenAPI validation issues detected

Key Features of the Implementation:
- Version field for format evolution (default: 1.0)
- Rich file metadata (size, checksum, secondary_files, format, last_modified)
- Comprehensive validation constraints (min/max, length, pattern, enum, array limits; see the sketch below)
- Type-safe parameter definitions with clear enums
- Backward compatibility - existing workflow_params still works
- Precedence handling - workflow_unified_params takes precedence when provided
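For illustration, a parameter entry carrying validation constraints might look roughly like this (a sketch only; the `constraints` wrapper and its key names are assumptions based on the feature list above, not the normative schema):

```
# Sketch only: the "constraints" key names are assumptions
{
  "version": "1.0",
  "parameters": {
    "read_length": {
      "type": "Integer",
      "value": 150,
      "constraints": { "min": 50, "max": 300 }
    },
    "library_strategy": {
      "type": "String",
      "value": "WGS",
      "constraints": { "enum": ["WGS", "WXS", "RNA-Seq"] }
    },
    "lanes": {
      "type": "Array",
      "value": ["L001", "L002"],
      "constraints": { "min_items": 1, "max_items": 8 }
    }
  }
}
```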
Key Improvements Made:
1. Dual Content Type Support
- application/json (Recommended): Uses the proper RunRequest model object
- multipart/form-data (Legacy): Maintains backward compatibility for file uploads
2. Proper Model Usage
- JSON requests now use $ref: '#/components/schemas/RunRequest'
- Leverages all the rich typing and validation from the RunRequest schema
- Supports both workflow_params and workflow_unified_params
3. Enhanced Documentation
- Clear guidance on when to use each content type
- Explains file handling differences between formats
- Documents the new unified parameter format
- Security considerations for file uploads
4. Better Developer Experience
- OpenAPI tooling can generate proper client code for JSON requests
- Type safety with structured objects instead of string parsing
- Validation happens automatically with the model schema
- Consistency across the API
Usage Examples:
Preferred JSON format:
```
POST /runs
Content-Type: application/json

{
  "workflow_type": "CWL",
  "workflow_type_version": "v1.0",
  "workflow_url": "https://example.com/workflow.cwl",
  "workflow_unified_params": {
    "version": "1.0",
    "parameters": {
      "input_file": {
        "type": "File",
        "value": "gs://bucket/input.fastq",
        "file_metadata": {
          "size": 1073741824,
          "checksum": "sha256:abc123..."
        }
      }
    }
  }
}
```
Legacy multipart format (when file uploads needed):
```
POST /runs
Content-Type: multipart/form-data

workflow_type: CWL
workflow_unified_params: {"version":"1.0","parameters":{...}}
workflow_attachment: [binary file data]
```
1. Updated RunLog schema - Added structured_outputs field alongside the existing outputs
2. Added WorkflowOutputs schema - Main container for structured outputs with version and metadata
3. Added OutputObject schema - Flexible output type supporting Files, Directories, Arrays, and primitives
4. Added documentation tags - Both schemas appear in the Models section of the API docs

Key Features Implemented:

WorkflowOutputs Schema:
- Version field for format evolution
- Named outputs with rich metadata
- Workflow-level metadata (execution ID, timing, resource usage)
- Provenance tracking (engine, version, status)

OutputObject Schema:
- Type system - File, Directory, Array, String, Integer, Float, Boolean
- File metadata - location, size, checksum, format, basename (see the sketch below)
- Provenance - source task, command, creation time
- Secondary files - Associated files like indexes
- Array support - Collections of outputs
- Content embedding - Small file contents can be included

Backward Compatibility:
- Existing outputs field remains unchanged (marked as "legacy format")
- structured_outputs is optional - implementations can provide either or both
- No breaking changes to existing API consumers
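As an illustration of the shape described above, a `structured_outputs` value might look roughly like this (a sketch; the nesting of `outputs` and the exact field arrangement are assumptions drawn from the feature list, not the normative schema):

```
# Sketch only: exact nesting and field arrangement are assumptions
{
  "structured_outputs": {
    "version": "1.0",
    "outputs": {
      "aligned_bam": {
        "type": "File",
        "location": "gs://bucket/results/sample1.bam",
        "basename": "sample1.bam",
        "size": 5368709120,
        "checksum": "sha256:def456...",
        "format": "BAM",
        "secondary_files": [
          { "type": "File", "location": "gs://bucket/results/sample1.bam.bai" }
        ]
      },
      "mean_coverage": { "type": "Float", "value": 31.7 }
    }
  }
}
```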
…ondary workflow URLs beyond the primary workflow
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…low-execution-service-schemas into feature/issue-176-wes-params
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… in WES (except /service-info)
Pull Request Overview
This PR enhances the Workflow Execution Service (WES) OpenAPI specification to version 1.2.0, adding GA4GH Passport authentication support and data access credential management capabilities. The changes maintain backward compatibility while enabling workflow engines to securely access input data and provide credentials for output access.
- Adds POST endpoints with GA4GH Passport authentication support alongside existing bearer token authentication
- Introduces structured credential management for input and output data access via DRS and other services
- Implements unified parameter format for workflow language-agnostic parameter passing
- Expands workflow engine support beyond CWL/WDL to include Nextflow and Snakemake
Comments suppressed due to low confidence (1)
openapi/workflow_execution_service.openapi.yaml:7
- The updated logo URL path may not exist. The original path used 'ga4gh-theme' while the new path uses 'ga4gh/dist/assets/svg/logos/'. Verify this URL is accessible and correct.
url: 'https://www.ga4gh.org/wp-content/themes/ga4gh/dist/assets/svg/logos/logo-full-color.svg'
```yaml
x-in: body
bearerFormat: JWT
description:
  A valid GA4GH Passport must be passed in the body of an HTTP POST request as a passports[] array.
```
Copilot AI (Jul 31, 2025)
The 'x-in: body' extension in PassportAuth security scheme is non-standard OpenAPI. Standard security schemes don't support body-based authentication. This could cause issues with code generation tools and API documentation renderers.
```diff
- x-in: body
  bearerFormat: JWT
  description:
-   A valid GA4GH Passport must be passed in the body of an HTTP POST request as a passports[] array.
+   A valid GA4GH Passport must be included in the request body as a passports[] array for relevant endpoints.
```
```yaml
        OPTIONAL
        The workflow run parameterizations (JSON encoded), including input and output file locations.
        Either `workflow_params` or `workflow_unified_params` must be provided; `workflow_unified_params` takes precedence.
    workflow_unified_params:
      type: object
      description: >-
        OPTIONAL
        The workflow run parameterizations (JSON encoded), including input and output file locations.
        Unified parameter format that can be converted to workflow-language-specific format.
        If provided, takes precedence over workflow_params. WES implementations should
        convert these to the appropriate native format for the specified workflow_type.
```
Copilot AI (Jul 31, 2025)
The workflow_params field description states it's 'OPTIONAL' but then says 'Either workflow_params or workflow_unified_params must be provided'. This creates confusion about whether the field is truly optional. Consider clarifying that one of the two parameter fields is required.
```yaml
type: array
items:
  type: string
description: an array of one or more acceptable types for the `workflow_type`, must be "CWL", "WDL", "Nextflow", or "Snakemake" currently (or another alternative supported by this WES instance, see service-info)
```
Copilot AI (Jul 31, 2025)
There's inconsistent capitalization for 'Snakemake'. In line 13 it's written as 'Snakemake', but in line 1380 and elsewhere it appears as 'SnakeMake'. The correct spelling is 'Snakemake' (not 'SnakeMake').
```yaml
required: false
description: >-
  Files to be staged for workflow execution. You set the filename/path using the Content-Disposition header in the multipart form submission. For example 'Content-Disposition: form-data; name="workflow_attachment"; filename="workflows/helper.wdl"'. The files are staged to a temporary directory and can be referenced by relative path in the workflow_url.
required:
```
Copilot AI (Jul 31, 2025)
The 'required' array for multipart/form-data schema only includes workflow_type, workflow_type_version, and workflow_url, but doesn't include the passport field when using passport authentication. This could allow requests without proper authentication to be considered valid.
```diff
  required:
+   - passport
```
Thanks a lot for this effort, @briandoconnor! But I need to be honest here: I always had a strong dislike for the pattern of passing Passports around in request bodies.

Apart from the security concerns which you already mentioned (sooner or later, someone will end up logging the request - or, in this case, also the response - bodies), I think reserving POST for what are effectively read operations breaks REST semantics.

I guess there are cases where breaking REST semantics (and implications following from that, e.g., with respect to caching and idempotency) is an acceptable tradeoff. But in the case of passing around Passports, it just seems like a bad design decision on top of a bad design decision to me: Ever since the April Connect in Switzerland, it was abundantly discussed and, IIRC, predominantly agreed, that a generic mechanism by which assertions/visas could be provided back to a clearing house upon presentation of a small Passport would constitute a cleaner design for Passports themselves, as well as for any APIs consuming them. Have sentiments changed since? If so, do you know what's the basis for the change of heart?

Besides, I think if credentials need to be passed around in the cloud, the state of the art seems to be the use of secret stores. In particular, HashiCorp Vault's API is open sourced and is a de facto industry standard. Wouldn't it be better to explore an approach like that, which could be reused for all such cases (data access, compute access, tool access credentials in DRS, TRS, WES, TES etc.), rather than complicating schemas and duplicating endpoints across a whole bunch of API products individually?
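For concreteness, the secret-store pattern suggested here could look roughly like the following against HashiCorp Vault's KV v2 HTTP API (a sketch; the `secret` mount and the `wes/run-123` path are made-up placeholders):

```
# Client stores a short-lived data-access credential by reference
# (sketch; the mount and path are placeholders)
POST /v1/secret/data/wes/run-123
X-Vault-Token: <client token>

{ "data": { "drs_access_token": "eyJhbGci..." } }

# The WES server/engine later resolves the reference instead of
# receiving the raw credential in the run request body
GET /v1/secret/data/wes/run-123
X-Vault-Token: <engine token>
```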
Sorry for interfering, but I don't think it is a wise decision to adopt a "de facto industry standard" that is owned by a company for an open standard like the GA4GH standards are. There are numerous cases where such companies arbitrarily changed their license (Akka, Anaconda, and you can probably find more) and suddenly users had to adapt. A standard should be grounded on a stronger foundation than the good will of a company. Sure, in both mentioned examples there were workarounds (Akka->Pekko, Anaconda->Mamba & Co.), but that does not mean they were trivial (the Anaconda licensing issue forced us to do lots of tests to ensure our workflows were backwards compatible). While such a change may be trivial for a standard -- just write "use the new fork of the old tool" -- the situation may be very different for the implementers and the operating institutions. The HashiCorp Vault license says
This sounds vague and arbitrary. If HashiCorp decides to make their money in a different way that overlaps with the use of their tool in GA4GH, we have to change the standard. This is the current license. So I interpret that as meaning that even for the current Vault version a normal fork, like with Akka->Pekko, would not be a legally solid option. I'm not a lawyer, sure, so I could be completely wrong.
You are not interfering, @vinjana - every voice is important! I am not generally disagreeing with your important point, but I would like to clarify two points:
I would also like to note that there were previous discussions within the GA4GH community around potentially recommending existing standards in favor of drafting new ones, in order to avoid this situation. Personally, I think that makes a lot of sense, especially in areas that are not intrinsically related to life sciences, genomics, healthcare etc. AFAIK, no official decision has been made or policy drafted to define rules around recommending external standards, e.g., with respect to the licensing terms, governance, revising recommendations etc. In this regard, the Vault API may be as good an example as we might get (at least in the Cloud WS) as a driver for such a discussion. FWIW, I wholeheartedly agree with you that we should probably not recommend an API spec or other standard that is not published under an OSI-approved license.
patmagee left a comment
I can understand the challenge that this PR is trying to solve; however, I fall in line with @uniqueg's opinion here. I think breaking REST semantics is essentially a bad idea that will make the API harder to implement and also much less intuitive for the end user. In particular, I think making endpoints polymorphic makes writing clients in certain languages harder and also makes machine interoperability harder, IMO. If we really want to support POST then we may want to propose a totally separate RPC-based API where these types of semantics are more accepted.
Additionally, if Passports are simply not usable within the context of a RESTful API service, it feels like we should be rethinking Passports (similar to what @uniqueg mentioned as well), instead of propagating a bad design paradigm. Internally we use token exchange extensively, which dramatically reduces the size of tokens that we pass around. I know the clouds have essentially the same thing; they typically call it something like "Workload Identity Federation". These are largely using token exchange under the hood.

These are not vendor-specific solutions, and I would highly urge us to evaluate standards that are being developed by the broader software industry as they become available. Passport was initially created before things like token exchange were adopted, but maybe we should revisit it (not necessarily to use, just for inspiration).
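For reference, the token exchange mentioned above is standardized as OAuth 2.0 Token Exchange (RFC 8693): a trusted client POSTs a small subject token to the authorization server's token endpoint and receives a narrowly scoped token for the target service. A minimal sketch (the endpoint, audience, and token values are placeholders):

```
POST /token
Content-Type: application/x-www-form-urlencoded

grant_type=urn:ietf:params:oauth:grant-type:token-exchange
&subject_token=eyJhbGci...
&subject_token_type=urn:ietf:params:oauth:token-type:access_token
&audience=https://drs.example.org
&requested_token_type=urn:ietf:params:oauth:token-type:access_token
```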
Fundamentally, this PR is trying to grant the engine access to data from a federated source using a Passport or some other access credential. I think that is a hard problem, but maybe for a different reason.

Unlike TES, which is low level and can actually define how a single task interacts with underlying data (i.e., it can write the command to use the credential), WES is a high-level abstraction that in almost all cases sits in front of an engine. Any functionality we specify for WES needs to fall into one of two categories of support:
- Something the engines are willing to adopt
- Something the WES Server can do itself before handing off to the WES engine.
Data access is a hard one. I have done it in a similar way to what is proposed here in a WES service. Essentially you end up needing to copy the data to a bucket from within your server, or have some other workflow / batch job do it. While this is technically doable (I did it), it was slow, error prone, and caused data duplication. I could not read the data in place once in the workflow.

On the flip side, we could not actually do this at the engine level, because none of the engines supported this type of resolution and we are not an engine author ourselves. Additionally, most engines (at least for WDL) do not have a way of handling secrets, which basically means passing sensitive information into the workflow itself should be avoided.

I don't think this means we cannot do it. BUT I do think we are often overlooking the key element here... the engine. A lot of WES requires the feature set provided by the engines.
```yaml
structured_outputs:
  $ref: '#/components/schemas/WorkflowOutputs'
  description: Structured workflow outputs with rich metadata and type safety
output_credentials:
```
I do not think we should be embedding output credentials in the outputs; ideally we should NEVER have to return credentials, but at most use something like a signed URL.
There are a few concerns here
- Much less secure and exposes access credentials. I think returning DRS IDs and then having the caller call each individually is moderately better, but even that pattern is not ideal
- The engine may not be able to generate output credentials
- Output credentials can be difficult to produce or to lock down in scope. I.e., a passport, bearer token, or API key could allow the user to access much more than JUST the outputs of the current run. A lot of DRS APIs return signed URLs, but these would be expensive to produce for a large workflow. (A sketch of the signed-URL alternative follows below.)
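To make that alternative concrete, an output entry could carry a DRS ID plus a short-lived signed URL instead of a raw credential (a sketch; the field names are illustrative, not from this PR's schema):

```
# Sketch only: illustrative field names
{
  "outputs": {
    "aligned_bam": {
      "type": "File",
      "location": "drs://drs.example.org/314159",
      "signed_url": "https://storage.example.org/results/sample1.bam?expires=900&sig=...",
      "signed_url_expires": "2025-07-31T12:15:00Z"
    }
  }
}
```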
The more I think about the problem of Passport credentials, the more I realize we are actually on a very troubling path that is taking us away from established security best practices and current trends in the software industry. Putting aside the challenges of overloading endpoints and breaking RESTful semantics, using a POST body to send credentials is problematic in the general case.
Additionally, the industry is largely going passwordless or putting a strong emphasis on platform-provided credentials. There is a broad recognition that passing around ANY credential is typically a weak link in the system, so the best way to mitigate that is to simply not provide the raw credential to the user. Many clouds will now provide credentials to their VMs without you having to mint them yourself, allowing the user to assign permissions to the VM's identity instead of passing in a particular credential. Additionally, most clouds have rolled out Workload Identity Federation, allowing you to do OAuth2 token exchange, dramatically reducing the attack surface by only passing around a single credential that can be exchanged by a trusted client for the credentials needed to access other resources. Introducing a new approach where we are running in the opposite direction to this feels like it has the potential for some serious mixups.

I also think there is a general problem in how we are representing Passports. The credentials that are being passed in are meant for OTHER services and are not meant for the WES service. This introduces a variety of challenges:
I think we need a broader discussion, including security experts, about how we should implement Passports. My gut feeling is that the current trajectory is taking us farther and farther away from established best practices, and I would strongly advocate against deviating too far from them.
Overview
This pull request updates the Workflow Execution Service (WES) OpenAPI specification (starting with #226) to enhance functionality in several key ways without creating breaking changes (hopefully). Key changes include 1) adding POST endpoints to support authentication/authorization with a GA4GH Passport rather than a bearer token, 2) providing credentials for a WES server to access DRS (or other) inputs, and 3) providing back credentials so the output of the workflow run is accessible. Claude Code was used to generate some of these changes.
Related issues
- `workflow_params_drs_headers` field to `RunWorkflow` endpoint #131

Related Standards
This implementation aligns with:
- data access patterns
- for backward compatibility
Related PR
The feature branch behind this PR is based on `feature/issue-176-wes-params` and not `develop`, since I was building off of the structured input/output. See #226.

eLwazi-hosted GA4GH Hackathon
The eLwazi-hosted GA4GH hackathon (7/28-8/1) is working on this issue given the need by various groups attending the session. For more info, see the agenda.
Built Documentation
The human-readable documentation: https://ga4gh.github.io/workflow-execution-service-schemas/preview/feature/issue-18-data-access-creds/docs/index.html
Issues/questions for discussion
- Should we have a `POST /runs/list` endpoint instead of overloading `POST /runs` based on what request body schema is used?
- `not supported`... useful for the POST endpoints on existing paths.
Authentication/Authorization for WES Requests
I followed the same pattern here as used in DRS, making POST methods for existing endpoints that now support including a passport array in the body. These mirror the existing GET endpoints that use a bearer token.
Making Requests
Here's an example showing a bearer token request vs. the Passport in body request:
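A minimal sketch of the two styles (token values are placeholders):

```
# Existing style: bearer token in the Authorization header
GET /runs
Authorization: Bearer eyJhbGci...

# New style in this PR: Passport JWTs in a passports[] array in the body
# (here, a RunListRequest used to list runs)
POST /runs
Content-Type: application/json

{ "passports": ["eyJhbGci..."] }
```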
POST and /runs endpoint
This approach makes the `/runs` endpoint complicated. It feels overloaded since it's doing multiple things:

- `GET /runs` with a bearer token: you get a list of runs
- `POST /runs` with a bearer token (or Passport in the body) and either an `application/json` body schema of type `PassportRunRequest` or `multipart/form-data`: you get a new workflow run
- `POST /runs` with a Passport in the body using an `application/json` body schema of type `RunListRequest`: you get a list of runs

So it might be more clear if we have a `POST /runs/list` endpoint instead... so `/runs` is less overloaded.

Sending Credentials
In addition to adding support for Passports in WES requests (as an alternative to bearer tokens), this PR includes syntax extensions to allow for providing credentials to access inputs and pass back credentials so callers can access outputs.
Data Inputs
Here's how a client would structure a workflow submission:
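A sketch of what such a submission could look like (the `input_credentials` field name and shape are hypothetical placeholders for whatever the schema defines, not the PR's normative syntax):

```
# Sketch: "input_credentials" is a hypothetical illustration
POST /runs
Content-Type: application/json

{
  "workflow_type": "CWL",
  "workflow_type_version": "v1.0",
  "workflow_url": "https://example.com/workflow.cwl",
  "workflow_unified_params": {
    "version": "1.0",
    "parameters": {
      "input_file": { "type": "File", "value": "drs://drs.example.org/314159" }
    }
  },
  "input_credentials": {
    "drs.example.org": { "type": "bearer_token", "token": "eyJhbGci..." }
  }
}
```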
Data Outputs
Now GetRunLog responses can include credentials for accessing outputs:
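A sketch of such a response (the `output_credentials` field appears in this PR's diff above; its internal shape here is assumed):

```
# Sketch: the internal shape of "output_credentials" is assumed
{
  "run_id": "run-123",
  "state": "COMPLETE",
  "structured_outputs": { "version": "1.0", "outputs": { } },
  "output_credentials": {
    "type": "signed_urls",
    "credentials": {
      "aligned_bam": "https://storage.example.org/results/sample1.bam?sig=..."
    }
  }
}
```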
Security Considerations
access the outputs. We need to really think about how this works for Passports, since typically these are issued through an OIDC flow and not just handed around. API keys and bearer tokens are a little easier, since it would be possible to issue temporary/limited-scope tokens.
Implementation Notes
- passport credentials
- failures.