Patches jra001k 20181227 #95

@ghost commented Dec 27, 2018

jmchilton and others added 30 commits November 15, 2018 17:50
We don't track workflow step inputs in any formal way in our model currently. This has resulted in some current hacks and prevents future enhancements. This commit splits WorkflowStepConnection into two models WorkflowStepInput and WorkflowStepConnection - normalizing the previous table workflow_step_connection on input step and input name.

In terms of current hacks - by confining all of tool state to a big JSON blob in the database, we have problems distinguishing keys from values when walking tool state. As we store more and more JSON blobs inside of the giant tool state blob, this problem gets worse. Take for instance checking for runtime parameters or the rules parameter values - these both use JSON blobs that aren't simple values, so looking at the tool state blob in the database or the workflow export, it is hard to tell what is a key and what is a value. Tracking state as normalized inputs with default values and explicit runtime attributes should allow much more precise state definition and construction.

This variant of the models would also potentially allow defining runtime values with non-tool default values (so default values defined for the workflow but still explicitly settable at runtime). The combinations of overriding defaults and defining runtime values were not representable before.

In terms of future enhancements, there is a lot we cannot track with the current models - such as map/reduce options for collection operations (galaxyproject#4623 (comment)). This should enable a lot of that. Obviously there are a lot of attributes defined here that are not yet utilized, but I'm using most (all?) of them downstream in the CWL branch. I'd rather populate this table fully realized and fill in the implementation around it as work continues to stream in from the CWL branch - to keep things simple and avoid extra database migrations. But I understand if this feels like speculative complexity we want to avoid despite the implementation being readily available for inspection downstream.
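The shape of the split described above can be sketched with plain dataclasses (a hedged illustration only - the real Galaxy models are SQLAlchemy-mapped and define many more attributes; the field names here are assumptions based on the description):

```python
# Hypothetical sketch of the normalized split: one WorkflowStepInput record
# per (step, input name), with connections referencing that record instead of
# repeating the input step id and input name themselves.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class WorkflowStepInput:
    workflow_step_id: int
    name: str
    default_value: Optional[Any] = None  # workflow-level default, overridable
    runtime_value: bool = False          # explicitly settable at runtime

@dataclass
class WorkflowStepConnection:
    # A connection now points at an input record - the normalization axis.
    input: WorkflowStepInput
    output_step_id: int
    output_name: str

# A workflow-level default combined with an explicit runtime value - the
# combination the old single-table model could not represent:
inp = WorkflowStepInput(workflow_step_id=2, name="input1",
                        default_value="10", runtime_value=True)
conn = WorkflowStepConnection(input=inp, output_step_id=1,
                              output_name="out_file1")
```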
Co-Authored-By: jmchilton <jmchilton@gmail.com>
- Allow workflows to be uploaded via path using the GUI and API for admins.
- Track the path and resync workflows on save (both .ga and format 2 workflows) - this will allow Planemo to leverage the Galaxy workflow editor to interactively save workflows.
A lot more could be done here - support for multiple="true" parameters, collections, testing with identifiers, better error reporting if DatasetWrapper methods are used. I'll follow up with those things if people find the current approach acceptable.

Implements galaxyproject#1682.
The user is prompted for a JavaScript expression, which is in turn run once per dataset in a list and used as a filter. If the JavaScript evaluates to a Python-truthy value, the HDA is copied into the output dataset (without duplicating the data on disk).

The JavaScript expression is supplied various HDA attributes in the environment (currently all metadata values, file_size, file_ext, and dbkey). The supplied test case filters out datasets that do not contain an even number of lines.
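The filtering semantics can be sketched in Python (a hypothetical stand-in for the JavaScript evaluation; the attribute names mirror those listed above, but the helper is illustrative, not the tool's implementation):

```python
# Sketch of the filter semantics: evaluate an expression once per dataset in
# the list, with HDA attributes in the environment, and keep the HDA when the
# result is truthy (no data is copied on disk in the real tool).
def filter_list(hdas, expression):
    kept = []
    for hda in hdas:
        # env mirrors what the tool supplies: all metadata values plus
        # file_size, file_ext and dbkey.
        env = dict(hda.get("metadata", {}))
        env.update({k: hda[k] for k in ("file_size", "file_ext", "dbkey")
                    if k in hda})
        if eval(expression, {"__builtins__": {}}, env):  # truthy -> keep
            kept.append(hda)
    return kept

# e.g. keep only datasets with an even number of lines (as in the test case):
datasets = [
    {"name": "a", "file_ext": "txt", "file_size": 10, "metadata": {"lines": 4}},
    {"name": "b", "file_ext": "txt", "file_size": 12, "metadata": {"lines": 3}},
]
even = filter_list(datasets, "lines % 2 == 0")
```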

Testing:

```
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_filter_0
```
Takes in a list dataset collection and produces a list of lists, keying the outer list on a user-supplied function. This reuses the JavaScript expression code used by the filter model tool.

Testing:

```
./run_tests.sh -framework -id __GROUP__
```
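The grouping behavior can be sketched as follows (illustrative Python rather than the tool's JavaScript evaluation; the helper and sample data are hypothetical):

```python
# Sketch of the grouping semantics: key each element of a flat list with a
# user-supplied expression and emit a list of lists, one inner list per key.
def group_list(hdas, key_expression):
    groups = {}
    for hda in hdas:
        key = eval(key_expression, {"__builtins__": {}}, dict(hda))
        groups.setdefault(key, []).append(hda)
    return list(groups.values())

datasets = [
    {"name": "s1.fastq", "file_ext": "fastq"},
    {"name": "s2.bam", "file_ext": "bam"},
    {"name": "s3.fastq", "file_ext": "fastq"},
]
# Group the list into a list of lists by extension:
nested = group_list(datasets, "file_ext")
```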
 - Introduce models and a API for creating tools dynamically.
 - Use Galaxy's testing-only YAML based representation of tools to prototype this.
 - Extend Format 2 workflow definitions to allow embedding tools in workflows, either inline or using a CWL-style @import syntax.

Testing:

Test cases demonstrating tools can be imported (only by admins) and are runnable are included with this commit. More test cases regarding workflow use of dynamic tools and Format 2 workflow definition extensions are also included.

These tests can be run with the following commands:

```
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_nonadmin_users_cannot_create_tools
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_dynamic_tool_1
./run_tests.sh -api test/api/test_workflows.py:WorkflowsApiTestCase.test_import_export_dynamic
./run_tests.sh -api test/api/test_workflows_from_yaml.py:WorkflowsFromYamlApiTestCase.test_workflow_embed_tool
./run_tests.sh -api test/api/test_workflows_from_yaml.py:WorkflowsFromYamlApiTestCase.test_workflow_import_tool
```
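A request body for creating such a tool dynamically might look like the following sketch (the payload keys and tool definition are assumptions based on the commit text and Galaxy's testing-only YAML tool format, not a documented contract):

```python
# Hypothetical sketch of a dynamic tool creation request. The representation
# uses Galaxy's testing-only YAML-based tool model, here as a plain dict.
import json

payload = {
    "representation": {
        "class": "GalaxyTool",
        "id": "cat_dynamic",          # illustrative tool id
        "command": "cat '$input1' > '$output1'",
        "inputs": [{"type": "data", "name": "input1"}],
        "outputs": {"output1": {"format": "txt"}},
    },
}
body = json.dumps(payload)
# An admin API key would be required when POSTing this body to the
# dynamic tool creation endpoint (non-admins are rejected, per the tests).
```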
- Tool definition language, plumbing, and a datatype for expressing expressions as jobs.
- Allow connecting expression tools to parameters in workflows; this delays evaluation of the workflow so the calculated value can be used.
- Example expression tools for testing and demonstration.
- [WIP] Workflow expression module to allow users to specify arbitrary expressions.
Existing dataset collection types are meant to be homogeneous - all datasets of the same type. This introduces CWL-style record dataset collections.
…rmats.

This should support a subset of [draft-3](http://www.commonwl.org/draft-3/) and [v1.0](http://www.commonwl.org/v1.0/) tools.

CWL Support (Tools):
--------------------

- Implemented integer, long, float, double, boolean, string, File, Directory, "null", Any, as well as records and arrays thereof. There are two approaches to handling more complex parameters discussed here (common-workflow-lab#59).
- ``secondaryFiles`` that are actual Files are implemented, secondaryFiles containing directories are not yet implemented.
- ``InlineJavascriptRequirement`` is supported to define output files (see the ``test_cat3`` test case).
- ``EnvVarRequirement``s are supported (see the ``test_env_tool1`` and ``test_env_tool2`` test cases).
- Expression tools are supported (see ``parseInt-tool`` test case).
- Shell tools are also supported (see the record output test case).
- Default File values are very un-Galaxy and have been hacked in to work with tools - they still don't work with workflows.
- Partial Docker support - this supports the simplest and most common pullFrom semantics, but not additional ways to fetch containers or additional options such as output directory configuration (https://github.com/common-workflow-language/galaxy/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aopen%20Docker). Additionally, Galaxy mounts the inputs and outputs where it wants instead of at the CWL-required mount points - this needs to be fixed for the conformance tests but may not matter much in practice (I'm not sure).

CWL Support (Workflows):
------------------------

- Simple connections and tool execution.
- Overriding tool input defaults via literal values and simple expressions.
- MultipleInputFeatureRequirement to glue together multiple File inputs into a File[], or multiple File[] into a single flat File[] (nested merge is still a TODO).
- Simple scatter semantics for Files and non-Files (e.g. count-lines3).
- Simple subworkflows (e.g. count-lines10).
- Simple valueFrom expressions (e.g. ``step-valueFrom`` and ``step-valueFrom2``). This work doesn't yet model non-tool parameters to steps, so complex ``valueFrom`` expressions like the one in ``step-valueFrom3`` do not work yet.

Remaining Work
---------------------------------

The work remaining is vast and will be tracked at https://github.com/common-workflow-language/galaxy/issues for the time being.

Implementation Notes:
----------------------

Tools:

 - Non-File CWL outputs are represented as ``expression.json`` files. Traditionally Galaxy hasn't supported non-File outputs from tools, but CWL Galaxy has work in progress on bringing native Galaxy support for such outputs (common-workflow-lab#27).
 - CWL secondary files are just normal datasets, with extra files stored in a ``__secondary_files__`` directory inside the dataset's ``extra_files_path`` directory and indexed in a file called ``__secondary_files_index.json`` in ``extra_files_path``. The upload tool has been augmented to allow attaching arbitrary extra files as a tar file, to support getting data into this format initially. CWL requires staged files to include their parent File's ``basename``, but tools describe inputs as just the extension. I'm not sure which way Galaxy should store ``__secondary_files__`` in its object store - just with the extension, or with the basename and extension - both options are implemented and can be swapped by setting the boolean ``STORE_SECONDARY_FILES_WITH_BASENAME`` in ``galaxy.tools.cwl.util``.
 - CWL Directory types are datasets of a new type "directory" implemented earlier in this branch.
 - The tool execution API has been extended with an ``inputs_representation`` parameter that can now be set to "cwl". The ``cwl`` representation for running tools corresponds to the CWL job json format, with ``{"class": "File", "path": "/path/to/file"}`` inputs replaced by ``{"src": "hda", "id": "<dataset_id>"}``. Code for building these requests from CWL job json is available in the test class.
 - Since the CWL <-> Galaxy parameter translation may change over time - for instance if Galaxy develops or refines parameter classes - the CWL state and its version are tracked in the database, so that for reruns etc. we could hopefully update Galaxy state from an older version to a newer one.
 - CWL allows output parameters to be either ``File`` or non-``File`` and determined at runtime, so ``galaxy.json`` is used to dynamically adjust output extension as needed for non-``File`` parameters.
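Putting the ``inputs_representation`` point together, a tool run request might look like the following sketch (the tool id, history id, and dataset id are placeholders, and the exact request fields are assumptions based on the description above):

```python
# Sketch of an "inputs_representation": "cwl" tool run request: CWL job-json
# style inputs, with File entries replaced by Galaxy dataset references.
import json

cwl_job = {
    # was {"class": "File", "path": "/path/to/file"} in plain CWL job json:
    "input_file": {"src": "hda", "id": "<dataset_id>"},
    "threshold": 0.5,                 # non-File inputs pass through as-is
}
request = {
    "tool_id": "cat3-tool",           # illustrative CWL tool id
    "history_id": "<history_id>",
    "inputs_representation": "cwl",
    "inputs": json.dumps(cwl_job),
}
```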

Workflows:

- This work serializes embedded and referenced tools into the database. This will allow reuse and tracing without requiring the path to exist forever on the filesystem, but will have problems with default file references in workflows.
- Implements re-mapping CWL workflow connections to Galaxy input connections.
- Fix tool serialization for jobs for path-less tools (such as embedded tools).
- Hack tool state during workflow import for CWL.
- The sort of dynamic shaping of inputs CWL allows has required enhancing Galaxy's map/reduce stuff to allow mapping over dynamic collections that don't yet exist at the time of tool execution and need to be created on the fly. This commit creates them as HDCAs - but likely they should be something else that doesn't appear in the history panel.
- Multi-input scattering, but only scatterMethod == "dotproduct" is currently supported. The other scatter methods (nested_crossproduct and flat_crossproduct) are not used by workflows in the GA4GH challenge.

Implementation Description:
-----------------------------

The reference implementation Python library (mainly developed by Peter Amstutz - https://github.com/common-workflow-language/common-workflow-language/tree/master/reference) is used to load tool files ending with ``.json`` or ``.cwl`` and proxy objects are created to adapt these tools to Galaxy representations. In particular input and output descriptions are loaded from the tool.

When the tool is submitted, a specialized tool class is used to build a cwltool-compatible job description from the supplied Galaxy inputs, and the CWL reference implementation is used to generate a CWL reference implementation Job object. A command line is generated from this Job object.

As a result, Galaxy largely does not need to worry about the details of command-line adapters, expressions, etc.

Galaxy writes a description of the CWL job that it can reload to the job working directory. After the process is complete (on the Galaxy compute server, but outside the Docker container) this representation is reloaded and the dynamic outputs are discovered and moved to fixed locations as expected by Galaxy. CWL allows for much more expressive output locations than Galaxy, for better or worse, and this step uses cwltool to adapt CWL to Galaxy outputs.

Currently all ``File`` outputs are sniffed to determine a Galaxy datatype. CWL allows refinement on this, and that remains work to be done:

  1) CWL should support EDAM declaration of types, and Galaxy should provide a mapping to core datatypes so sniffing can be skipped if types are found.
  2) For finer-grained control within Galaxy, extensions to CWL should allow setting actual Galaxy output types on outputs. (The distinction between fastq and fastqsanger in Galaxy is very important, for instance.)
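Point 1 could be sketched like this (the mapping table and function are hypothetical; the EDAM ids shown are intended as real format identifiers, but the Galaxy-side mapping here is purely illustrative):

```python
# Hypothetical sketch: map EDAM format ids declared by a CWL output to Galaxy
# datatype extensions, falling back to sniffing when no mapping exists.
EDAM_TO_GALAXY = {
    # illustrative entries only - a real table would be much larger
    "http://edamontology.org/format_1930": "fastq",
    "http://edamontology.org/format_2572": "bam",
    "http://edamontology.org/format_3016": "vcf",
}

def resolve_extension(declared_format, sniff):
    """Use the declared EDAM format when known; otherwise fall back to sniffing."""
    ext = EDAM_TO_GALAXY.get(declared_format)
    return ext if ext is not None else sniff()

# A declared BAM output skips sniffing entirely:
ext = resolve_extension("http://edamontology.org/format_2572",
                        lambda: "data")
```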

Implementation Links:
----------------------

Hundreds of commits have been rebased into this one, so the details of individual parts of the implementation and how they built on each other are not entirely clear. To see the original ideas behind individual features, here are some relevant links:

- Implement merge_nested link semantics for workflow steps (common-workflow-lab@a903abd).
- Implement subworkflows in CWL (common-workflow-lab@9933c3c)
- MultipleInputFeatureRequirements:
  - Second attempt: common-workflow-lab@ed8307f
  - First attempt: common-workflow-lab@ae11f56
- Basic, implicit dotproduct scattering of workflows - common-workflow-lab@d1ad64e.
- Simple input StepInputExpressionRequirements - common-workflow-lab@819a27b
- StepInputExpressionRequirements for multiple inputs - common-workflow-lab@5e7f622
- Record Types in CWL - common-workflow-lab@e6be28a
- Rework original approach at mapping CWL state to tool state - common-workflow-lab@669ea55
- Rework approach at mapping CWL state to tool state again to use "FieldTypeToolParameter"s - implements default values, optional parameters, and union types for workflow inputs. common-workflow-lab@d1ca22f
- Initial tracking of "cwl_filename" for CWL jobs (common-workflow-lab@67ffc55).
- Reworked secondary file staging, implement testing and indexing of secondary files - common-workflow-lab@03d1636.

Testing:
---------------------

    % git clone https://github.com/common-workflow-language/galaxy.git
    % cd galaxy
    % git checkout cwl-1.0

Start Galaxy.

    % GALAXY_RUN_WITH_TEST_TOOLS=1 sh run.sh

Open http://localhost:8080/ and see CWL test tools (along with all Galaxy test tools) in left hand tool panel.

To go a step further and actually run CWL jobs within their designated Docker containers, copy the following minimal Galaxy job configuration file to ``config/job_conf.xml``. (Adjust the ``docker_sudo`` parameter based on how you execute Docker).

https://gist.github.com/jmchilton/3997fa471d1b4c556966

Run API tests demonstrating the various CWL demo tools with the following command.

```
./run_tests.sh -api test/api/test_tools_cwl.py
./run_tests.sh -api test/api/test_workflows_cwl.py
./run_tests.sh -api test/api/test_cwl_conformance_v1_0.py
```

An individual conformance test can be run using this pattern:

```
./run_tests.sh -api test/api/test_cwl_conformance_v1_0.py:CwlConformanceTestCase.test_conformance_v1_0_6
```

The first two execute various tool and workflow test cases manually crafted during implementation of this work. The third is an auto-generated test case class containing a Python test for every CWL conformance test found in the reference specification.

Issues and Contact
---------------------------------

Report issues at https://github.com/common-workflow-language/galaxy/issues and feel free to ping jmchilton on the CWL [Gitter channel](https://gitter.im/common-workflow-language/common-workflow-language).
We rely too much on history.state, which is quite broken with empty collections, etc.
- display output json like cwltool, as a single indented json
  object (to allow running conformance tests)
- download data (still needs a lot of work)
jra001k and others added 27 commits November 15, 2018 18:47
The simple_value() method is duplicated for diff clarity (both copies are identical); the two methods should be merged into one after this commit.
- New scripts for local fast CI, using docker postgres.
When importing a packed workflow into Galaxy, the exception below occurs:

Traceback (most recent call last):
  File "lib/galaxy/web/framework/decorators.py", line 283, in decorator
    rval = func(self, trans, *args, **kwargs)
  File "lib/galaxy/webapps/galaxy/api/workflows.py", line 337, in create
    return self.__api_import_from_archive(trans, archive_data, source="uploaded file", from_path=os.path.abspath(uploaded_file_name))
  File "lib/galaxy/webapps/galaxy/api/workflows.py", line 596, in __api_import_from_archive
    raw_workflow_description = self.__normalize_workflow(trans, data)
  File "lib/galaxy/webapps/galaxy/api/workflows.py", line 679, in __normalize_workflow
    return self.workflow_contents_manager.normalize_workflow_format(trans, as_dict)
  File "lib/galaxy/managers/workflows.py", line 315, in normalize_workflow_format
    workflow_path += "#" + object_id
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'

This commit adds 'src' and 'path' attributes to the 'data' dict to prevent the exception.
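The failure mode can be sketched as follows (a simplified illustration of the crash site in ``normalize_workflow_format``; the real fix populates the 'src' and 'path' attributes upstream rather than guarding here):

```python
# Simplified sketch of the crash: workflow_path is None when the archive was
# uploaded rather than read from a path, so "workflow_path += '#' + object_id"
# raises TypeError (NoneType + str). A None check avoids the crash.
def normalize_workflow_path(workflow_path, object_id):
    if workflow_path is not None and object_id is not None:
        workflow_path += "#" + object_id  # only valid for path-based imports
    return workflow_path
```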
Provides the same functionality as 6e675f3, but maps only tar files marked as 'directory' (with the type combobox in the upload tool).
This error was due to the directory being extracted in /tmp, which is not visible to Docker.
@jmchilton
Merged changes in manually and rebased the whole branch against the latest Galaxy development release. Thanks!

@jmchilton closed this Mar 4, 2019