Merge branch 'dev2'
ThomasHickman committed Jan 11, 2018
2 parents a6f765a + e129596 commit 0abd915
Showing 12 changed files with 225 additions and 171 deletions.
18 changes: 15 additions & 3 deletions CHANGELOG
@@ -1,8 +1,20 @@
v1.4
- Added no_docker command line option, changed docker_container_name to docker_image_name
v2.0
- Added no_docker command line option
- Fixed behaviour of array and tag types
- Rewrote gen_cwl_arg to be clearer
- Reduced the dependence on the command line for the tests
### Breaking changes:
- Removed the refIndex, refDict and analysis_type cwl parameters
- Changed docker_container_name to docker_image_name

v1.4.2
- Use full GATK versions in release paths (i.e. "3.8-0" instead of just "3-8") and refer to specific version of latest GATK 4 beta release ("4.beta.6" rather than "latest").

v1.4.1
- Add releases in .tgz and .tar.bz2 archive formats. Add gatk-cwl-generator version to release filenames.

v1.4
- Changes generated CWL to YAML format rather than JSON for improved readability.

v1.3
- Changed the docker container to be the Broad Institute's official docker container, not wtsi-hgi's own container
@@ -13,6 +25,6 @@ v1.2.2
v1.2.1
- Array types are type | type[]

v1.1:
v1.1:
- Made output arguments' ids be <NAME>Output instead of adding dashes
- Outputting to gatk_cmdline_tools
26 changes: 16 additions & 10 deletions README.md
@@ -21,10 +21,10 @@ You may also want to install [cwltool](https://github.com/common-workflow-langua
## Usage

```
usage: gatk_cwl_generator [-h] [--version VERSION] [--out OUTPUT_DIR]
[--include INCLUDE] [--dev] [--no_docker]
[--docker_image_name DOCKER_IMAGE_NAME]
[--gatk_command GATK_COMMAND]
usage: gatkcwlgenerator [-h] [--version VERSION] [--verbose] [--out OUTPUT_DIR]
[--include INCLUDE] [--dev] [--use_cache [CACHE_LOCATION]]
[--no_docker] [--docker_image_name DOCKER_IMAGE_NAME]
[--gatk_command GATK_COMMAND]
Generates CWL files from the GATK documentation
@@ -33,14 +33,18 @@ optional arguments:
--version VERSION, -v VERSION
Sets the version of GATK to parse documentation for.
Default is 3.5-0
--verbose Set the logging to be verbose. Default is False.
--out OUTPUT_DIR, -o OUTPUT_DIR
Sets the output directory for generated files. Default
is ./gatk_cmdline_tools/<VERSION>/
--include INCLUDE Only generate this file (note, CommandLineGATK has to
be generated for v3.x)
--dev Enable network caching and overwriting of the
generated files (for development purposes). Requires
--dev Enable --use_cache and overwriting of the generated
files (for development purposes). Requires
requests_cache to be installed
--use_cache [CACHE_LOCATION]
Use requests_cache, using the cache at CACHE_LOCATION,
or 'cache' if not specified. Default is False.
--no_docker Make the generated CWL files not use docker
containers. Default is False.
--docker_image_name DOCKER_IMAGE_NAME, -c DOCKER_IMAGE_NAME
@@ -53,9 +57,9 @@ optional arguments:
/gatk/gatk.jar' for gatk 4.x
```
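The `--use_cache [CACHE_LOCATION]` flag in the usage text takes an optional value. One way to get that behaviour from `argparse` is `nargs="?"` combined with a `const` fallback. The sketch below is an assumption about how such a flag could be declared, not the project's actual parser; the `build_parser` helper is hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser showing only an optional-value --use_cache flag."""
    parser = argparse.ArgumentParser(prog="gatkcwlgenerator")
    parser.add_argument(
        "--use_cache",
        nargs="?",        # the value after the flag is optional
        const="cache",    # used when the flag is given without a value
        default=False,    # used when the flag is absent
        metavar="CACHE_LOCATION",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).use_cache)                         # False
print(parser.parse_args(["--use_cache"]).use_cache)            # cache
print(parser.parse_args(["--use_cache", "my_dir"]).use_cache)  # my_dir
```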

This has been tested on versions 3.5-3.8 and generates files for version 4 (though some parameters are unknown and default to outputting a string).
This has been tested on versions 3.5-0 to 3.8-0 and 4.beta.6.

The input parameters are the same as in the documentation, with the addition of `refIndex` and `refDict` which are required parameters that specify the index and dict file of the reference genome.
The parameters generated are the same as you would need to specify on the command line, with "--" stripped from the beginning.

To add tags to arguments that have a file type, add to the parameter `<NAME>_tags`. e.g. to output the parameter `--variant:vcf path\to\file`, use the input:
```yml
@@ -66,6 +70,8 @@ variant:
variant_tags: [vcf]
```
For convenience, you can also specify any array input argument as a single element.
The cwl files will be outputted to `gatk_cmdline_tools/<VERSION>/cwl` and the JSON files given by the documentation to `gatk_cmdline_tools/<VERSION>/json`.
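
The mapping from a `<NAME>_tags` input to a tagged command-line flag can be sketched in Python. This is only an illustration of the behaviour described above (`--variant:vcf path/to/file`, and a single element accepted in place of an array); the `tagged_argument` function is hypothetical, and it assumes one tag per value, which may not match everything the generated JavaScript supports.

```python
def tagged_argument(prefix, values, tags=None):
    """Return command-line tokens for one argument, adding a tag to each flag.

    Hypothetical helper: assumes at most one tag per value.
    """
    if values is None:
        return []
    # Accept a single element where an array is expected.
    if not isinstance(values, list):
        values = [values]
    if tags is None:
        tags = [None] * len(values)
    elif not isinstance(tags, list):
        tags = [tags]
    if len(tags) != len(values):
        raise ValueError("expected one tag entry per value")

    tokens = []
    for value, tag in zip(values, tags):
        flag = prefix if not tag else "{}:{}".format(prefix, tag)
        tokens.extend([flag, value])
    return tokens


print(tagged_argument("--variant", "path/to/file", ["vcf"]))
# ['--variant:vcf', 'path/to/file']
```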

## Generated CWL files
@@ -99,8 +105,8 @@ You can also run the tests in parallel with `-n` to improve performance
## Limitations:

- The parameter `annotation` (for example, in [HaplotypeCaller](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--annotation)) is specified to take in a string in the generated CWL file, not an enumeration of all the possible options
- All parameters that you can pass to read filters that don't conflict with tool parameters are included and they are marked as optional and no default arguments are specified
- All parameters that you can pass to read filters that don't conflict with tool parameters are included and they are marked as optional

## Creating a new version

To create a `gatk_cmdline_tools.zip` zip file containing all the generated cwl files for gatk versions 3.5, 3.6, 3.7 and 3.8, run `bash build.sh`. This file is uploaded as a release on GitHub for every new release of this package.
To create a `gatk_cmdline_tools.zip` zip file containing all the generated cwl files for gatk versions 3.5, 3.6, 3.7, 3.8 and 4.beta.6, run `bash build.sh`. This file is uploaded as a release on GitHub for every new release of this package.
35 changes: 20 additions & 15 deletions build.sh
@@ -11,19 +11,22 @@ tarbase="gatk-cwl-generator-${generator_version}-gatk_cmdline_tools"

tmpdir=$(mktemp -d)
python_bin=$(which python3)
echo "Using ${python_bin} to generate temporary virtualenv ${tmpdir}/venv"
set -x
${python_bin} -m virtualenv "${tmpdir}/venv"
set +x
echo "Activating virtualenv in ${tmpdir}/venv"
set +u # virtualenv activate script references unset vars
. "${tmpdir}/venv/bin/activate"
set -u

echo "Installing requirements in virtualenv"
set -x
pip install -r requirements.txt
set +x
if [ -z "${USE_EXISTING_PYTHON+x}" ]; then
echo "Using ${python_bin} to generate temporary virtualenv ${tmpdir}/venv"
set -x
${python_bin} -m virtualenv "${tmpdir}/venv" -p python3
set +x
echo "Activating virtualenv in ${tmpdir}/venv"
set +u # virtualenv activate script references unset vars
. "${tmpdir}/venv/bin/activate"
set -u
echo "Installing requirements in virtualenv"
set -x
pip install -r requirements.txt
set +x
else
echo "Using existing python environment"
fi

builddir="${tmpdir}/${tarbase}"
mkdir -p "${builddir}"
@@ -37,8 +40,10 @@ do
set +x
done

echo "Deactivating virtualenv"
deactivate
if [ -z "${USE_EXISTING_PYTHON+x}" ]; then
echo "Deactivating virtualenv"
deactivate
fi

echo "Generating zip file"
set -x
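The `[ -z "${USE_EXISTING_PYTHON+x}" ]` test in build.sh checks whether the variable is unset, not whether it is empty: `${VAR+x}` expands to `x` whenever `VAR` is set, even to an empty string. A rough Python equivalent, with a hypothetical function name, looks like this:

```python
import os


def should_create_virtualenv(environ=None):
    """True only when USE_EXISTING_PYTHON is absent from the environment."""
    if environ is None:
        environ = os.environ
    # Presence is what matters, mirroring `${VAR+x}` in the shell script;
    # an empty value still counts as "set".
    return "USE_EXISTING_PYTHON" not in environ


print(should_create_virtualenv({}))                            # True
print(should_create_virtualenv({"USE_EXISTING_PYTHON": ""}))   # False
print(should_create_virtualenv({"USE_EXISTING_PYTHON": "1"}))  # False
```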
12 changes: 3 additions & 9 deletions examples/HaplotypeCaller_inputs.yml
Original file line number Diff line number Diff line change
@@ -1,18 +1,12 @@
# Example cwl inputs to GATK3's HaplotypeCaller tool

reference_sequence:
class: File
#path: /path/to/fasta/ref/file
path: ../cwl-example-data/chr22_cwl_test.fa
refIndex:
class: File
#path: /path/to/index/file
path: ../cwl-example-data/chr22_cwl_test.fa.fai
refDict:
class: File
#path: /path/to/dict/file
path: ../cwl-example-data/chr22_cwl_test.fa.dict
input_file: #must be BAM or CRAM
class: File
#path: /path/to/input/file
path: ../cwl-example-data/chr22_cwl_test.cram
out: out.gvcf.gz
intervals: [chr22:10591400-10591600]
intervals: chr22:10591400-10591600
2 changes: 1 addition & 1 deletion gatkcwlgenerator/VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
1.4.0
2.0
2 changes: 1 addition & 1 deletion gatkcwlgenerator/cwl_ast.py
Original file line number Diff line number Diff line change
Expand Up @@ -118,4 +118,4 @@ def get_cwl_object(self):
return [
"null",
inner_cwl_object
]
]
32 changes: 23 additions & 9 deletions gatkcwlgenerator/gen_cwl_arg.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,12 @@ def GATK_type_to_CWL_type(gatk_type):
# Example: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_coverage_DepthOfCoverage.php#--partitionType
"partition": ["readgroup", "sample", "library", "platform", "center",
"sample_by_platform", "sample_by_center", "sample_by_platform_by_center"],
"type": ['INDEL', 'SNP', 'MIXED', 'MNP', 'SYMBOLIC', 'NO_VARIATION']
# NOTE: this actually refers to VariantContext.Type in the gatk 3 source code
"type": ['INDEL', 'SNP', 'MIXED', 'MNP', 'SYMBOLIC', 'NO_VARIATION'],
# from https://git.io/vNmFy
"sparkcollectors": ["CollectInsertSizeMetrics", "CollectQualityYieldMetrics"],
# from https://git.io/vNmAe
"metricaccumulationlevel": ["ALL_READS", "SAMPLE", "LIBRARY", "READ_GROUP"]
}

gatk_type = gatk_type.lower()
@@ -86,6 +91,10 @@ def get_base_CWL_type_for_argument(argument):
elif is_output_argument(argument):
gatk_type = "string"

# Patch --reference in gatk 4
if prefix == "--reference":
gatk_type = "File"

try:
cwl_type = GATK_type_to_CWL_type(gatk_type)
except UnknownGATKTypeError as error:
@@ -114,7 +123,12 @@ def get_output_default_arg(argument):
"""
# Output types are defined to be keys of output_type_to_file_ext, so
# this should not error
return get_arg_id(argument) + output_type_to_file_ext[argument["type"]]
for output_type in output_type_to_file_ext:
if output_type in argument["type"]:
return get_arg_id(argument) + output_type_to_file_ext[output_type]

# The definition of is_output_argument should mean this is never reached
raise Exception("Output argument should be defined in output_type_to_file_ext")

def get_input_objects(argument):
"""
@@ -141,10 +155,10 @@ def handle_required(typ):

array_node = cwl_type.find_node(lambda node: isinstance(node, CWLArrayType))
if array_node is not None:
if has_file_type:
array_node.add_input_binding({
"valueFrom": "$(null)"
})
# NOTE: this is fixing the issue at https://github.com/common-workflow-language/cwltool/issues/593
array_node.add_input_binding({
"valueFrom": "$(null)"
})

has_array_type = True

@@ -172,9 +186,9 @@ def handle_required(typ):
if is_arg_with_default(argument) and is_output_argument(argument):
base_cwl_arg["default"] = get_output_default_arg(argument)

if argument["name"] == "--reference_sequence":
if arg_id == "reference_sequence" or arg_id == "reference":
base_cwl_arg["secondaryFiles"] = [".fai", "^.dict"]
elif "requires" in argument["fulltext"] and "files" in argument["fulltext"]:
elif arg_id == "input_file" or arg_id == "input":
base_cwl_arg["secondaryFiles"] = "$(self.basename + self.nameext.replace('m','i'))"

if has_file_type:
@@ -217,7 +231,7 @@ def is_output_argument(argument):
"""
Returns whether this argument's type indicates it's an output argument
"""
return argument["type"] in output_type_to_file_ext.keys()
return any(output_type in argument["type"] for output_type in output_type_to_file_ext)


def get_output_json(argument):
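The change to `is_output_argument` replaces an exact key lookup with a substring match, so a wrapped type such as a list of writers still counts as an output. A minimal sketch of that idea follows. The contents of `OUTPUT_TYPE_TO_FILE_EXT` and the way the argument id is derived are assumptions made for illustration, not the project's real table or its `get_arg_id`.

```python
# Hypothetical mapping; the real output_type_to_file_ext is not shown in this diff.
OUTPUT_TYPE_TO_FILE_EXT = {
    "GATKSAMFileWriter": ".bam",
    "PrintStream": ".txt",
    "VariantContextWriter": ".vcf.gz",
}


def is_output_argument(argument):
    """An argument is an output if any known output type appears in its type string."""
    return any(output_type in argument["type"] for output_type in OUTPUT_TYPE_TO_FILE_EXT)


def get_output_default_arg(argument):
    """Build a default file name from the argument id and the matching extension."""
    arg_id = argument["name"].lstrip("-")  # assumption: id is the name without dashes
    for output_type, file_ext in OUTPUT_TYPE_TO_FILE_EXT.items():
        if output_type in argument["type"]:
            return arg_id + file_ext
    raise ValueError("not an output argument: {}".format(argument["name"]))


wrapped = {"name": "--out", "type": "List[VariantContextWriter]"}
print(is_output_argument(wrapped))      # True
print(get_output_default_arg(wrapped))  # out.vcf.gz
```

Both functions test the same direction (known type inside the argument's type string), so any argument accepted by the first always finds an extension in the second.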
6 changes: 6 additions & 0 deletions gatkcwlgenerator/helpers.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
"""
A collection of helper functions
"""

def is_gatk_3(version):
    return not version.startswith("4")
11 changes: 5 additions & 6 deletions gatkcwlgenerator/js_libary.js
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,7 @@ function applyTagsToArgument(prefix, tags){
return null;
}
else if(!tags){
if(Array.isArray(self)){
return generateArrayCmd(prefix);
}
else{
return [prefix, self];
}
return generateArrayCmd(prefix);
}
else{
function addTagToArgument(tagObject, argument){
@@ -53,6 +48,10 @@ function generateArrayCmd(prefix){
* The issue that this solves is documented here:
* https://www.biostars.org/p/258414/#260140
*/
if(!self){
return null;
}

if(!Array.isArray(self)){
self = [self];
}
29 changes: 13 additions & 16 deletions gatkcwlgenerator/json2cwl.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,10 @@
"""

from ruamel import yaml
from ruamel.yaml.scalarstring import PreservedScalarString
import os
from .gen_cwl_arg import get_input_objects, get_output_json, is_output_argument
from .helpers import is_gatk_3
import re

invalid_args = [
@@ -25,10 +27,7 @@ def cwl_generator(json_, cwl):
:param cwl: A skeleton of the cwl file, which this function will complete.
"""
outputs = []
inputs = [
{"doc": "Index file of reference genome", "type": "File", "id": "refIndex"},
{"doc": "Dict file of reference genome", "type": "File", "id": "refDict"}
]
inputs = []

for argument in json_['arguments']:
if not argument['name'] in invalid_args:
@@ -56,13 +55,6 @@ def get_js_libary():
with open(js_libary_path) as file:
return file.read()

# def minify_js(js):
# """
# Basic minification of javascript code
# """
# return re.sub("[/][*][\s\S]*?[*][/]", "", # remove /**/ comments
# js.replace(" ", "").replace("\n", "") # remove 4 spaces and new lines (assume semicolons exist)
# )

JS_LIBARY = get_js_libary()

@@ -71,13 +63,17 @@ def json2cwl(GATK_json, cwl_dir, cmd_line_options):
Make a cwl file with a given GATK json file in the cwl directory
"""

base_command = cmd_line_options.gatk_command.split(" ")

if is_gatk_3(cmd_line_options.version):
base_command.append("--analysis_type")

base_command.append(GATK_json['name'])

skeleton_cwl = {
'id': GATK_json['name'],
'cwlVersion': 'v1.0',
'baseCommand': cmd_line_options.gatk_command.split(" ") + [
"--analysis_type",
GATK_json['name']
],
'baseCommand': base_command,
'class': 'CommandLineTool',
'requirements': [
{
@@ -86,7 +82,7 @@ def json2cwl(GATK_json, cwl_dir, cmd_line_options):
{
"class": "InlineJavascriptRequirement",
"expressionLib": [
JS_LIBARY
PreservedScalarString(JS_LIBARY)
]
}
] + ([]
@@ -105,5 +101,6 @@ def json2cwl(GATK_json, cwl_dir, cmd_line_options):
GATK_json,
skeleton_cwl
)

yaml.round_trip_dump(skeleton_cwl, f) # write the file
f.close()
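
The effect of the new `base_command` logic in `json2cwl` can be seen in isolation with a small sketch. `is_gatk_3` is copied from helpers.py in this commit; `build_base_command` is a hypothetical wrapper, and the jar paths in the example calls are assumptions rather than the project's defaults.

```python
def is_gatk_3(version):
    # Same check as gatkcwlgenerator/helpers.py in this commit.
    return not version.startswith("4")


def build_base_command(gatk_command, version, tool_name):
    """Hypothetical wrapper around the baseCommand construction in json2cwl."""
    base_command = gatk_command.split(" ")
    # GATK 3 selects the tool via --analysis_type; GATK 4 takes the tool name directly.
    if is_gatk_3(version):
        base_command.append("--analysis_type")
    base_command.append(tool_name)
    return base_command


# The jar paths below are illustrative assumptions.
print(build_base_command("java -jar /gatk/GenomeAnalysisTK.jar", "3.8-0", "HaplotypeCaller"))
# ['java', '-jar', '/gatk/GenomeAnalysisTK.jar', '--analysis_type', 'HaplotypeCaller']
print(build_base_command("java -jar /gatk/gatk.jar", "4.beta.6", "HaplotypeCaller"))
# ['java', '-jar', '/gatk/gatk.jar', 'HaplotypeCaller']
```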