Workflow inputs definition and schema #4669
Comments
Hello there! I wanted to bring this project to your attention. It was released by Apple, so it should have some support, and I feel it addresses the needs for schema validation and progressive amendment really well (it also has a Java library). Let me know what you think!
I think it's common to keep the params bundled with the config profiles; a lot of params are ultimately just paths to reference files, and you will need different paths to the files if you are on HPC vs. cloud. It seems like this would break that. FWIW, so far the typing of "default values" for the params has been one of the recurring headaches I have had lately; need to have a way to support
It's been my experience that I have an R script where this is just one example, but it highlights a relatively common type of situation; the limitations with similarly, trying to have e.g. SLURM default paths vs. AWS S3 default paths, based on profile, supported in the
What I've been thinking is that the

As for your R example, I think the best practice is to encode that convention in the process that calls the R script, like so:

"""
Rscript script.R --val ${params.Rscript_val ?: 'NA'}
"""
Thinking more on this, and with some inspiration from the output DSL prototype, here's a sketch for an input DSL for fetchngs:

inputs {
    input {
        type Path // type: 'file'
        required true
        mimetype 'text/csv'
        pattern '^\\S+\\.(csv|tsv|txt)$'
        schema 'assets/schema_input.json'
        description 'File containing SRA/ENA/GEO/DDBJ identifiers one per line to download their associated metadata and FastQ files.'
    }
    // ...
    nf_core_pipeline {
        type String
        description '''Name of supported nf-core pipeline e.g. 'rnaseq'. A samplesheet for direct use with the pipeline will be created with the appropriate columns.'''
        enum 'rnaseq', 'atacseq', 'viralrecon', 'taxprofiler'
    }
    nf_core_rnaseq_strandedness {
        type String
        description '''Value for 'strandedness' entry added to samplesheet created when using '--nf_core_pipeline rnaseq'.'''
        help '''The default is 'auto' which can be used with nf-core/rnaseq v3.10 onwards to auto-detect strandedness during the pipeline execution.'''
        defaultValue 'auto'
    }
    // ...
}

Notes:
I really like the input DSL, but the circular dependency with the config is a problem. I have listed a few ideas to address this, though none of them are complete IMO. Maybe some combination of them will do the trick. Need to think further on the relationship between params, config, and script.
One way to solve the circular dependency might be to restrict the scope of params to only things that are actually workflow inputs. In other words, don't allow the config to reference params at all. Then you could define params in the pipeline code (like the output definition) and generate a YAML schema for use by external tools. The config file should still be able to set params. Nextflow would be able to validate them at runtime because it could evaluate the params definition before it evaluates the entry workflow and output definition (the only two places where params can be used). The params that are typically used in the config file tend to be external to the workflow itself, for example:
The main consequence is that you would only be able to use params to control workflow inputs and not config settings. Things that might previously have been an additional CLI option:

$ nextflow run nf-core/rnaseq --max_cpus 24

would become config:

$ cat extra.config
process.resourceLimits.cpus = 24
$ nextflow run nf-core/rnaseq -c extra.config

I think I like this tradeoff, though I can appreciate why many people might like the power and convenience of params as they currently work. Maybe this would be a good long-term goal to work towards. First we focus on incorporating the param schema, then we can think about adding a params definition alongside the entry workflow.
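To make the "define params in code, validate at runtime, generate a schema for external tools" idea concrete, here is a minimal Python sketch. It is not Nextflow code and not a proposed API: `ParamDef`, `validate`, and `to_schema` are hypothetical names, the three definitions are borrowed from the fetchngs sketch earlier in the thread, and the schema is emitted as JSON (via the standard library) rather than YAML purely for convenience.

```python
import json
import re
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ParamDef:
    """Hypothetical stand-in for one entry of a params definition."""
    type: type
    description: str = ""
    required: bool = False
    default: Any = None
    enum: tuple = ()
    pattern: Optional[str] = None


# Definitions borrowed from the fetchngs sketch (paths simplified to strings).
PARAM_DEFS = {
    "input": ParamDef(str, required=True, pattern=r"^\S+\.(csv|tsv|txt)$"),
    "nf_core_pipeline": ParamDef(
        str, enum=("rnaseq", "atacseq", "viralrecon", "taxprofiler")
    ),
    "nf_core_rnaseq_strandedness": ParamDef(str, default="auto"),
}


def validate(defs, given):
    """Apply defaults and check the given values against the definitions."""
    errors = [f"unknown param: {name}" for name in given if name not in defs]
    resolved = {}
    for name, d in defs.items():
        value = given.get(name, d.default)
        if value is None:
            if d.required:
                errors.append(f"missing required param: {name}")
            continue
        if not isinstance(value, d.type):
            errors.append(f"{name}: expected {d.type.__name__}")
            continue
        if d.enum and value not in d.enum:
            errors.append(f"{name}: must be one of {d.enum}")
        if d.pattern and not re.search(d.pattern, value):
            errors.append(f"{name}: does not match {d.pattern}")
        resolved[name] = value
    if errors:
        raise ValueError("; ".join(errors))
    return resolved


def to_schema(defs):
    """Export the definitions as a JSON-Schema-like dict for external tools."""
    type_names = {str: "string", int: "integer", float: "number", bool: "boolean"}
    properties = {}
    for name, d in defs.items():
        prop = {"type": type_names[d.type]}
        if d.default is not None:
            prop["default"] = d.default
        if d.enum:
            prop["enum"] = list(d.enum)
        if d.pattern:
            prop["pattern"] = d.pattern
        properties[name] = prop
    required = [name for name, d in defs.items() if d.required]
    return {"type": "object", "properties": properties, "required": required}


if __name__ == "__main__":
    print(validate(PARAM_DEFS, {"input": "ids.csv", "nf_core_pipeline": "rnaseq"}))
    print(json.dumps(to_schema(PARAM_DEFS), indent=2))
```

Because the definitions are plain data, they can be evaluated before anything else runs. That is the property that would let a runtime validate params ahead of the entry workflow, as long as the config never needs to read them.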
Spun off from #2723
Params can currently be defined in config files (including profiles), params files, CLI options, and the pipeline code itself. This creates the potential for much confusion around how these various sources are resolved (see #2662). Additionally, params are not typed, and while the CLI can cast command line params based on regular expressions, it can also backfire when e.g. a string param is given a value that "looks" like a number.
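The casting problem can be illustrated with a small Python sketch. The `cast_cli_value` function below is hypothetical and is not Nextflow's actual implementation; it only mimics the general idea of guessing a type from what a command-line string looks like.

```python
import re


def cast_cli_value(raw: str):
    """Guess a type from the text of a CLI value (hypothetical heuristic)."""
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    if re.fullmatch(r"-?\d+", raw):
        return int(raw)
    if re.fullmatch(r"-?\d+\.\d+", raw):
        return float(raw)
    return raw


if __name__ == "__main__":
    # A sample ID that happens to be all digits is turned into an integer,
    # silently dropping its leading zeros.
    print(repr(cast_cli_value("00123")))  # prints 123
    # A value that does not look numeric is left alone.
    print(repr(cast_cli_value("rnaseq")))  # prints 'rnaseq'
```

With a declared type for each param, the value `"00123"` could be kept as a string, because the type would come from the definition rather than from the shape of the text.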
Instead, params should be defined in a single place with metadata such as type, default value, description, etc. Benefits are:
The nf-core parameter schema (nextflow_schema.json) as well as the nf-validation plugin are excellent steps in this direction, and the solution may be to simply incorporate them into Nextflow.

For backwards compatibility, we may allow params to be set in config files and pipeline code, but this would essentially be overriding the default value rather than "defining" the param, and it should be discouraged in favor of putting everything in the parameter schema. That being said, it can be useful to set params from a profile, such as a test profile that provides some test data, so this use case should be supported.
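The distinction between "defining" a param and "overriding its default" can be sketched in Python as follows. This is only an illustration under stated assumptions: `resolve_params`, the layer names, and the `test_ids.csv` file name are hypothetical, and the layering order shown (schema defaults, then profile, then CLI) is one plausible choice rather than a description of how Nextflow resolves params today.

```python
def resolve_params(schema_defaults, *override_layers):
    """Start from the schema defaults and apply override layers in order.

    Later layers win. A layer may only override params that the schema
    already defines; it cannot introduce new ones.
    """
    resolved = dict(schema_defaults)
    for layer in override_layers:
        unknown = sorted(set(layer) - set(schema_defaults))
        if unknown:
            raise KeyError(f"params not defined in schema: {unknown}")
        resolved.update(layer)
    return resolved


# The single place where params are defined (values are the defaults).
SCHEMA_DEFAULTS = {
    "input": None,
    "nf_core_pipeline": None,
    "nf_core_rnaseq_strandedness": "auto",
}

# A test profile that supplies test data (hypothetical file name).
TEST_PROFILE = {"input": "test_ids.csv"}

# Values given on the command line.
CLI_OPTIONS = {"nf_core_pipeline": "rnaseq"}

if __name__ == "__main__":
    print(resolve_params(SCHEMA_DEFAULTS, TEST_PROFILE, CLI_OPTIONS))
```

Rejecting unknown names in an override layer is what keeps the schema the only place where a param can come into existence, while still supporting the test-profile use case.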
The main question that I see is whether the schema should be in a separate JSON/YAML file (as it currently is in nf-core) or in the pipeline code as part of the top-level workflow definition.