Evolution of Nextflow configuration file #2723

pditommaso · 2022-03-14T18:50:59Z

Rationale

The configuration file is a key component of the Nextflow framework since it allows workflow developers to decouple the pipeline logic from the execution parameters and infrastructure deployment settings.

The current configuration mechanism is extremely flexible and powerful. It uses a customised GroovySlurper class with extended to handle multiple profiles and config file inclusions.

A Nextflow configuration file is ultimately a compiled Java class on steroids.

However, this has serious drawbacks:

The evaluation of config file, requires Java class compilation and interpretation, which can be time-consuming compared to text file parsing.
The evaluation of the config requires ultimately the execution of JVM code, which could be explored by malicious users and pose a non-trivial security burden to be handled in a managed environment.
The current configuration allows the use of arbitrary imperative programming statements within a config file, which is a bad practice, making the config logic poorly readable.
The parser implementation is becoming including hard to be maintained, and patches easily introduce unexpected side-effects, resulting in increased stability issues.

Goals

Explore a new configuration mechanism that can be used in alternative to the current nextflow config
Have a well-defined structure and grammar
Human-readable
Allow grouping correlated setting
Allow the definition of named groups of settings (i.e. profiles)
Allow inheritance
Allow config files inclusion

Non-goals

Compatible with current nextlow.config syntax
Allowing custom functions

Action

This issue should explore possible alternative implementation and define a migration strategy.

Currently the best alternatives are:

Hashicorp HCL: this is essentially the configuration language used by Terraform and other Hashicorp tools. It's quite popular, however there isn't a complete parsed for JVM world, other than this project.
Ligthbed HCON: this is essentially, very curated configuration library. The syntax has a lot of similarities with the current nextflow.config syntax.

jorgeaguileraseqera · 2022-03-23T11:29:12Z

I think the problem comes from the current implementation because we are overwriting the current ConfigSlurper trying to do more things at the same time instead of first parsing the configuration and after applying them

For example, we implement the profile feature trying to merge the different configurations on the fly but maybe is to "easy" as use the environment feature implemented in ConfigSlurper (https://wjw465150.github.io/blog/Groovy/my_data/%E5%AE%98%E6%96%B9%E4%BE%8B%E5%AD%90/Groovy%20-%20ConfigSlurper.htm) as many profiles the user wants and after merge all of them

About malicious code we can execute the evaluation of the configuration into a secured context with minimal imports and no access to the rest of the application

pditommaso · 2022-03-24T17:17:19Z

Likely, another side of the problem is that to parse the file you need to compile the corresponding groovy code. Powerful but in some context cannot be done.

ewels · 2022-04-27T08:33:48Z

What do you think about wrapping some of the functionality from the @nf-core JSON Schema into the new config file format? I'm mostly thinking variable type, but enum, pattern and required would also be nice. It would be nice to move as much of the custom nf-core validation code into core-Nextflow as possible 😊 (we've previously been aiming at a plugin, but this would be even better)

Some of the longer-form things like description and help text may be better suited to staying in the JSON Schema.. Keeping that file also allows us to customise and play with future extensions easily, so I'm not suggesting we kill it completely (though I'm not completely against that either).

ewels · 2022-04-27T08:36:02Z

Also, an early heads-up: we make heavy use of includeConfig within @nf-core configs to pull remote config files (eg. shared nf-core/configs) and it works brilliantly. So if it's possible to keep this functionality of including arbitrary config URLs then that would be ideal 👍🏻

pditommaso · 2022-04-27T12:56:33Z

Static typing and JSON schema start to smell overkilling. IMO it should be the good parts of the current system i.e. variables interpolation, profiles, includeConfig without the bad parts ie closures, custom functions, conditional statements, etc

ewels · 2022-04-27T13:03:18Z

Yeah I thought you might say that 😅 But if it's optional then it needn't be much? Hypothetical example based on the current syntax:

params {
    foo = 'bar', required: true
    baz = 'qux', type: 'int'
}

For many people this would be sufficient to completely remove nextflow_schema.json and all dependencies on the nf-core toolchain, whilst giving much more power for catching user error.

mbwhelan · 2022-04-27T13:24:11Z

The current model has similarities to Spring boot configuration properties. Hierarchical config, profiles, variable interpretation are must haves. https://docs.spring.io/spring-boot/docs/current/reference/html/features.html#features.external-config

That model is great for Java apps. The validation logic though is separate and could be complex to recreate in a pipeline.

We want to enforce our pipelines to use nf-cores models, but it has to be customized for our internal needs. If some of those validation concepts or other items could be brought in natively or via plugins that would be awesome.

jorgeaguileraseqera · 2022-05-10T13:58:09Z

Another interesting format can be toml https://toml.io/en/

I'm playing with a java parser (https://github.com/mwanji/toml4j) and it's very interesting because it can map configuration files against Java classes instead simple maps

an (invented) configuration could be:

includeConfig = [  'b.toml', 'http://localhost/remote.toml' ]
[aws]
        accessKey = '123'
[docker]
    enabled = false
    envWhitelist = ['a', 'b', "c"]
    memory = 210
            
[profiles]
    [profiles.aws]
        [profiles.aws.process]
            executor = 'sge'
            queue = 'long'
            memory = '10GB'
        [profiles.cloud.process]
            executor = 'sge'
            queue = 'long'
            memory = '1GB'
            data = [ ["delta", "phi"], [3.14] ]

ewels · 2022-05-10T14:10:03Z

Maybe I missed this in the thread already, but is there a reason that we can't use YAML? I feel like that's the format that most people are used to..

Equivalent YAML of the TOML above could be:

!include b.yml
!include "http://localhost/remote.toml"
aws:
    accessKey: 123
docker:
    enabled: false
    envWhitelist: ['a', 'b', "c"]
    memory: 210
            
profiles:
    aws:
        process:
            executor: sge
            queue: long
            memory: 10GB
    cloud:
        process:
            executor: sge
            queue: long
            memory: 1GB
            data:
                - ["delta", "phi"]
                - [3.14]

The !include bit is a bit non-standard, but similar things are used by a lot of other projects (eg. pyyaml-include).

YAML has the advantage that many Nextflow users are already using it, with the -params-file option. The only difference here would be to move it up a level, so that the YAML doesn't just go to params but also across other config scopes.

jorgeaguileraseqera · 2022-05-10T14:40:26Z

From my side, I'm only investigating other formats that can be interesting to consider

of course, YAML is probably the most famous format (but also has a lot of problems) and we have libraries that allow us to parse yml files

pditommaso · 2022-05-10T15:19:33Z

Biggest problem yaml does not have variables interpolation

MrMarkW · 2022-05-10T16:00:33Z

HIML - hierarchical yaml config with variable interpolation: https://github.com/adobe/himl#interpolation
Unfortunately, it is in python and I couldn't find a java flavor.

ewels · 2022-05-10T19:07:08Z

Yeah, I was thinking YAML anchors and aliases but that's full values and not partial strings.

I was just looking at a few nextflow.config files from nf-core pipelines and thinking it wasn't so bad, but the really scary ones are in conf/modules.config. For example, in nf-core/rnaseq conf/modules.config we have stuff like this:

def rseqc_modules = params.rseqc_modules ? params.rseqc_modules.split(',').collect{ it.trim().toLowerCase() } : []

process {
    publishDir = [
        path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
        mode: params.publish_dir_mode,
        saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
    ]

   // ...

    withName: 'SALMON_INDEX' {
        ext.args   = params.gencode ? '--gencode' : ''
        publishDir = [
            path: { "${params.outdir}/genome/index" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
            enabled: params.save_reference
        ]
    }
}

if (!(params.skip_fastqc || params.skip_qc)) {
    process {
        withName: '.*:FASTQC_UMITOOLS_TRIMGALORE:FASTQC' {
            ext.args   = '--quiet'
        }
    }
}

I can't imagine all of this surviving a new format, but the less that we have to rewrite the better..

drpatelh · 2022-05-10T19:13:41Z

Yep, some of that syntax can be improved due to recent additions to NF (e.g. #2700) but it is really a side-effect of truly modularising within and between pipelines.

I think it goes without saying that it would be nice to test any prototype config format with an nf-core pipeline because I'm sure we will be able to unravel some edge cases!

cjw85 · 2022-07-12T15:16:00Z

yaml, gets hairy when when it comes to anything non-trivial -- check out issues in gitlab CI support. toml is a nicer version of the same sort of thing.

emyr666 · 2023-11-23T12:01:52Z

Am interested in this line you have in the original post...

Have a well-defined structure and grammar

By this do you mean to have a formal grammar for the NextFlow DSL e.g. using BNF notation ?
This would be really useful.

bentsherman · 2024-01-16T18:54:52Z

Some of the problems described here (like with the nf-core configs) can be addressed in other ways like module config (#4510) and resourceLimits directive (#2911). So I think in this issue we can focus on parsing and variable evaluation.

I have developed a custom parser for Nextflow config files which defines a formal grammar that is much more strict:

basic assignments
config blocks
config includes
values can be any Groovy expression including closures
no function definitions, if statements, arbitrary Groovy code, etc

It would allow us to have much better control over the parsing and evaluation compared to the current system. Could also be used to develop IDE tooling for config files like linting and code completion.

The key question around whether to move to a new format or improve the current one is how to handle variable interpolation:

declarative formats like YAML can be easily parsed by external tools. however, they will need to wrap closures and other dynamic expressions in a string, which is a poor developer experience
Nextflow config supports Groovy syntax and the developer experience can be improved greatly with a custom parser. however, it cannot be parsed easily by external tools

Would it be unreasonable to support both Nextflow config and YAML/TOML? Then we could say "if you need to use dynamic expressions, use Nextflow config, otherwise YAML is fine". We could still support config inclusion and dynamic strings in YAML, but we wouldn't have to come up with some kludgy solution to support Groovy closures in YAML. And with the custom parser, we could bring the developer experience for Nextflow configs up to par with YAML.

cjw85 · 2024-01-16T19:47:35Z

You're dangerously close to making GroovyMake with Groovy code embedded in YAML 🤣 .

Your choice is a conundrum, on the one hand it represents a loss of functionality to ditch the current system in favour of something like TOML that could be read by external systems. (An aside, if you allow groovy expressions as strings in those the external tools would need to know how to interpret groovy). On the other, its very powerful for external systems to be able to read the configuration files and be able to perform tasks such as parameter validation.

The central issue here is that the current config files are overloaded in their duty. The can contain both data and code; what's needed I feel is a better separation of those. The nf-core idea of parameter schemas was a step in this direction, however they are not tightly integrated into Nextflow and don't replace the native config files; which leads to all manner of tedious issues.

So perhaps you want a system whereby best practice is to reserve the current nexflow config files for dynamic configuration, and data resides in a document freely parseable by any external tool. You could likely get a long way to this goal by writing a joint parser for nextflow config files and nf-core schemas that populates the former with data from the later.

bentsherman · 2024-01-16T20:00:56Z

I don't think the parameter schema should be part of the config file. It should be in the top-level workflow definition as inputs, or in a separate file like the nf-core parameter schema. Config files should be able to reference params and even set default values, but they should not be the source of truth for params.

So now I'm wondering, if you take away that example, what other use case is there for loading a Nextflow config file in an external tool? If there aren't any others, then maybe external tooling isn't important and we can safely stick to the current config format.

ewels · 2024-01-17T07:32:14Z

The problem with keeping the parameter schema separate from the config pipeline defaults is that the two need to be kept in sync. It's irritating to need to add the same thing in two places as a pipeline developer.

There are two things that need to be loaded by external tools - the schema (to build launch tools / render documentation) and the config (to pre-populate forms and validate inputs before running Nextflow).

cjw85 · 2024-01-17T10:36:32Z

I can't tell who's violently agreeing with who.

I think I'm violently agreeing with @ewels: use something generic like the nf-core schema for loading and understanding static data that may ultimately flow from the user. Use something like the currently nextflow config to provide flexibility to developers and power users for augmenting the runtime environment. This would mean narrowing the scope of what the current nextflow config can do (e.g. it won't contain a params { } stanza`).

bentsherman · 2024-01-17T14:59:59Z

@ewels when you say that external tools need to load the schema and the config, by "config" do you mean just the params block?

I'm thinking that config files should still be allowed to set param defaults for backwards compatibility (and because it is useful to do in profiles), but users should be encouraged to specify default values in the schema file so that there is one source of truth as you say. As Chris said, users should avoid having a params block in their config file.

I think we are all in violent agreement

ewels · 2024-01-17T16:45:52Z

Yes, I think so 👍🏻 We already encourage people to use -params-yaml, so I guess the missing piece would be to get Nextflow to parse and use nextflow_schema.json? (or equivalent). Then the params could be removed from nextflow.config and the pipeline developer would only need to care about the schema file.

cjw85 · 2024-01-17T16:52:05Z

...and ~~remove~~ override nextflow's dodgy type casting in favour of whats declared in the schema

bentsherman · 2024-01-17T17:10:20Z

This is getting away from the config file, but what do you think about putting the params schema in the pipeline code? The top-level anonymous workflow could have an "input" section that defines inputs with type, default value, etc. This way, the language server could ensure that the params are used correctly in the pipeline code. Nextflow could export this definition to YAML/JSON for use with external tools

cjw85 · 2024-01-17T17:29:24Z

This is naturally not a single source of truth: its just a way of generating a secondary file which could become out of sync with the primary document, just like the current situation.

bentsherman · 2024-01-17T19:28:33Z

Let's continue the discussion of the params schema here: #4669

In any case, I think we agree that the params definition should be separate from the config file. That tells me that it would be fine to keep the current config format and implement a custom parser to enforce a simpler syntax without the Groovy quirks.

pditommaso mentioned this issue Mar 21, 2022

PublishDir allows closures for mode, enabled, overwrite fields #2432

Closed

pditommaso mentioned this issue Mar 24, 2022

Rename pod directive to podOptions #2747

Closed

bentsherman added the lang/config label Aug 9, 2022

bentsherman mentioned this issue Aug 29, 2022

process directive not always applying to run #3154

Closed

This comment was marked as outdated.

Sign in to view

stale bot added the stale label Jan 16, 2023

ewels added pinned and removed stale labels Jan 16, 2023

abhi18av mentioned this issue Jun 20, 2023

Additional manifest config fields #4034

Closed

ewels mentioned this issue Sep 14, 2023

Native NF validation via JSON parameter schema #2434

Closed

bentsherman mentioned this issue Oct 6, 2023

Evolution of Nextflow command line interface #3595

Open

bentsherman mentioned this issue Jan 17, 2024

Workflow inputs definition and schema #4669

Open

bentsherman linked a pull request Feb 14, 2024 that will close this issue

Config parser (and loader) #4744

Draft

4 tasks

stevekm mentioned this issue Mar 7, 2024

Workflow output definitions and schema #4670

Closed

bentsherman mentioned this issue Apr 11, 2024

Azure: Params File path cannot be a blob container 'az://' #4904

Open

pditommaso mentioned this issue Oct 18, 2024

Accept config from YAML/JSON #5398

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evolution of Nextflow configuration file #2723

Evolution of Nextflow configuration file #2723

pditommaso commented Mar 14, 2022 •

edited

Loading

jorgeaguileraseqera commented Mar 23, 2022

pditommaso commented Mar 24, 2022

ewels commented Apr 27, 2022

ewels commented Apr 27, 2022

pditommaso commented Apr 27, 2022

ewels commented Apr 27, 2022

mbwhelan commented Apr 27, 2022

jorgeaguileraseqera commented May 10, 2022 •

edited by ewels

Loading

ewels commented May 10, 2022

jorgeaguileraseqera commented May 10, 2022

pditommaso commented May 10, 2022

MrMarkW commented May 10, 2022

ewels commented May 10, 2022

drpatelh commented May 10, 2022

cjw85 commented Jul 12, 2022

This comment was marked as outdated.

emyr666 commented Nov 23, 2023

bentsherman commented Jan 16, 2024

cjw85 commented Jan 16, 2024

bentsherman commented Jan 16, 2024

ewels commented Jan 17, 2024 •

edited

Loading

cjw85 commented Jan 17, 2024

bentsherman commented Jan 17, 2024

ewels commented Jan 17, 2024

cjw85 commented Jan 17, 2024 •

edited

Loading

bentsherman commented Jan 17, 2024

cjw85 commented Jan 17, 2024

bentsherman commented Jan 17, 2024

Evolution of Nextflow configuration file #2723

Evolution of Nextflow configuration file #2723

Comments

pditommaso commented Mar 14, 2022 • edited Loading

Rationale

Goals

Non-goals

Action

jorgeaguileraseqera commented Mar 23, 2022

pditommaso commented Mar 24, 2022

ewels commented Apr 27, 2022

ewels commented Apr 27, 2022

pditommaso commented Apr 27, 2022

ewels commented Apr 27, 2022

mbwhelan commented Apr 27, 2022

jorgeaguileraseqera commented May 10, 2022 • edited by ewels Loading

ewels commented May 10, 2022

jorgeaguileraseqera commented May 10, 2022

pditommaso commented May 10, 2022

MrMarkW commented May 10, 2022

ewels commented May 10, 2022

drpatelh commented May 10, 2022

cjw85 commented Jul 12, 2022

This comment was marked as outdated.

emyr666 commented Nov 23, 2023

bentsherman commented Jan 16, 2024

cjw85 commented Jan 16, 2024

bentsherman commented Jan 16, 2024

ewels commented Jan 17, 2024 • edited Loading

cjw85 commented Jan 17, 2024

bentsherman commented Jan 17, 2024

ewels commented Jan 17, 2024

cjw85 commented Jan 17, 2024 • edited Loading

bentsherman commented Jan 17, 2024

cjw85 commented Jan 17, 2024

bentsherman commented Jan 17, 2024

pditommaso commented Mar 14, 2022 •

edited

Loading

jorgeaguileraseqera commented May 10, 2022 •

edited by ewels

Loading

ewels commented Jan 17, 2024 •

edited

Loading

cjw85 commented Jan 17, 2024 •

edited

Loading