Recipes of recipes, and turtles all the way down #698

o-smirnov · 2021-02-10T06:19:37Z

o-smirnov
Feb 10, 2021
Maintainer

@SpheMakh something you said at the Caracal meeting (recipe of recipes), plus the discussion on how to break up the selfcal worker there, got me thinking. You may have thought of some or all of this already, but maybe we should formalize a little. So here's a brain dump.

Desiderata

A recipe should be a structured config object (so it can be saved/loaded as YAML, and generally passed around, e.g. Saving recipe information for a future run #686)
Maximum correctness checking up front. So making it a structured config object in the OmegaConf sense works well. That way there's a schema based on Python @dataclasses behind it, and checking is done at construction.
Recipes should be nestable
Inputs/outputs should be well-defined, for things like Merge usecwl branch #638 to work
You pass the recipe object to a runner (e.g. the local Docker runner, but in the future e.g. a CWL scheduler), and off it goes.

Basic types

I'm going to try to give unambiguous definitions for things; in most cases it should be clear how to express them in terms of Python's typing mechanism.

A Parameter has a type, an iomode (input/output/mixed), and a default. Parameters with no default are, by definition, mandatory.

A Parset is a (possibly nested) set of named parameters, i.e. Mapping[str, Union[Parameter, Mapping[str, Union[...]]]]. Why nested? See recipe parameters below. And anyway, cabs with large numbers of parameters could be made clearer by structuring the parameters into related groups.

A Job defines a parset, and a payload. Examples of jobs are: Cab, Recipe, App (#689), maybe a chunk of arbitrary callable Python. To exec a job, one must supply values for (at least) the mandatory parameters.
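To make the definitions above a bit more concrete, here is a minimal sketch in plain Python typing. Everything in it is an assumption for illustration: the names IOMode, MISSING and missing_mandatory are made up, and the real thing would presumably be an OmegaConf structured config rather than bare dataclasses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, List, Mapping, Union


class IOMode(Enum):
    INPUT = "input"
    OUTPUT = "output"
    MIXED = "mixed"


# Hypothetical sentinel meaning "no default", so None stays usable as a real default.
MISSING = object()


@dataclass
class Parameter:
    dtype: type
    iomode: IOMode = IOMode.INPUT
    default: Any = MISSING

    @property
    def mandatory(self) -> bool:
        # Parameters with no default are, by definition, mandatory.
        return self.default is MISSING


# A parset is a (possibly nested) mapping of names to parameters.
Parset = Mapping[str, Union[Parameter, "Parset"]]


@dataclass
class Job:
    parset: Dict[str, Any]
    payload: Callable[..., Any]


def missing_mandatory(parset, values, prefix="") -> List[str]:
    """List dotted names of mandatory parameters that have no value supplied."""
    missing = []
    for name, item in parset.items():
        if isinstance(item, Parameter):
            if item.mandatory and name not in values:
                missing.append(prefix + name)
        else:
            # Nested group of parameters: recurse with a dotted prefix.
            missing += missing_mandatory(item, values.get(name, {}), prefix + name + ".")
    return missing


# Notional job with one nested parameter group.
calibrate = Job(
    parset={
        "ms": Parameter(str, IOMode.MIXED),
        "solint": Parameter(int),
        "advanced": {"tol": Parameter(float, default=1e-6), "weights": Parameter(str)},
    },
    payload=lambda **kw: kw,
)
print(missing_mandatory(calibrate.parset, {"ms": "demo.ms"}))
```

A runner could call something like missing_mandatory before executing a job and refuse to start if the list is non-empty.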

Recipe

A Recipe is a sequence of named steps (OrderedDict[str, Step]).
A Step is an invocation of a job with a set of (possibly incomplete) parameter assignments

Recipe parameters

So what constitutes a recipe's parset? I think we need a default policy that makes something sensible from the contents of the recipe, plus a way to override this policy if so desired.

Suggested default policy: a recipe's parset consists of (A) all the input/mixed parameters of the first step, plus (B) all the output/mixed parameters of the final step, plus (C) any mandatory parameters not defined for the intermediate steps, nested under their step names.

So as a purely notional example, let's say you have a Transform job (mixed parameter ms, input parameter field, optional parameter selection=''), an Average job (ms, optional input parameters timeavg=1 and chanavg=1), and a Calibrate job (mixed parameter ms, output parameter caltable, mandatory input solint, bunch of optional input parameters), and you string the recipe together like so:

transform:
  job: Transform
  field: 0
average:
  job: Average
  ms: {..transform.ms}
  chanavg: !EXPOSE   # exposes parameter at recipe level
cal:
  job: Calibrate
  ms: {..average.ms}

...then your recipe's "public" parset looks like this by default:

ms:                    # unassigned input/mixed parameter from first step gets promoted to top level
field:                 # ditto
selection: ""
cal:
  solint:              # mandatory parameter from intermediate step, hence nested as ``cal.solint``
average:
  chanavg: 1           # optional parameter from intermediate step, but exposed explicitly
caltable:              # unassigned output parameter from final step gets promoted to top level

If the recipe designer does not like this default parset, they can define an explicit parameters section in their recipe, which explicitly defines the parameters of the recipe (in terms of what steps' parameters to map them onto). Stimela will then need to check when the recipe is constructed, to make sure that every mandatory parameter of every step is mapped.
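The default policy described above can be sketched as a single pass over the steps. This is only one possible reading of rules (A) to (C): Param, MISSING, EXPOSE and default_parset are hypothetical names, and in this sketch the transform step leaves field unassigned so that it gets promoted alongside ms.

```python
from dataclasses import dataclass
from typing import Any

MISSING = object()  # hypothetical marker for "no default", i.e. mandatory
EXPOSE = object()   # stands in for the notional !EXPOSE tag


@dataclass
class Param:
    io: str                 # "input", "output" or "mixed"
    default: Any = MISSING


def default_parset(jobs, recipe):
    """Derive a recipe's public parset from its steps (values are defaults, None if mandatory)."""
    names = list(recipe)
    public = {}
    for i, name in enumerate(names):
        step = recipe[name]
        for pname, param in jobs[step["job"]].items():
            value = step.get(pname, MISSING)
            default = None if param.default is MISSING else param.default
            if value is EXPOSE:
                # explicitly exposed: nested under the step name
                public.setdefault(name, {})[pname] = default
            elif value is not MISSING:
                continue  # assigned inside the recipe, so not public
            elif i == 0 and param.io in ("input", "mixed"):
                public[pname] = default  # (A) promoted from the first step
            elif i == len(names) - 1 and param.io in ("output", "mixed"):
                public[pname] = default  # (B) promoted from the final step
            elif param.default is MISSING:
                # (C) unassigned mandatory parameter, nested under the step name
                public.setdefault(name, {})[pname] = default
    return public


JOBS = {
    "Transform": {"ms": Param("mixed"), "field": Param("input"), "selection": Param("input", "")},
    "Average": {"ms": Param("mixed"), "timeavg": Param("input", 1), "chanavg": Param("input", 1)},
    "Calibrate": {"ms": Param("mixed"), "caltable": Param("output"), "solint": Param("input")},
}

RECIPE = {
    "transform": {"job": "Transform"},
    "average": {"job": "Average", "ms": "{..transform.ms}", "chanavg": EXPOSE},
    "cal": {"job": "Calibrate", "ms": "{..average.ms}"},
}

parset = default_parset(JOBS, RECIPE)
print(parset)
```

An explicit parameters section would simply replace the result of default_parset, with a check afterwards that every mandatory step parameter is covered.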

Cross-referencing parameters

There was already an example of this above ({..average.ms}), which is a standard OmegaConf construct.

We can think of extending this with something like {!previous.ms} (to refer to a parameter of the preceding step), or {!recipe.ms} (to refer to a top-level parameter of the recipe) to make recipes more composable.

Furthermore, I suggest the following rule: if a mandatory output/mixed parameter of a step is missing, but a subsequent step makes a cross-reference to this parameter, then Stimela should generate a temporary filename for it. (Thus, we can automatically plumb intermediate steps together without needing to name the intermediate file products explicitly).
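A rough sketch of that temporary-filename rule, under several assumptions: references are written as {..step.param} strings, the step and parameter names (image, restored, evaluate, report) are invented for the example, and plumb_temporaries is a hypothetical helper, not anything in Stimela.

```python
import os
import re
import tempfile
import uuid

# Matches notional references of the form {..step.param}.
REF = re.compile(r"\{\.\.(\w+)\.(\w+)\}")


def plumb_temporaries(recipe, mandatory_outputs):
    """Fill in temp filenames for unset mandatory outputs that a later step refers to.

    recipe: ordered mapping of step name -> parameter assignments
    mandatory_outputs: step name -> set of mandatory output/mixed parameter names
    """
    seen = set()
    for name, params in recipe.items():
        for value in params.values():
            if not isinstance(value, str):
                continue
            for ref_step, ref_param in REF.findall(value):
                if (
                    ref_step in seen  # only references to earlier steps count
                    and ref_param in mandatory_outputs.get(ref_step, set())
                    and ref_param not in recipe[ref_step]
                ):
                    filename = f"stimela-{ref_step}-{ref_param}-{uuid.uuid4().hex}"
                    recipe[ref_step][ref_param] = os.path.join(tempfile.gettempdir(), filename)
        seen.add(name)
    return recipe


recipe = {
    "image": {"ms": "demo.ms"},                    # 'restored' left unset
    "evaluate": {"image": "{..image.restored}"},   # ...but referenced here
}
mandatory_outputs = {"image": {"restored"}, "evaluate": {"report"}}

plumbed = plumb_temporaries(recipe, mandatory_outputs)
print(plumbed["image"]["restored"])
```

Note that evaluate.report stays unset in this sketch: nothing downstream refers to it, so it remains a mandatory parameter for the user to supply.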

Enabling and conditionals

Every step should recognize an enable field (true if missing), so that it can be easily disabled with enable: false.

From this, it is but a hop and a jump to supporting conditionals: enable can cross-reference a parameter of another step, thus making this step conditional on the result of a preceding step. (We might eventually need to think about a richer syntax than a {}-substitution, but this is already pretty powerful).
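As a sketch of how a runner might honour enable, assuming a cross-reference is written as a "{step.param}" string and that each step hands back a dict of results (both are assumptions, as are the names run_steps and demo_runner):

```python
def run_steps(steps, runner):
    """Run steps in order, skipping any whose 'enable' evaluates to false."""
    results = {}
    executed = []
    for name, step in steps.items():
        enable = step.get("enable", True)  # a missing field means enabled
        if isinstance(enable, str):
            # Notional cross-reference "{step.param}" into an earlier step's results.
            ref_step, ref_param = enable.strip("{}").split(".")
            enable = bool(results.get(ref_step, {}).get(ref_param, False))
        if not enable:
            continue
        results[name] = runner(name, step)
        executed.append(name)
    return executed


def demo_runner(name, step):
    # Stand-in for real execution: only the 'check' step produces a result.
    return {"needs_flagging": True} if name == "check" else {}


steps = {
    "check": {},
    "flag": {"enable": "{check.needs_flagging}"},  # conditional on a previous result
    "plot": {"enable": False},                     # switched off explicitly
}
executed = run_steps(steps, demo_runner)
print(executed)
```

A reference to a step that was itself skipped evaluates to false here, which seems like the least surprising behaviour, but that is a design choice.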

Looping!

A germ of an idea, but from this, it is very easy to make a recipe that is defined as a loop or an iterator. Add something like this to the recipe conf:

_for:
    x: [1,2,3]
    y: [a,b,c]

...which then tells Stimela to repeat the recipe three times, with x=1, y=a, then with x=2, y=b, then with x=3, y=c. The values can be cross-referenced from steps as something like {!recipe.x}. So in the body of the recipe, there shouldn't really be any distinction between referencing a parameter, or referencing a loop variable.

You can also think of adding loop conditionals, i.e. _while: condition and _until: condition, where "condition" is as for enable above.
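The _for semantics described above (values paired up position by position, not a nested product) boil down to a zip. A minimal sketch, with iterate_for being a hypothetical helper name:

```python
def iterate_for(for_section):
    """Yield one dict of loop-variable assignments per iteration of the recipe."""
    lengths = {len(values) for values in for_section.values()}
    if len(lengths) > 1:
        raise ValueError("all _for lists must have the same length")
    names = list(for_section)
    # zip pairs up the i-th value of every list: (1, 'a'), (2, 'b'), ...
    for values in zip(*for_section.values()):
        yield dict(zip(names, values))


iterations = list(iterate_for({"x": [1, 2, 3], "y": ["a", "b", "c"]}))
print(iterations)
```

Each yielded dict can then be merged into the recipe-level namespace, so that steps see loop variables exactly like ordinary recipe parameters.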

o-smirnov · 2021-02-10T06:32:43Z

o-smirnov
Feb 10, 2021
Maintainer Author

One design decision to make here: should steps and parameter values be specified at the top level of a (recipe or step) definition, or tucked away in steps and params subsections?

The latter case is more wordy (including in code... i.e. you'd be saying things like recipe.steps.step1.params.foo), but ensures that parameters don't clash with reserved keys:

my_recipe:
  job: Recipe
  for:
    x: 1,2,3
  steps:
    step1:
      job: Average
      params:
        foo: 0
    step2:
      job: Recipe
      name: previously_defined_recipe
      params:
        bar: {!recipe.x}

The former case will make for more compact definitions (and in code, recipe.step1.foo), but we'll probably need to use something like _ to distinguish reserved keys:

my_recipe:
  _job: Recipe
  _for:
    x: 1,2,3
  step1:
    _job: Average
    foo: bar
  step2:
    _job: Recipe
    _name: previously_defined_recipe
    x: {!recipe.x}

8 replies

o-smirnov Feb 12, 2021
Maintainer Author

Can we not specify that via the default flag? I.e. default: none for a non-mandatory parameter with no default value.

o-smirnov Feb 12, 2021
Maintainer Author

We have steps.cab for a cab, should we allow this to also take a recipe, or add a steps.recipe key for recipes?

Well, I think each step's specification should unambiguously identify it as being a cab or a recipe. Two alternatives for that: (a) either we have a type field which says cab or recipe, or (b) we look for the presence of a cab field or a recipe field to identify the step type. The rest of the specification needs to be unified (so you can cut-and-paste a recipe directly into a step of another recipe).

This is getting a bit too advanced for just handwaving, because we need to make sure this will play nicely with Python typing and the whole StructuredConfig thing. How about we make a demo directory on the branch, and check in some snippets of code? Each demo should be a self-contained Python file which declares the @dataclasses in use, then loads a corresponding structured config from a YAML string. I think getting our hands dirty can focus the discussion greatly.

(I'm in APR land today, but will have time to play with code tomorrow, will check in some demos then.)

SpheMakh Feb 12, 2021
Maintainer

A boolean steps.<step name>.recipe flag is the cleanest way. Then steps.<step name>.cab should be a path to a recipe (or even a dictionary).

o-smirnov Feb 12, 2021
Maintainer Author

If the cab field can be a recipe dictionary, then continuing to call it "cab" gets a little confusing under current terminology. So either:

we officially redefine "cab" to mean anything runnable, i.e. a recipe is also a cab, or a container image instance is a cab. Such a change in terminology makes sense within the whole train metaphor, but then we need a new term for the more specific thing we refer to as cabs now.
we can call it steps.<step name>.cargo? Then a string cargo is taken to be a cab name (in the current meaning of "cab"), while a dict cargo is taken to be a recipe.

SpheMakh Feb 12, 2021
Maintainer

Yeah, the terminology is a bit loose, it all depends on which side of the bed I wake up on. I like the second option. Also note that a Python callable can also be a cab, so we have three cases (string, dict, callable).
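For what it's worth, telling the three cargo cases apart is cheap. A sketch, where classify_cargo and the returned labels are made up for illustration:

```python
from collections.abc import Mapping


def classify_cargo(cargo):
    """Decide what kind of thing a step's cargo is, based purely on its Python type."""
    if isinstance(cargo, str):
        return "cab"        # a string names a cab (in the current meaning of "cab")
    if isinstance(cargo, Mapping):
        return "recipe"     # a dict is an inline recipe definition
    if callable(cargo):
        return "callable"   # arbitrary Python function
    raise TypeError(f"unsupported cargo type: {type(cargo).__name__}")


print(classify_cargo("wsclean"))
print(classify_cargo({"steps": {}}))
print(classify_cargo(len))
```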

SpheMakh · 2021-02-12T10:10:53Z

SpheMakh
Feb 12, 2021
Maintainer

Dealing with recipe outputs

A recipe should have an outputs section which specifies which step's outputs should be inherited. In the above example (https://github.com//discussions/698#discussioncomment-360728), the outputs section could look like this:

.
.
.
outputs:
  makems: all # all outputs from this cab
  wsclean: # Another option would be to select some of the outputs
     -  "{*-MFS-image.fits}"
     -  "{*-MFS-model.fits}"

5 replies

o-smirnov Feb 12, 2021
Maintainer Author

Yep, up above where I said

Suggested default policy: a recipe's parset consists of (A) all the input/mixed parameters of the first step, plus (B) all the output/mixed parameters of the final step, plus (C) any mandatory parameters not defined for the intermediate steps, nested under their step names.

...that's pretty much what I had in mind, except I was also thinking automation. If the recipe doesn't specify outputs, the default policy sets them to be the outputs of the final step. Otherwise, you specify recipe outputs, as you suggest.

o-smirnov Feb 12, 2021
Maintainer Author

Also, a cab definition doesn't have separate inputs and outputs sections, right? It has parameters of type input/output/mixed. The set of inputs can be derived implicitly (input+mixed), also the set of outputs (output+mixed). Should a recipe's inputs and outputs not be defined in exactly the same way?

SpheMakh Feb 12, 2021
Maintainer

Dealing with inputs and outputs is a bit of a conundrum for me. I may be complicating things though (as I often do). So the case that I have in mind is wsclean:
It takes a prefix (-name) as a string, then outputs a bunch of image files (<prefix>*.fits). One way to deal with this is having an outputs section defined as

outputs:
  image: "{inputs.name}*-image.fits" 
  model: "{inputs.name}*-model.fits"

or

parameters:
  name:
    type: str
    io: output
    collect:
      image: "{self.name}*-image.fits" 
      model: "{self.name}*-model.fits"

I'm OK with both, which do you prefer if any of these?

o-smirnov Feb 12, 2021
Maintainer Author

Well, firstly this kind of output is a file not a string, so how about type: file (with Stimela then knowing that this is actually a string, specifying an output file.)

Secondly, in the wsclean case, this is an implicit output, in the sense that we don't get to specify the filename (unlike, say, any other package where you have full control over the filename). Stimela needs to know this somehow.

Thirdly, you have the awkward case here where the output is a collection of files, and you don't know in advance how many (though you can write a glob describing them). This is also a distinct output type I feel, type: files?

Something along the lines of

parameters:
  name:
    type: str
    io: input
    info: "base filename for output files"
  mfs_image:
    type: file
    io: output
    implicit: "{name}-MFS-image.fits"
  images:
    type: files
    io: output
    implicit: "{name}-*-image.fits"

This is not a complete picture because you can also have MFS-Q-image, etc. But something along these lines...
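To illustrate how implicit outputs like these could be resolved after a run, here is a sketch that fills in the {name} placeholder and globs the working directory. The collect_implicit helper and the file names are assumptions for the example; the type/io/implicit keys follow the schema fragment above.

```python
import glob
import os
import tempfile


def collect_implicit(schema, inputs, workdir):
    """Resolve 'implicit' output patterns against the files actually present in workdir."""
    outputs = {}
    for pname, spec in schema.items():
        if "implicit" not in spec:
            continue
        pattern = os.path.join(glob.escape(workdir), spec["implicit"].format(**inputs))
        matches = sorted(glob.glob(pattern))
        if spec["type"] == "files":
            outputs[pname] = matches                         # a collection of unknown size
        else:
            outputs[pname] = matches[0] if matches else None  # a single file
    return outputs


schema = {
    "name": {"type": "str", "io": "input"},
    "mfs_image": {"type": "file", "io": "output", "implicit": "{name}-MFS-image.fits"},
    "images": {"type": "files", "io": "output", "implicit": "{name}-*-image.fits"},
}

with tempfile.TemporaryDirectory() as workdir:
    # Fake the products an imaging run might leave behind.
    for fname in ("img-MFS-image.fits", "img-0000-image.fits", "img-0001-image.fits"):
        open(os.path.join(workdir, fname), "w").close()
    found = collect_implicit(schema, {"name": "img"}, workdir)
    mfs_name = os.path.basename(found["mfs_image"])
    image_names = [os.path.basename(path) for path in found["images"]]

print(mfs_name, image_names)
```

Note that the images glob also picks up the MFS image, since * happily matches "MFS". A real schema would need a tighter pattern if that is not wanted.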

o-smirnov Feb 12, 2021
Maintainer Author

Speaking of wsclean and polarization: maybe this can be handled by providing two different cab definitions, one for polarized, one for I-only imaging. Reason being, if you enable Stokes QUV, the structure of its file outputs changes substantially (they get a Stokes label in them). So you may as well treat it as a separate program with a different output structure!

o-smirnov · 2021-02-13T11:47:49Z

o-smirnov
Feb 13, 2021
Maintainer Author

OK I checked in a little mock-up with my take on the above discussions. This defines a schema and loads a sample config with a nested sub-recipe: https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/tests/proto_config_oms.py. Just run python proto_config_oms.py and admire the output.

Writing this, I realized that OmegaConf's substitutions are a bit too simple for this purpose. But we can use standard Python {} substitutions on parameter values to achieve the desired result.

info: 'top level recipe definition'
vars:
    ms: demo.ms
dirs:
    input: input
    output: output
steps: 
    make_ms:  # this step uses a cab
        cab: simms
        inputs:
            msname: "{recipe.vars.ms}"
            telescope: kat-7
            dtime: 1
            synthesis: 0.128
    selfcal: # this step is a nested recipe
        inputs:
            ms: "{recipe.vars.ms}"      # 'recipe' here refers to parent recipe
        outputs:
            image: final-image.fits     # overrides output filename
        recipe:
            info: "this is a generic selfcal loop"
            vars:
                scale: 30asec
                size: 256 
            _for:
                selfcal_loop: 1,2,3     # repeat three times
            steps:
                calibrate: 
                    cab: cubical
                    inputs:
                        ms: "{recipe.inputs.ms}"
                    _skip: "recipe.vars.selfcal_loop < 2"    # skip on first iteration, go straight to image
                imager:
                    cab: wsclean
                    inputs:
                        msname: "{recipe.inputs.ms}"
                        name: "image-{recipe.vars.selfcal_loop}"
                        scale: "{recipe.vars.scale}"
                        size: "{recipe.vars.size}"
                evaluate:
                    cab: aimfast
                    inputs:
                        image: "{prev.outputs.residual_image}"  # 'prev' refers to preceding step
                    _break_on: "step.outputs.dr_achieved"    # break out of recipe based on some output value. 'step' refers to this step
            # the below formally specifies the inputs and outputs of the selfcal recipe
            parameters:
                ms: 
                    type: ms
                    io: both
                    default: null
                # maps onto the output of the wsclean step
                image:
                    maps: imager.outputs.image
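The "standard Python {} substitutions" mentioned above work because str.format supports attribute lookup inside the braces. A minimal sketch (to_namespace and substitute are hypothetical helper names; the mock-up itself may do this differently):

```python
from types import SimpleNamespace


def to_namespace(obj):
    """Recursively turn nested dicts into namespaces so '{a.b.c}' lookups work."""
    if isinstance(obj, dict):
        return SimpleNamespace(**{key: to_namespace(value) for key, value in obj.items()})
    return obj


def substitute(value, **context):
    """Apply {}-substitution to string values; anything else passes through untouched."""
    if isinstance(value, str):
        return value.format(**{key: to_namespace(ctx) for key, ctx in context.items()})
    return value


recipe = {
    "vars": {"ms": "demo.ms", "selfcal_loop": 2, "size": 256},
    "inputs": {"ms": "demo.ms"},
}

msname = substitute("{recipe.vars.ms}", recipe=recipe)
image_name = substitute("image-{recipe.vars.selfcal_loop}", recipe=recipe)
size = substitute("{recipe.vars.size}", recipe=recipe)
print(msname, image_name, repr(size))
```

One thing this makes visible: the result of a substitution is always a string (size comes back as "256", not 256), so values need converting back to the parameter's declared type afterwards.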

0 replies

o-smirnov · 2021-02-13T12:13:24Z

o-smirnov
Feb 13, 2021
Maintainer Author

Even cleaner would be to define size and scale as input parameters of the selfcal recipe (rather than vars), and use {recipe.inputs.size} and {recipe.inputs.scale} in the imager step.

0 replies

o-smirnov · 2021-02-13T14:10:09Z

o-smirnov
Feb 13, 2021
Maintainer Author

Also: https://github.com/ratt-ru/Stimela/blob/e2f9a5247bb6d686a18d1f8c38b9286ab5aedaec/stimela/config.py#L36

Should dtype not be an enum?

3 replies

SpheMakh Feb 14, 2021
Maintainer

Yes, it should be.

o-smirnov Feb 14, 2021
Maintainer Author

OK I'll leave it to you to define the enum, not sure what all the possible types are.

Please pull though, as I've made some cleanups in stimela exec.

o-smirnov Feb 14, 2021
Maintainer Author

@SpheMakh please pull again, I've simplified the exec code considerably. OmegaConf is even neater than you think. You don't need to use the "create schema -- create dict -- create conf from dict -- merge conf with schema" pattern in this instance, it's a roundabout way to do it. The simple way is to create a structured conf object, then populate it directly.

Also note that setattr(conf, key, value) is equivalent to conf[key] = value in OmegaConf, and I think the latter form is more readable.

SpheMakh · 2021-02-15T10:17:09Z

SpheMakh
Feb 15, 2021
Maintainer

What's your type

The types can be grouped into four:

Basic

int
float
str

I/O

Here I start with a capital letter because these are classes with attributes such as path, dirpath

File
Directory

List

list : A list with elements of any of the above types

Dict

dict : A dictionary with values that can be any of the above types, including a dict itself.

A type can be any of the above or a list of any combination thereof.

3 replies

SpheMakh Feb 15, 2021
Maintainer

We should probably also add a pattern that will have string and regex attributes

o-smirnov Feb 15, 2021
Maintainer Author

You mean an input parameter of type pattern? Why does it have separate string and regex attributes (example?)

SpheMakh Feb 15, 2021
Maintainer

Actually, no regex required. glob.glob() should be fine. So maybe pattern should just be a boolean attribute of File and Directory.

o-smirnov · 2021-04-18T09:43:53Z

o-smirnov
Apr 18, 2021
Maintainer Author

@SpheMakh check out the latest commit. Substitutions and cross-references now work more or less fully. Try

stimela -v exec stimela/tests/test_recipe.yml msname=tmp.ms selfcal_image_name=foo selfcal_image_size=256

Example is

                  evaluate:
                      cab: aimfast
                      params:
                          image: "{previous.restored}"
                          dirty: "{steps.image.dirty}"

previous refers to parameters of the previous step, steps.image refers to parameters of the named previous step (happens to be the same thing in this case, I just use both to demonstrate), recipe refers to parent recipe parameters. There is also self.foo to refer to own parameters.
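For illustration, here is one way the four namespaces (previous, steps, recipe, self) could be wired up with plain str.format. This is a sketch under assumptions, not the code in the commit: resolve_steps is a made-up name, and the prefix parameter of the image step is invented to show self in action.

```python
from types import SimpleNamespace


def resolve_steps(steps, recipe_params):
    """Resolve {}-substitutions step by step, in recipe order."""
    resolved = {}
    previous = None
    for name, params in steps.items():
        current = {}
        for key, value in params.items():
            if isinstance(value, str):
                context = {
                    "recipe": SimpleNamespace(**recipe_params),
                    # all steps resolved so far, addressable by name
                    "steps": SimpleNamespace(
                        **{n: SimpleNamespace(**p) for n, p in resolved.items()}
                    ),
                    # this step's parameters that were already resolved
                    "self": SimpleNamespace(**current),
                }
                if previous is not None:
                    context["previous"] = SimpleNamespace(**resolved[previous])
                value = value.format(**context)
            current[key] = value
        resolved[name] = current
        previous = name
    return resolved


steps = {
    "image": {
        "ms": "{recipe.msname}",
        "prefix": "out",
        "restored": "{self.prefix}-restored.fits",
        "dirty": "{self.prefix}-dirty.fits",
    },
    "evaluate": {
        "image": "{previous.restored}",
        "dirty": "{steps.image.dirty}",
    },
}

resolved = resolve_steps(steps, {"msname": "tmp.ms"})
print(resolved["evaluate"])
```

In this sketch self only sees parameters listed earlier in the same step, and a reference to a later step raises an AttributeError, which is arguably the right failure mode.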

I have gotten rid of the unsatisfying maps_to thingy, and instead have an explicit aliases section in the recipe. An alias is different from a substitution because it inherits the full schema of the aliased parameter (whereas a substitution is just a "dumb" string evaluation). However, I'd still like a way to override defaults. E.g. in the below, recipe.telescope is an alias for tel of the simms cab, which has no default in the cab definition, but I'd like for the recipe to provide a default.

  aliases:
    msname: selfcal.ms
    telescope: makems.tel
  defaults:
    telescope: kat-7

I've done it with a defaults section, but open to better suggestions, since this is not completely consistent with how cabs define defaults. Or, perhaps this is how cabs should define defaults too? That way all the default settings are visible in one place of the yml file, as opposed to scattered throughout.
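A sketch of how aliases and a defaults section could combine: the alias copies the full schema of the target parameter, and the recipe-level default is layered on top. The cab schemas below are invented, both steps are treated as plain cabs to keep it short, and build_recipe_schema is a hypothetical name.

```python
from copy import deepcopy


def build_recipe_schema(cab_schemas, steps, aliases, defaults):
    """Build the recipe-level schema from aliases, then overlay recipe defaults."""
    schema = {}
    for alias, target in aliases.items():
        step_name, param = target.split(".")
        cab = steps[step_name]["cab"]
        # The alias inherits the complete schema entry of the aliased parameter...
        entry = deepcopy(cab_schemas[cab][param])
        # ...and the recipe may add (or override) a default for it.
        if alias in defaults:
            entry["default"] = defaults[alias]
        schema[alias] = entry
    return schema


# Made-up schema fragments, for illustration only.
cab_schemas = {
    "simms": {"tel": {"dtype": "str", "info": "telescope name"}},
    "cubical": {"ms": {"dtype": "MS", "info": "measurement set to calibrate"}},
}
steps = {"makems": {"cab": "simms"}, "selfcal": {"cab": "cubical"}}

recipe_schema = build_recipe_schema(
    cab_schemas,
    steps,
    aliases={"msname": "selfcal.ms", "telescope": "makems.tel"},
    defaults={"telescope": "kat-7"},
)
print(recipe_schema)
```

With this layout, "not in defaults" is exactly what "unset" means: msname ends up with no default key at all, so it stays mandatory.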

3 replies

SpheMakh Apr 19, 2021
Maintainer

I'd still like a way to override defaults. E.g. in the below, recipe.telescope is an alias for tel of the simms cab, which has no default in the cab definition, but I'd like for the recipe to provide a default.

This is nice

SpheMakh Apr 19, 2021
Maintainer

I've done it with a defaults section, but open to better suggestions since this is not completely consistent with how cabs define defaults. Or, perhaps this is how cabs should define defaults too? That way all the default settings are visible in one place of the yml file, as opposed to scattered throughout.

This sounds good to me.

o-smirnov Apr 19, 2021
Maintainer Author

Yeah that idea is growing on me. I shall make it thus in my next commit.

This also neatly resolves our old issue of how to indicate a truly unset (missing) value. default=None was a little unsatisfying, because what if we actually want a legit value of None somewhere? This way it's all nice and clear -- if it's not in the defaults section, it's unset.

o-smirnov · 2021-04-22T08:20:34Z

o-smirnov
Apr 22, 2021
Maintainer Author

@SpheMakh, I think this is becoming ready for action. I volunteer to implement a simple command-line runner, and start testing this with my experimental selfcal workflows. I'll leave the container runners to you.

Getting on a flight now so maybe that's what I'll amuse myself with.

2 replies

o-smirnov Apr 25, 2021
Maintainer Author

@SpheMakh do another sync-up please. You will also need the configuratt branch of scabha. Points to consider:

I've moved some of the validation and parameter handling functionality to scabha. I did this because I realized that scabha's function inside a container (collecting arguments from a dictionary, putting together a command line, launching the binary, taking care of logfiles etc.) is almost exactly the same thing that the non-container runner in Stimela needs to be doing. So might as well give scabha this job full-time, and import it into Stimela. Container images still only need to install scabha.
I've moved default parameter values into a separate defaults section. The test recipe looks and works a helluva lot neater now.
I've cleaned up the shadems cab definition, removing unnecessary filler
There's now a separate policies section within each parameter schema which defines how a parameter is turned into a command-line argument (prefix, repeat_policy, etc.). There's now a format feature in there which can take care of tricky cases such as wsclean's image size argument. The cab itself also has a policies section, which defines the default policies that apply to all parameter schemas (but which individual parameter schemas can then override).
Let's discuss the alias thing in the schema. I think it's inside-out right now. The current wsclean.yaml says e.g.

  name:
    info: Prefix of output products
    dtype: str
    alias: prefix
    required: true

I think what you're actually trying to achieve here is for the cab to have a parameter called prefix, which is mapped to the command-line argument -name underneath, correct? If so, I think it should read:

  prefix:
    info: Prefix of output products
    dtype: str
    alias: name
    required: true
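To show how the alias and the policies discussed above might fit together when building a command line, here is a sketch. The semantics are guesses: alias is taken to be the underlying command-line name, prefix the option prefix, and format a str.format template whose result is split into separate arguments. build_command is a hypothetical helper, not scabha's API.

```python
def build_command(binary, schema, params, cab_policies):
    """Turn parameter values into an argv list using per-cab and per-parameter policies."""
    args = [binary]
    for name, value in params.items():
        spec = schema[name]
        # Parameter-level policies override the cab-level defaults.
        policies = {**cab_policies, **spec.get("policies", {})}
        # The cab parameter is exposed under 'name'; 'alias' is what the binary expects.
        option = policies.get("prefix", "--") + spec.get("alias", name)
        fmt = policies.get("format")
        if fmt is not None:
            # e.g. "{0} {0}" turns a single size into the two values wsclean wants
            args += [option] + fmt.format(value).split()
        else:
            args += [option, str(value)]
    return args


schema = {
    "prefix": {"info": "Prefix of output products", "dtype": "str", "alias": "name"},
    "size": {"dtype": "int", "policies": {"format": "{0} {0}"}},
}

argv = build_command(
    "wsclean",
    schema,
    params={"prefix": "image-1", "size": 256},
    cab_policies={"prefix": "-"},
)
print(argv)
```

Read this way, the second form of the schema (parameter prefix, alias name) is the one that makes the lookup natural: the key is what recipe authors type, the alias is what ends up on the command line.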

o-smirnov Apr 26, 2021
Maintainer Author

All right, check out the latest commit. I've implemented a "native" cab runner (can even do a per-cab virtualenv if needed) and this actually works (see tests/test_cubical_recipe.py). I'm now ready to throw this in anger at my Jupiter problem.

This is now ready for you to put the container running machinery into kitchen/runners.py, and once that's done, this beast is almost feature-complete.

o-smirnov · 2021-05-03T12:57:39Z

o-smirnov
May 3, 2021
Maintainer Author

So this actually works now. For-loops, and a recipe library:

lib:
  recipes:
    cubical_image:
      name: "cubical_image"
      info: 'does one step of cubical, followed by one step of imaging'
      dirs:
        log: logs
      aliases:
        ms: [calibrate.ms, image.ms]
      steps: 
        calibrate: 
            cab: cubical
        image:
            cab: myclean


recipe:
  name: "test loop"
  for_loop:
    var: ms
    over: ms_list
  aliases:
    ms: ['cubical_image.ms']
  inputs:
    ms_list:
      dtype: List[str]
      required: true
  defaults:
    ms_list: [a,b,c]
  steps:
    cubical_image:
      recipe:
        _use: lib.recipes.cubical_image
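A sketch of the two mechanisms in this config, with assumed semantics: _use is resolved by walking a dotted path from the config root and copying what it finds (sibling keys, if any, override the copy), and for_loop assigns each element of the over list to var in turn. The helper names are invented, and the config is a trimmed copy of the one above.

```python
from copy import deepcopy


def lookup(config, dotted):
    """Follow a dotted path such as 'lib.recipes.cubical_image' through nested dicts."""
    node = config
    for key in dotted.split("."):
        node = node[key]
    return node


def expand_use(config, section):
    """Recursively replace {'_use': path} sections with a copy of the referenced section."""
    if isinstance(section, dict):
        if "_use" in section:
            base = deepcopy(lookup(config, section["_use"]))
            # Assumption: keys given next to _use override the library definition.
            base.update({k: v for k, v in section.items() if k != "_use"})
            return expand_use(config, base)
        return {key: expand_use(config, value) for key, value in section.items()}
    return section


def loop_assignments(recipe, params):
    """One parameter dict per iteration, with the loop variable set from the 'over' list."""
    loop = recipe["for_loop"]
    return [{**params, loop["var"]: value} for value in params[loop["over"]]]


config = {
    "lib": {"recipes": {"cubical_image": {
        "name": "cubical_image",
        "aliases": {"ms": ["calibrate.ms", "image.ms"]},
        "steps": {"calibrate": {"cab": "cubical"}, "image": {"cab": "myclean"}},
    }}},
    "recipe": {
        "name": "test loop",
        "for_loop": {"var": "ms", "over": "ms_list"},
        "aliases": {"ms": ["cubical_image.ms"]},
        "steps": {"cubical_image": {"recipe": {"_use": "lib.recipes.cubical_image"}}},
    },
}

expanded = expand_use(config, config["recipe"])
iterations = loop_assignments(config["recipe"], {"ms_list": ["a", "b", "c"]})
print(expanded["steps"]["cubical_image"]["recipe"]["steps"])
print([it["ms"] for it in iterations])
```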

0 replies

Replies: 9 comments · 24 replies
