Replies: 2 comments
-
It seems to me that the parameterise option is the correct one.
Given there's no way of knowing for sure whether the outputs of a script would change if one of its parameters were different, I guess this exposes a limitation of our "don't rerun / always rerun" logic as currently implemented. @evansd and I had a discussion in about April 2020 about how the idea of a "workspace" should have copy and paste semantics available to a researcher, so you could say "treat this file as the output of that run". I think this might be the best solution longer term. The reproducibility angle should then be addressed by a "clean run" feature, which we expect all studies to do prior to publication, so we can assert that all final outputs have been generated from a known state.
-
This PR (opensafely/comparative-ve-research#18) demonstrates a quick way to introduce a new argument to an existing script/action to run a sensitivity analysis. There's no need to rename or rewrite existing actions, or overwrite their outputs. This suggests a fourth option not covered in the OP: essentially, allow for the possibility that actions may take unplanned additional arguments, which are used to perform sensitivity analyses and direct outputs to a different location. The PR linked above is based on a codebase that hadn't anticipated this usage, so it's quite hacky. But if outputs are organised in a …
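A rough sketch of the shape of this idea in a `project.yaml` (action names, script names, dates, and the argument-passing convention below are illustrative, not taken from the linked PR):

```yaml
# Hypothetical fragment of a project.yaml (other top-level keys omitted).
# The original action is left untouched; a new action reuses the same
# script with extra arguments that redirect outputs to a sensitivity folder.
actions:

  model_plr_postest:
    run: r:latest analysis/model_plr_postest.R
    outputs:
      moderately_sensitive:
        estimates: output/plr_postest_estimates.csv

  model_plr_postest_sensitivity:
    # --start_date and --outdir are made-up script arguments for this sketch
    run: r:latest analysis/model_plr_postest.R --start_date 2021-01-13 --outdir output/sensitivity
    outputs:
      moderately_sensitive:
        estimates: output/sensitivity/plr_postest_estimates.csv
```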
-
This is a question about a specific problem, but the list of potential solutions could be helpful for lots of different sensitivity analysis type scenarios.
Problem
My study runs analyses across a cohort of people who were vaccinated within a specific date range. The dates are declared in the "design" script and then used throughout the codebase.
Let's say I want to do a sensitivity analysis where I use a different start date. I'm a sucker for softcoding, so it's literally a one-line commit to re-run the whole thing with a different start date. But this will overwrite the outputs that used the old start date, which I don't want, as it's a sensitivity analysis, not a redesign of the study.
Solutions?
I can think of four different approaches to this.
Archiving
Copy the outputs using the original start date into a new "original" folder, update the start date, re-run, and put the new outputs in a "sensitivity" folder. Ideally include a pointer to the state of the original/sensitivity commit in each folder.
This is far from ideal, reproducibility-wise. And the OpenSAFELY platform makes it a bit trickier because the outputs are on the server -- they'd need to be archived on the server, or released and then archived, so it means needing L3 server access at least once for each sensitivity analysis. It's maybe OK for a one-off, and it's easy to do, but it doesn't scale well at all.
Worth pointing out that in general, this practice of tweaking code, re-running, and archiving the outputs is a common workflow pattern for researchers.
Branching
Create a new branch with the new start date, make a new workspace based on that branch, and run. Also maybe ok for a one-off, and can be set up and run completely offline (doesn't need more than one trip to L3).
Unlike the archiving approach, I now have two (or more) branches that need to remain in sync if ever the two analyses need to be re-run (though I suppose merging an updated `main` branch onto the `sensitivity` branch isn't so difficult). Also doesn't scale well.
Parameterise using `project.yaml`
A more formal way to run the study across different start dates is to parameterise the script. Instead of specifying the start date as a fixed element of the study design, I pass it into the script via the `project.yaml`, and create separate actions for each script that needs to run using different start dates. These scripts have now acquired an additional `start_date` argument, and their associated actions have gone from, for example, `model_plr_postest` to `model_plr_postest_4jan` and `model_plr_postest_13jan`.
In this specific example, it requires quite a big refactor of the codebase even just for a very simple design change. Almost all scripts will need the new start date argument, which will be time-consuming to update and test, and it almost doubles the size of my https://github.com/opensafely/comparative-ve-research/blob/e8acbb525032172f615bd043dd60b6fdd772f027/project.yaml, though this is less of a problem given that the yaml actions are programmed.
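As a rough illustration of what the duplicated, date-specific actions could look like (action names, file paths, and dates below are illustrative, and the `project.yaml` details are a sketch rather than the real study's config):

```yaml
# Hypothetical fragment of a project.yaml (other top-level keys omitted).
# One action per start date, with the date passed through to the script
# as an extra command-line argument (as in the PR linked in the replies).
actions:

  model_plr_postest_4jan:
    run: r:latest analysis/model_plr_postest.R 2021-01-04
    outputs:
      moderately_sensitive:
        estimates: output/4jan/plr_postest_estimates.csv

  model_plr_postest_13jan:
    run: r:latest analysis/model_plr_postest.R 2021-01-13
    outputs:
      moderately_sensitive:
        estimates: output/13jan/plr_postest_estimates.csv
```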
Another problem is that many of the original actions have been renamed (`model_plr_postest` to `model_plr_postest_4jan`), which means that the job runner thinks they haven't been run (it sees them as new actions, not renamed actions). So I either re-run them all (which takes time) or somehow trick the job runner into recognising the existing outputs using the original start date as belonging to the renamed actions (which means mucking about with the `./metadata/manifest.json` file on the server, and requires a bit of tech support).
This approach might be made a bit neater if the parameterised jobs feature was implemented, but it still requires a refactor of the codebase.
Parameterise using a loop within the script
Within each script, create a loop that runs the code across the different start dates, saving outputs to a start-date-specific file/folder. The `project.yaml` doesn't need any new actions; rather, each action now produces more outputs.
The actions would need to be re-run, which would also mean re-running over the original start date even though those outputs already exist (unless there's some more `manifest.json` magic).
This isn't the right approach here, but I'm mentioning it as it might be for other sensitivity analyses.
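From the `project.yaml` side, this would look something like the sketch below (again, names and paths are illustrative): the action stays as it is and simply declares one set of outputs per start date.

```yaml
# Hypothetical fragment of a project.yaml (other top-level keys omitted).
# The script loops over both start dates internally, so the single action
# remains but declares start-date-specific outputs.
actions:

  model_plr_postest:
    run: r:latest analysis/model_plr_postest.R
    outputs:
      moderately_sensitive:
        estimates_4jan: output/4jan/plr_postest_estimates.csv
        estimates_13jan: output/13jan/plr_postest_estimates.csv
```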
Which is best?
None of the above approaches are ideal. "Parameterise using `project.yaml`" is a bit cumbersome for this particular example and needs tech team input, but is probably the best approach for now.
Or am I missing a trick?