Replies: 2 comments
-
It seems to me that the parameterise option is the correct one.
Given there's no way of knowing for sure whether the outputs of a script would change if one of its parameters were different, I guess this exposes a limitation of our "don't rerun / always rerun" logic as currently implemented. @evansd and I had a discussion in about April 2020 about how the idea of a "workspace" should have copy and paste semantics available to a researcher, so you could say "treat this file as the output of that run". I think this might be the best solution longer term. The reproducibility angle should then be addressed by a "clean run" feature, which we expect all studies to do prior to publication, so we can assert that all final outputs have been generated from a known state.
-
This PR (opensafely/comparative-ve-research#18) demonstrates a quick way to introduce a new argument to an existing script/action to run a sensitivity analysis. There's no need to rename or rewrite existing actions, or overwrite their outputs. This suggests a fourth option not covered in the OP: essentially, allow for the possibility that actions may take unplanned additional arguments, which are used to perform sensitivity analyses and direct outputs to a different location. The PR linked above is based on a codebase that hadn't anticipated this usage, so it's quite hacky. But if outputs are organised in a …
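A rough sketch of the shape of this idea in a `project.yaml` (action names, script names, dates, and the argument-passing convention below are illustrative, not taken from the linked PR):

```yaml
# Hypothetical fragment of a project.yaml (other top-level keys omitted).
# The original action is left untouched; a new action reuses the same
# script with extra arguments that redirect outputs to a sensitivity folder.
actions:

  model_plr_postest:
    run: r:latest analysis/model_plr_postest.R
    outputs:
      moderately_sensitive:
        estimates: output/plr_postest_estimates.csv

  model_plr_postest_sensitivity:
    # --start_date and --outdir are made-up script arguments for this sketch
    run: r:latest analysis/model_plr_postest.R --start_date 2021-01-13 --outdir output/sensitivity
    outputs:
      moderately_sensitive:
        estimates: output/sensitivity/plr_postest_estimates.csv
```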
-
This is a question about a specific problem, but the list of potential solutions could be helpful for lots of different sensitivity analysis type scenarios.
Problem
My study runs analyses across a cohort of people who were vaccinated within a specific date range. The dates are declared in the "design" script and then used throughout the codebase.
Let's say I want to do a sensitivity analysis where I use a different start date. I'm a sucker for softcoding, so it's literally a one-line commit to re-run the whole thing with a different start date. But this will overwrite the outputs that used the old start date, which I don't want, as it's a sensitivity analysis, not a redesign of the study.
Solutions?
I can think of four different approaches to this.
Archiving
Copy the outputs using the original start date into a new "original" folder, update the start date, re-run, and put the new outputs in a "sensitivity" folder. Ideally include a pointer to the state of the original/sensitivity commit in each folder.
This is far from ideal, reproducibility-wise. And the OpenSAFELY platform makes it a bit trickier because the outputs are on the server -- they'd need to be archived on the server, or released and then archived, so it means needing L3 server access at least once for each sensitivity analysis. It's maybe OK for a one-off, and it's easy to do, but it doesn't scale well at all.
Worth pointing out that in general, this practice of tweaking code, re-running, and archiving the outputs is a common workflow pattern for researchers.
Branching
Create a new branch with the new start date, make a new workspace based on that branch, and run. Also maybe ok for a one-off, and can be set up and run completely offline (doesn't need more than one trip to L3).
Unlike the archiving approach, I now have two (or more) branches that need to remain in sync if ever the two analyses need to be re-run (though I suppose merging an updated `main` branch onto the `sensitivity` branch isn't so difficult). Also doesn't scale well.
Parameterise using `project.yaml`
A more formal way to run the study across different start dates is to parameterise the script. Instead of specifying the start date as a fixed element of the study design, I pass it into the script via the `project.yaml`, and create separate actions for each script that needs to run using different start dates. These scripts have now acquired an additional `start_date` argument, and their associated actions have gone from, for example, `model_plr_postest` to `model_plr_postest_4jan` and `model_plr_postest_13jan`.
In this specific example, it requires quite a big refactor of the codebase even just for a very simple design change. Almost all scripts will need the new start date argument, which will be time-consuming to update and test, and it almost doubles the size of my https://github.com/opensafely/comparative-ve-research/blob/e8acbb525032172f615bd043dd60b6fdd772f027/project.yaml, though this is less of a problem given that the yaml actions are programmed.
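As a rough illustration of what the duplicated, date-specific actions could look like (action names, file paths, and dates below are illustrative, and the `project.yaml` details are a sketch rather than the real study's config):

```yaml
# Hypothetical fragment of a project.yaml (other top-level keys omitted).
# One action per start date, with the date passed through to the script
# as an extra command-line argument (as in the PR linked in the replies).
actions:

  model_plr_postest_4jan:
    run: r:latest analysis/model_plr_postest.R 2021-01-04
    outputs:
      moderately_sensitive:
        estimates: output/4jan/plr_postest_estimates.csv

  model_plr_postest_13jan:
    run: r:latest analysis/model_plr_postest.R 2021-01-13
    outputs:
      moderately_sensitive:
        estimates: output/13jan/plr_postest_estimates.csv
```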
Another problem is that many of the original actions have been renamed (`model_plr_postest` to `model_plr_postest_4jan`), which means that the job runner thinks they haven't been run (it sees them as new actions, not renamed actions). So I either re-run them all (which takes time) or somehow trick the job runner into recognising the existing outputs using the original start date as belonging to the renamed actions (which means mucking about with the `./metadata/manifest.json` file on the server, and requires a bit of tech support).
This approach might be made a bit neater if the parameterised jobs feature was implemented, but it still requires a refactor of the codebase.
Parameterise using a loop within the script
Within each script, create a loop that runs the code across the different start dates, saving outputs to a start-date-specific file/folder. The `project.yaml` doesn't need any new actions; rather, each action now produces more outputs.
The actions would need to be re-run, which would also mean re-running over the original start date even though those outputs already exist (unless there's some more `manifest.json` magic).
This isn't the right approach here, but I'm mentioning it as it might be for other sensitivity analyses.
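From the `project.yaml` side, this would look something like the sketch below (again, names and paths are illustrative): the action stays as it is and simply declares one set of outputs per start date.

```yaml
# Hypothetical fragment of a project.yaml (other top-level keys omitted).
# The script loops over both start dates internally, so the single action
# remains but declares start-date-specific outputs.
actions:

  model_plr_postest:
    run: r:latest analysis/model_plr_postest.R
    outputs:
      moderately_sensitive:
        estimates_4jan: output/4jan/plr_postest_estimates.csv
        estimates_13jan: output/13jan/plr_postest_estimates.csv
```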
Which is best?
None of the above approaches are ideal. "Parameterise using `project.yaml`" is a bit cumbersome for this particular example and needs tech team input, but is probably the best approach for now.
Or am I missing a trick?