Shadow Directory for "bidsapp" mode #179

pvandyken · 2022-07-12T20:22:43Z

pvandyken
Jul 12, 2022
Maintainer

Been working with snakemake shadow dirs lately, and was wondering if this idea might find useful application for snakebids bidsapps. I've spent a bit of time thinking it out, and it's definitely not a "no-brainer let's do it", as it comes with one or two critical consequences for app design, but I thought I'd lay it out as I see it.

It's already been discussed and decided that in bidsapp mode, the snakemake working directory should be changed to the user's output dir (see #61). The problem created here is that paths relative to the snakemake directory can no longer be resolved. The current workaround is for developers to use workflow.basedir + "path" or an equivalent every time they reference a path in their workflow.
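To make the problem concrete, here is a minimal sketch using only the standard library. The directory names and the `resolve_resource` helper are hypothetical stand-ins; `basedir` plays the role that workflow.basedir plays inside a Snakefile.

```python
import os
import tempfile
from pathlib import Path


def resolve_resource(basedir: Path, relative: str) -> Path:
    """Anchor a workflow-relative path to the workflow's own directory."""
    return Path(basedir) / relative


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    snakemake_dir = root / "app"      # stand-in for the installed workflow
    output_dir = root / "output"      # stand-in for the user's output dir
    (snakemake_dir / "resources").mkdir(parents=True)
    (snakemake_dir / "resources" / "atlas.txt").write_text("atlas")
    output_dir.mkdir()

    previous_cwd = Path.cwd()
    os.chdir(output_dir)              # mimics the working-directory change in bidsapp mode
    try:
        # A bare relative path is now looked up under output_dir and is not found.
        bare_exists = Path("resources/atlas.txt").exists()
        # Anchoring the path to the workflow directory still resolves.
        anchored_exists = resolve_resource(snakemake_dir, "resources/atlas.txt").exists()
    finally:
        os.chdir(previous_cwd)        # restore so the temp dir can be cleaned up

print(bare_exists, anchored_exists)
```

The bare lookup fails while the anchored one succeeds, which is why every relative reference currently needs the basedir prefix.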

However, given that results, config, and .snakemake are all already privileged folder names in a snakemake app, it should be possible to do a "shallow shadow" of the snakemake directory, meaning a symlink to every top level folder and file. Along with these snakemake folders, symlinks for results, config, and .snakemake would point to the user's output directory. So the shadow dir would look something like this, where ---> points to a symlink destination:

shadow_dir
# Snakemake folders
├── resources ---> snakemake_dir/resources
├── workflow ---> snakemake_dir/workflow
├── foo.bar ---> snakemake_dir/foo.bar
...
# Output folders
├── results ---> user_output_dir
├── config ---> user_output_dir/config (or user_output_dir/code/app_name)
└── .snakemake ---> user_output_dir/.snakemake
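One way the layout above could be built is sketched below. This is an illustration under stated assumptions, not existing Snakebids code: `build_shadow_dir` and `OUTPUT_LINKS` are hypothetical names, and the sketch assumes a platform where unprivileged symlinks are available.

```python
import tempfile
from pathlib import Path

# Reserved names that the proposal redirects to the user's output directory.
OUTPUT_LINKS = {
    "results": ".",              # results    ---> user_output_dir
    "config": "config",          # config     ---> user_output_dir/config
    ".snakemake": ".snakemake",  # .snakemake ---> user_output_dir/.snakemake
}


def build_shadow_dir(snakemake_dir: Path, output_dir: Path, shadow_dir: Path) -> Path:
    """Create a shallow shadow of snakemake_dir inside shadow_dir."""
    snakemake_dir = Path(snakemake_dir).resolve()
    output_dir = Path(output_dir).resolve()
    shadow_dir = Path(shadow_dir)
    shadow_dir.mkdir(parents=True, exist_ok=True)

    # Symlink every top-level file and folder of the workflow, skipping reserved names.
    for entry in snakemake_dir.iterdir():
        if entry.name not in OUTPUT_LINKS:
            (shadow_dir / entry.name).symlink_to(entry, target_is_directory=entry.is_dir())

    # Point the reserved names at the output directory, creating targets as needed.
    for name, subpath in OUTPUT_LINKS.items():
        target = output_dir / subpath
        target.mkdir(parents=True, exist_ok=True)
        (shadow_dir / name).symlink_to(target, target_is_directory=True)
    return shadow_dir


with tempfile.TemporaryDirectory() as tmp:
    app, out = Path(tmp, "snakemake_dir"), Path(tmp, "user_output_dir")
    (app / "resources").mkdir(parents=True)
    (app / "workflow").mkdir()
    shadow = build_shadow_dir(app, out, Path(tmp, "shadow_dir"))
    listing = sorted(p.name for p in shadow.iterdir())

print(listing)
```

Only the top level is linked, so the cost of building the shadow is independent of how large the workflow or output directories are.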

This approach offers two primary advantages:

1. Anything in the snakemake directory can be accessed by simple relative paths. No more need for workflow.basedir, a semi-frequent source of confusion.
2. As far as symlinking approaches to this problem go, this is the cleanest possible way. If the app is interrupted, no extra garbage is left behind (the shadow_dir would be in a tmpdir and will get cleaned by the system).

The primary disadvantage is that any root level files or directories created in the workflow will be saved to the shadow_dir and will not persist. Two obvious examples would be the logs/ and benchmarks/ folder, both of which should be saved to the output directory. To work around this, we would have to register any such folders in the run.py file so that Snakebids can make the relevant symlinks ahead of time. The API might look something like this:

# "bind" is used in the sense of a singularity bind: the
# `logs` and `benchmarks` folders would be bound to identically
# named folders in the output folder.
SnakeBidsApp("path/to/snakemake", bind=["logs", "benchmarks"]).run_snakemake()
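Behind such a `bind` argument, the work could be as small as the sketch below. The `bind_dirs` helper is hypothetical and only illustrates the idea; it is not part of Snakebids.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def bind_dirs(shadow_dir: Path, output_dir: Path, names: Iterable[str]) -> None:
    """Symlink shadow_dir/<name> to output_dir/<name> for each bound name."""
    for name in names:
        target = Path(output_dir).resolve() / name
        target.mkdir(parents=True, exist_ok=True)  # make sure the destination exists
        link = Path(shadow_dir) / name
        if link.is_symlink() or link.exists():
            raise FileExistsError(f"cannot bind {name!r}: {link} already exists")
        link.symlink_to(target, target_is_directory=True)


with tempfile.TemporaryDirectory() as tmp:
    shadow, output = Path(tmp, "shadow"), Path(tmp, "output")
    shadow.mkdir()
    bind_dirs(shadow, output, ["logs", "benchmarks"])
    # A file written through the shadow dir lands in the output dir.
    (shadow / "logs" / "rule.log").write_text("done")
    persisted = (output / "logs" / "rule.log").read_text()

print(persisted)
```

Refusing to overwrite an existing entry keeps a bound name from silently replacing a symlink to a workflow folder of the same name.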

So it's trading one annoyance for another. In defense of the "binding" approach described above, it centralizes the workflow modifications to one line. One no longer needs to remember to use workflow.basedir for every relative path in the workflow. In fact, arguably, a well-behaved workflow will only output files to the results (via config['output_dir']), logs, and benchmarks dirs, and the above line could be put in the boilerplate app, so friction could be reduced to nearly 0 for devs without special needs.

On the other hand, it increases the complexity of Snakebids (more things that can break, etc), and the effort required to implement is perhaps not worth the potential payoff. So I'm not completely sold on the idea.

Alternative Approach

One similar (but less good I think) idea in the same vein of "binding" or "registration" is to allow devs to make symlinks to root level snakemake_dir directories in the output directory. For example, one could "bind" resources and that would make a resources symlink right in the output_dir which would point to the resources directory in the snakemake dir. That way, any relative paths in the workflow pointing into resources would continue to work. When the app finishes, these symlinks would be deleted.

The problem here is that should the app get interrupted, the symlinks would remain in the output_dir as extra garbage. This approach is thus not clean like the above.
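The cleanup weakness can be seen in a minimal sketch of this alternative. The `temporary_links` context manager is a hypothetical name used for illustration only.

```python
import contextlib
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


@contextlib.contextmanager
def temporary_links(snakemake_dir: Path, output_dir: Path, names: Iterable[str]) -> Iterator[None]:
    """Link output_dir/<name> to snakemake_dir/<name> for the duration of the block."""
    created = []
    try:
        for name in names:
            link = Path(output_dir) / name
            link.symlink_to(Path(snakemake_dir).resolve() / name, target_is_directory=True)
            created.append(link)
        yield
    finally:
        # Runs on normal exit, on exceptions, and on Ctrl-C, but not when the
        # process is killed outright (e.g. SIGKILL). In that case the links remain.
        for link in created:
            link.unlink()


with tempfile.TemporaryDirectory() as tmp:
    app, out = Path(tmp, "snakemake_dir"), Path(tmp, "output_dir")
    (app / "resources").mkdir(parents=True)
    out.mkdir()
    with temporary_links(app, out, ["resources"]):
        present_during_run = (out / "resources").is_symlink()
    present_after_run = (out / "resources").is_symlink()

print(present_during_run, present_after_run)
```

Because removal depends on the process living long enough to reach the `finally` block, a hard kill leaves the links in the user's output directory, whereas a shadow dir in a tmpdir leaves nothing there.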

pvandyken · 2022-07-13T02:24:55Z

pvandyken
Jul 13, 2022
Maintainer Author

One other point in favour of the shadow dir approach comes from Snakemake's new provenance behaviour. If workflow.basedir is used in any of params, input, or shell, and the basedir changes across runs, Snakemake will trigger a rerun of the affected rules, even though nothing important has actually changed. The only way around it would be to constantly --clear-metadata or disable all those --rerun-triggers.
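The effect can be illustrated with a toy model. This is not Snakemake's actual implementation; `params_changed`, `make_params`, and the scratch paths are hypothetical and only show why an absolute basedir makes otherwise identical params compare as different.

```python
from pathlib import Path


def params_changed(recorded: dict, current: dict) -> bool:
    """Toy rerun check: any difference from the recorded params counts as a change."""
    return recorded != current


def make_params(basedir: str, use_basedir: bool) -> dict:
    """Build rule params from an absolute basedir or from a plain relative path."""
    relative = "resources/atlas.nii"
    atlas = str(Path(basedir) / relative) if use_basedir else relative
    return {"atlas": atlas}


# The same app installed to two different scratch locations (hypothetical paths).
first_install, second_install = "/scratch/job1/app", "/scratch/job2/app"

absolute_rerun = params_changed(
    make_params(first_install, use_basedir=True),
    make_params(second_install, use_basedir=True),
)
relative_rerun = params_changed(
    make_params(first_install, use_basedir=False),
    make_params(second_install, use_basedir=False),
)
print(absolute_rerun, relative_rerun)
```

With relative paths, which the shadow dir would make possible, the recorded values stay identical no matter where the app is installed.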

This is especially relevant for folks installing their snakebids apps to temporary, local scratch directories, where the paths may change constantly. I imagine this use case is not uncommon, especially on the cluster. And ideally, our official policy would not be that "for this app to work, this snakemake feature must be disabled". I would thus argue that even if we don't go with the shadow dir, we should find another solution to this. (I'll raise a dedicated issue for this in case there are any other ideas.)

0 replies

tkkuehn · 2022-07-15T15:33:39Z

tkkuehn
Jul 15, 2022
Maintainer

Thanks for writing this up! It seems like a reasonable avenue to explore but I do think the details will matter a lot.

I think solving the provenance issue is the most important impact of this proposal. We certainly don't want users to need to disable useful snakemake features to use Snakebids, and we don't want useless reruns as a result of our architecture.

I think the increased complexity is something we'll need to work hard to address if we implement this. The existence (and location) of a temporary shadow dir will probably not be obvious to an inexperienced user and will make workflow debugging a lot harder in the absence of good logs. My wishlist on this topic would be (in order of implementation complexity):

1. Clear documentation of everything we do behind the scenes when you run a Snakebids workflow.
2. More granular (maybe gated behind a verbosity CLI arg) logs of everything that happens in the run-up to calling snakemake (i.e. parsing command line args, creating a shadow dir, what's getting symlinked and whether it's due to a bind).
3. A warning if a file or unbound directory is written to the shadow dir.
4. Some kind of static analysis of a Snakefile to warn if unbound files/directories are being made at the top level.
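The warning for unbound writes could be a simple scan after the run, as in the sketch below. It assumes everything legitimate at the top level of the shadow dir is a symlink; `find_unbound`, `warn_unbound`, and the logger name are hypothetical.

```python
import logging
import tempfile
from pathlib import Path
from typing import List

logger = logging.getLogger("snakebids.shadow")  # hypothetical logger name


def find_unbound(shadow_dir: Path) -> List[Path]:
    """Return top-level entries of shadow_dir that are real files or folders."""
    return sorted(p for p in Path(shadow_dir).iterdir() if not p.is_symlink())


def warn_unbound(shadow_dir: Path) -> List[Path]:
    """Log one warning per unbound entry and return the entries found."""
    unbound = find_unbound(shadow_dir)
    for path in unbound:
        logger.warning("%s was written to the shadow dir and will not persist", path.name)
    return unbound


with tempfile.TemporaryDirectory() as tmp:
    shadow, output = Path(tmp, "shadow"), Path(tmp, "output")
    shadow.mkdir()
    output.mkdir()
    (shadow / "results").symlink_to(output.resolve(), target_is_directory=True)
    (shadow / "logs").mkdir()  # created by the workflow, never bound
    unbound_names = [p.name for p in warn_unbound(shadow)]

print(unbound_names)
```

Files written inside a symlinked folder are not flagged, since they already live in the link's destination.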

A couple of other questions (that may reveal errors in how I'm understanding or thinking about this proposal):

1. Would it be useful to add the input bids_dir to the shadow dir (as input/ or something)?
2. Is there any reason other than consistency with some Snakemake docs to bind logs and benchmarks to the shadow dir instead of encouraging users to put them under results?

1 reply

pvandyken Jul 15, 2022
Maintainer Author

Is there any reason other than consistency with some Snakemake docs to bind logs and benchmarks to the shadow dir instead of encouraging users to put them under results?

Sometimes I get confused between best practice and my practice, and this is such a case. There's no official Snakemake recommendation for "logs" and "benchmarks", that was something I made up a long time ago and got used to. My apologies, and thanks for drawing attention to that.

pvandyken · 2022-07-22T15:31:10Z

pvandyken
Jul 22, 2022
Maintainer Author

I came across a good reason not to use shadow dirs: Snakemake has a number of flags that write secondary outputs or configure in-app parameters (e.g. --report, --shadow-prefix, --stats, etc). Because these paths are provided by the user, they are unpredictable, and if any are given as a relative path (a fairly natural thing to do, especially for args like --report), the resulting files will be written inside the shadow dir and will not persist.

I can see two primary workarounds:

1. Document that users should use absolute paths. This is a bad solution in my opinion: it relies on users reading the entire documentation, and needing to put absolute paths for these types of args is strange.
2. Expect snakebids app developers to build support for these snakemake args into their apps. For instance, the dev could add their own --report arg to the app and absolutize the path provided. While this type of solution may be worthwhile on its own merit (see below for additional weirdness it would solve), it carries two problems:
   a. It places more burden on app developers, and makes it harder to take advantage of all that snakemake has to offer (of course, we can help set defaults on the snakemake side, e.g. making a --report module available).
   b. It will be quite difficult to support all the valid snakemake arguments that take paths across all versions of snakemake. Some args don't exist in old versions, new ones are occasionally added, etc.
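The second workaround might look like the following sketch for a single flag. The app name and helper functions are hypothetical; the key assumption is that parsing happens before the working directory is changed, so the path resolves against the directory the user invoked the app from.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def absolute_path(value: str) -> Path:
    """argparse type: resolve a path against the current working directory."""
    return Path(value).expanduser().resolve()


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="my_bidsapp")  # hypothetical app name
    parser.add_argument(
        "--report",
        type=absolute_path,
        default=None,
        help="where to write the Snakemake report",
    )
    return parser


def snakemake_args(argv: Optional[List[str]] = None) -> List[str]:
    """Translate the app-level flag into arguments to forward to snakemake."""
    args = build_parser().parse_args(argv)
    return ["--report", str(args.report)] if args.report else []


forwarded = snakemake_args(["--report", "report.html"])
print(forwarded)
```

Each path-taking flag needs its own wrapper like this, which is exactly the maintenance burden described in (a) and (b).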

Of course, this is also the source of some current weirdness: right now if you make a report, the path will be relative to whatever output dir you provide, which is unintuitive. But that behaviour, at least, can be documented.

All in all, this issue seems critical and I don't currently see a way around it. But I'd love to hear other ideas that might resolve it.

0 replies

kaitj · 2022-07-22T17:01:09Z

kaitj
Jul 22, 2022
Maintainer

Ahh, this reminded me of a conversation about VTK versions at OHBM... 😅 On that note, for the short term, do we need to temporarily constrain the Snakemake version to avoid any potential issues (not sure what version this was introduced in)? I can understand why we wouldn't want to constrain versions, but I also think this would give us time to implement any features / fixes as necessary for a given version.

To your first point, I agree - don't think we should enforce absolute paths. I think we had similar discussions when implementing pybids.

Just for clarification, you mentioned the new provenance behaviour a couple of posts ago with workflow.basedir - just trying to think of when this basedir would change?

1 reply

pvandyken Jul 22, 2022
Maintainer Author

It would change whenever the snakemake app is reinstalled to a new directory. It's especially relevant for pip-packaged apps installed on localscratch on the cluster.

tkkuehn · 2022-07-25T20:14:22Z

tkkuehn
Jul 25, 2022
Maintainer

Ah, good catch that there are Snakemake-provided ways to introduce files relative to the working directory that aren't known by the workflow developer ahead of time.

I forget whether anything like this was ever suggested, but maybe we could inspect the shadow directory, create a directory {output_dir}/snakebids_unbound_files, and move any files or directories to that directory, logging a warning that we've done this. I don't love this either, but it lets the shadow directory go forth with a minimum of fuss for app developers (but probably some confusion for app users). This may also not work for some reason I'm not thinking of.
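A rough sketch of that rescue step, assuming anything at the top level of the shadow dir that is not a symlink was created during the run. The `rescue_unbound` name and the logger name are hypothetical, and name collisions with files rescued by an earlier run are not handled here.

```python
import logging
import shutil
import tempfile
from pathlib import Path
from typing import List

logger = logging.getLogger("snakebids.shadow")  # hypothetical logger name


def rescue_unbound(shadow_dir: Path, output_dir: Path) -> List[Path]:
    """Move real (non-symlink) top-level entries of shadow_dir into the output dir."""
    rescue_dir = Path(output_dir) / "snakebids_unbound_files"
    moved = []
    for entry in sorted(Path(shadow_dir).iterdir()):
        if entry.is_symlink():
            continue  # symlinked entries already live outside the shadow dir
        rescue_dir.mkdir(parents=True, exist_ok=True)
        destination = rescue_dir / entry.name
        shutil.move(str(entry), str(destination))
        logger.warning("Moved unbound %s to %s", entry.name, destination)
        moved.append(destination)
    return moved


with tempfile.TemporaryDirectory() as tmp:
    shadow, output = Path(tmp, "shadow"), Path(tmp, "output")
    shadow.mkdir()
    output.mkdir()
    (shadow / "results").symlink_to(output.resolve(), target_is_directory=True)
    (shadow / "report.html").write_text("<html></html>")  # e.g. from a relative --report
    rescued_names = [p.name for p in rescue_unbound(shadow, output)]
    report_persisted = (output / "snakebids_unbound_files" / "report.html").exists()

print(rescued_names, report_persisted)
```

The rescue directory is only created when there is something to move, so a well-behaved workflow leaves the output directory untouched.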

0 replies
