SLURM job fails from controller running targets pipeline - Timed out
#37
Replies: 1 comment 5 replies
-
That timeout error can happen if the You might also have better luck upgrading to the CRAN versions of
If you are using the latest |
-
I'm trying to develop a working example/template for doing HPC using `targets` and `crew.cluster` to run `targets` pipelines on a SLURM cluster. I've been able to get it sort of working, but I'm running into a few problems that I'm hoping I can ask for input on.
the code
First, here's the pipeline, which builds off of the simple example in the `targets` docs but extends it to use two different types of controllers, in a way that I think is similar to how most people's pipelines might look:
The functions file is here:
problem
I've been able to get the controller to send the jobs to the SLURM launcher (the job actually launches), and the pipeline begins to run (I can see it saves the output from the `plot` target), but it always fails after what I suspect is the end of the targets that use the smaller worker. Here's the output from the R console that I open in the project directory and launch the pipeline from (with `tar_make()`):
I can't seem to figure out where the `target NA error: 'errorValue' int 5 | Timed out` is coming from. I have tried adding `seconds_timeout = 60` to the controllers, but that doesn't seem to change anything.
The larger worker does seem to launch: it gets queued (`dispatched target big`), but as soon as the smaller worker fails it either gets cancelled, or it launches, runs for about 30 seconds, and fails with error 143, which corresponds to the Linux signal `SIGTERM`. That makes me think the `targets` pipeline itself is somehow killing it.
The error files for both workers aren't very interpretable (unless I'm missing something?). Here's the one from the smaller worker:
Anyways, I'm curious if there's anything obvious I'm doing here that's causing the small worker to fail even though it does successfully move through the pipeline as far as I can tell.
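As a quick check on the exit-code reading above: shells conventionally report a process killed by signal N as exit status 128 + N, so 143 maps to signal 15. A minimal sketch in R, assuming a Unix-like system where the `tools` package defines `SIGTERM`:

```r
# Convention: exit status = 128 + signal number when a process is killed by a signal.
exit_status <- 143L
signal_number <- exit_status - 128L

# Print the decoded signal number.
print(signal_number)

# Compare with the SIGTERM constant from the tools package
# (defined on Unix-like platforms; may be NA elsewhere).
print(identical(signal_number, tools::SIGTERM))
```

This only confirms that the worker was sent a termination signal; it does not say which process sent it.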
Any help very appreciated 🙏
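For readers without access to the original files, here is a rough sketch of the kind of two-controller `_targets.R` described in this post. It is not the original pipeline: the controller names, worker counts, the `data.csv` path, and the `run_big_job()` function are hypothetical, `get_data()`, `fit_model()`, and `plot_model()` are assumed to live in `R/functions.R` as in the `targets` walkthrough, and SLURM resource arguments (memory, CPUs, wall time) are left out because their names differ between `crew.cluster` versions.

```r
# _targets.R: illustrative sketch only, not the original pipeline.
library(targets)
library(crew)
library(crew.cluster)

# Two SLURM controllers with hypothetical names and worker counts.
# SLURM resource settings are omitted; check the argument names for
# the installed crew.cluster version before adding them.
controller_small <- crew_controller_slurm(
  name = "small",
  workers = 2,
  seconds_idle = 120,
  seconds_timeout = 60
)
controller_big <- crew_controller_slurm(
  name = "big",
  workers = 1,
  seconds_idle = 120,
  seconds_timeout = 60
)

tar_option_set(
  controller = crew_controller_group(controller_small, controller_big),
  # Targets use the "small" controller unless they request another one.
  resources = tar_resources(
    crew = tar_resources_crew(controller = "small")
  )
)

tar_source() # load function definitions from the R/ directory

list(
  tar_target(file, "data.csv", format = "file"),
  tar_target(data, get_data(file)),
  tar_target(model, fit_model(data)),
  tar_target(plot, plot_model(model, data)),
  tar_target(
    big,
    run_big_job(model), # hypothetical heavy computation
    resources = tar_resources(
      crew = tar_resources_crew(controller = "big")
    )
  )
)
```

Routing each target through `tar_resources_crew(controller = ...)` is what lets a single pipeline mix small and large workers; everything else follows the standard `targets` layout.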