Crew worker terminating itself for unknown reasons #176
Replies: 3 comments 33 replies
-
Interesting. Come to think of it, I remember seeing this in one of my own large pipelines as well a while back. That trace is a helpful clue, and it seems to point to #141 and shikokuchuo/mirai#87 (comment). To solve #141, each worker terminates itself with the call at line 466 in e62928e. @shikokuchuo, is it possible that the …
-
FYI, I implemented both local and worker-level memory logging via the new …
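Independent of the exact interface, worker-level memory logging amounts to periodically recording the worker process's resident memory. A rough, hand-rolled sketch of that idea (Linux-only; `log_rss()` is a hypothetical helper, not part of `crew`):

```r
# Rough stand-in for worker-level memory logging (Linux-only, hypothetical
# helper, not part of crew): append the worker's resident set size to a file.
log_rss <- function(path = "memory.log") {
  status <- readLines("/proc/self/status")      # per-process kernel stats
  rss <- grep("^VmRSS:", status, value = TRUE)  # resident set size line
  cat(format(Sys.time()), Sys.getpid(), rss, "\n",
      file = path, append = TRUE)
}
log_rss()  # call periodically from the worker, e.g. inside long-running targets
```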
-
@multimeric, is it possible that this original issue was an instance of #189?
-
I'm using `crew` + `crew.cluster` via `targets`. In my pipeline, I have a specific target that consistently crashes the worker, although I can't work out why. I would like to resolve this issue so that the `targets` pipeline can finish. It doesn't seem to be memory related, as I've set an `rlimit`/`ulimit` which is never exceeded. For this reason, I also don't think that Slurm, my job scheduler, is killing the job.
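For reference, roughly what that setup looks like (a minimal sketch, not my exact configuration; the controller arguments are illustrative and depend on the `crew.cluster` version):

```r
# _targets.R -- minimal sketch of a crew + crew.cluster + targets setup.
# Argument names and values are illustrative; check your crew.cluster version.
library(targets)
library(crew.cluster)

tar_option_set(
  controller = crew_controller_slurm(
    name = "pipeline",   # label for this controller
    workers = 4L,        # up to 4 transient Slurm workers
    seconds_idle = 60    # idle workers shut down on their own
  )
)

list(
  tar_target(big_input, load_data()),        # load_data() is a placeholder
  tar_target(crashy, heavy_step(big_input))  # the target whose worker keeps dying
)
```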
In `targets`, I just get the normal "the worker has died" message:

In the Slurm stdout/stderr log, the only info I get is:
I ran `strace` on the relevant R process (PID 28977), and I can see that it is actually terminating itself (`si_pid=28977`). Why might `crew` (or some other part of the stack, like `mirai`, `nanonext`, etc.) be doing this?
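For context on how I'm reading the trace: `si_pid` equal to the traced PID means the fatal signal was sent by the process to itself, i.e. something equivalent to the following (purely illustrative, not necessarily what `crew` does):

```r
# Illustration only: an R process sending SIGTERM to its own PID produces a
# signal whose si_pid matches the receiving process, as seen in the strace.
# Do not run this in a session you care about -- it terminates the session.
tools::pskill(Sys.getpid(), tools::SIGTERM)
```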
Here are the last lines from the strace:
Additional diagnostics here: