Worker connection timeouts in a large pipeline #179

wlandau · 2024-08-04T16:43:46Z · Maintainer

To better diagnose #176, I revisited the targets pipeline at https://github.com/openpharma/brms.mmrm/tree/main/vignettes/sbc, a simulation study to validate a Bayesian model using simulation-based calibration (SBC) checking.

I turned on local memory monitoring using the log_resources argument of the controller. Memory usage of the local targets process and mirai dispatcher process looks fine:

And even after running for 2 days, the dispatcher and pipeline are still running. So that's at least good news.

However, I am seeing a lot of timeouts when the workers first try to dial in, and the worker logs look strange. The targets pipeline tries to keep 50 workers going at a time, and if I count the lines in each worker instance log file, I see:

$ wc *.o*
     25     190    3284 crew-8998d30364ccd35cc762e84c-10-7dd58310749e9bc0fe6c2acd.o139220417
     25     190    3280 crew-8998d30364ccd35cc762e84c-10-d8e0091466d2e3dd07b0cd44.o139177068
     25     190    3282 crew-8998d30364ccd35cc762e84c-11-6460566e27c39340faef4e0b.o139177069
     25     190    3279 crew-8998d30364ccd35cc762e84c-11-d705eaad819756d95a456e35.o139220418
     25     190    3280 crew-8998d30364ccd35cc762e84c-12-2eb25b69a0e3b7a71485dc8e.o139220419
     25     190    3301 crew-8998d30364ccd35cc762e84c-1-240f6d061df59dbc0061a7e7.o139177049
     25     190    3283 crew-8998d30364ccd35cc762e84c-12-aa9db6790520e0a198ea1f99.o139177070
     25     190    3283 crew-8998d30364ccd35cc762e84c-13-7d13e2ce9166a8829e0b8c53.o139177071
     25     190    3281 crew-8998d30364ccd35cc762e84c-13-9202a7b12134a746b5c8fd16.o139220420
     25     190    3284 crew-8998d30364ccd35cc762e84c-14-482043d825d448efb84b2921.o139177072
     25     190    3278 crew-8998d30364ccd35cc762e84c-14-8bceebca665770944c120ada.o139220421
     25     190    3283 crew-8998d30364ccd35cc762e84c-15-0f828bd15a6d8cd19eabdb46.o139220422
     25     190    3283 crew-8998d30364ccd35cc762e84c-15-d85f5e1b860c78ed71e7de5c.o139177073
     25     190    3295 crew-8998d30364ccd35cc762e84c-16-0cfb51f0c945e328064c507f.o139177074
     25     190    3276 crew-8998d30364ccd35cc762e84c-16-d7f1c3f1d113f734ba9a2a7c.o139220424
     25     190    3283 crew-8998d30364ccd35cc762e84c-17-0daa81edb980bc5cb8e7c2fe.o139177075
     25     190    3281 crew-8998d30364ccd35cc762e84c-17-c3eac465551bb125fca02652.o139220423
     25     190    3282 crew-8998d30364ccd35cc762e84c-18-7ed5b450b367b3b443e3929d.o139177077
     25     190    3284 crew-8998d30364ccd35cc762e84c-18-95c1f97b81f32c885d6b463c.o139220425
     25     190    3281 crew-8998d30364ccd35cc762e84c-19-72c904cb0e61a790a4026d4c.o139220426
     25     190    3279 crew-8998d30364ccd35cc762e84c-19-80460ae91ee24f444a82ecca.o139177076
     25     190    3281 crew-8998d30364ccd35cc762e84c-1-a5c82bbe7a9b833b3f3aec34.o139220408
  29783  196926 1220136 crew-8998d30364ccd35cc762e84c-20-5035067e96d7145af13f76c3.o139177078
  25248  167059 1036590 crew-8998d30364ccd35cc762e84c-21-920539f1e82b789da59e4780.o139177079
     25     190    3284 crew-8998d30364ccd35cc762e84c-22-215a0414c441057061967616.o139177080
     25     190    3284 crew-8998d30364ccd35cc762e84c-22-d033eaa8da97c94eaa158210.o139220427
     25     190    3281 crew-8998d30364ccd35cc762e84c-23-1f4a0d65a4224fc105108030.o139177081
     25     190    3284 crew-8998d30364ccd35cc762e84c-23-23f2b47222c15df2c04a9c5d.o139220429
     25     190    3286 crew-8998d30364ccd35cc762e84c-24-693fdf95270a708165877e53.o139177082
     25     190    3285 crew-8998d30364ccd35cc762e84c-24-9e4b2029359ef54882fe78d9.o139220428
     25     190    3282 crew-8998d30364ccd35cc762e84c-25-7cbc4a68df0e9a1ef390120a.o139220431
     25     190    3283 crew-8998d30364ccd35cc762e84c-25-ba23c3430cace4d178a02cdc.o139177083
     25     190    3282 crew-8998d30364ccd35cc762e84c-26-53be68d0f38b925636c07133.o139177086
     25     190    3280 crew-8998d30364ccd35cc762e84c-26-e14869bb187b547ceaca5a5a.o139220430
     25     190    3295 crew-8998d30364ccd35cc762e84c-27-38366e0e23db6f1f1220570a.o139177084
     25     190    3283 crew-8998d30364ccd35cc762e84c-27-5405e5c459e9333b4a63a510.o139220432
     25     190    3285 crew-8998d30364ccd35cc762e84c-2-7892d96e94fbbd146837c498.o139220409
     25     190    3282 crew-8998d30364ccd35cc762e84c-28-32a4bd05267846a1907b720d.o139177087
     25     190    3287 crew-8998d30364ccd35cc762e84c-28-934469cbe80f583b9714ac42.o139220433
     25     190    3285 crew-8998d30364ccd35cc762e84c-29-2b9a325f67e18c603b46a321.o139177085
     25     190    3284 crew-8998d30364ccd35cc762e84c-29-727f73f97181bc19bac2fb77.o139220434
     25     190    3277 crew-8998d30364ccd35cc762e84c-2-a017551c126ef20927edd347.o139177052
     25     190    3281 crew-8998d30364ccd35cc762e84c-30-6bed2df7258d26c3150a53c3.o139177089
     25     190    3282 crew-8998d30364ccd35cc762e84c-30-e1b0a8a3fa398bfb8f025ee2.o139220435
  27725  183358 1136751 crew-8998d30364ccd35cc762e84c-31-9b42a6e3b01cfdb6feaab3d6.o139177088
     25     190    3281 crew-8998d30364ccd35cc762e84c-32-7d0799db358fd601a17a9072.o139177090
     25     190    3279 crew-8998d30364ccd35cc762e84c-32-d77f5b79b90bf156a14c2773.o139220436
  27216  180026 1116696 crew-8998d30364ccd35cc762e84c-33-0cc8dc5ae537d4fb542145d1.o139177093
     25     190    3283 crew-8998d30364ccd35cc762e84c-34-2ffa3a57109ecc4c98c94beb.o139220437
     25     190    3283 crew-8998d30364ccd35cc762e84c-34-a77ec36d3e8bfc55431c7af7.o139177092
     25     190    3283 crew-8998d30364ccd35cc762e84c-35-2496ebdfd88902ffa535666c.o139177091
     25     190    3285 crew-8998d30364ccd35cc762e84c-35-abfe50a7bdba705c80e1ed56.o139220438
     25     190    3284 crew-8998d30364ccd35cc762e84c-36-59116ea29582de55b1eb0e63.o139177094
     25     190    3283 crew-8998d30364ccd35cc762e84c-36-7a09b8e6f4269ad72de033ea.o139220439
     25     190    3278 crew-8998d30364ccd35cc762e84c-3-7a84b3f759c50c65ed5733a3.o139220410
     25     190    3284 crew-8998d30364ccd35cc762e84c-37-dd3e348efd5c4846485bd4f9.o139220440
     25     190    3285 crew-8998d30364ccd35cc762e84c-37-f3eda0a1c7c8fc054ea7b618.o139177095
     25     190    3280 crew-8998d30364ccd35cc762e84c-38-68d9d126ad5482d5a80351c0.o139177096
     25     190    3283 crew-8998d30364ccd35cc762e84c-38-b17529c1465f725b1a3c7a3c.o139220441
     25     190    3284 crew-8998d30364ccd35cc762e84c-39-a4ada26581e216545b714d5f.o139220442
     25     190    3284 crew-8998d30364ccd35cc762e84c-39-e633bbb28c42c2064d76e85e.o139177097
     25     190    3280 crew-8998d30364ccd35cc762e84c-3-a03fcc41dbf62c8f8346594f.o139177050
  27088  179243 1111690 crew-8998d30364ccd35cc762e84c-40-b29bdd4ab6340be05c7ae32f.o139177099
     25     190    3280 crew-8998d30364ccd35cc762e84c-41-3bd09ee778090499c3ebc505.o139177098
     25     190    3282 crew-8998d30364ccd35cc762e84c-41-98b3c874f36079fd73c4928d.o139220443
     25     190    3295 crew-8998d30364ccd35cc762e84c-42-ee0fdc1c4cd403f2bef1595d.o139177100
     25     190    3283 crew-8998d30364ccd35cc762e84c-42-f3c91961920bbf4d3c88e6bb.o139220447
     25     190    3285 crew-8998d30364ccd35cc762e84c-43-849fa3a104d36704dc44a6d6.o139177101
     25     190    3285 crew-8998d30364ccd35cc762e84c-43-de29a6e9228cd12e1b9821a9.o139220444
     25     190    3283 crew-8998d30364ccd35cc762e84c-44-d7d90e29c410862c8b562773.o139220449
     25     190    3297 crew-8998d30364ccd35cc762e84c-44-e2ecc3e779e64282adfe253d.o139177102
     25     190    3282 crew-8998d30364ccd35cc762e84c-45-85ddc88d3593a0711c1d8cf3.o139177103
     25     190    3282 crew-8998d30364ccd35cc762e84c-45-c3e5c384fd231548453745d6.o139220448
     25     190    3281 crew-8998d30364ccd35cc762e84c-46-6b099107f966d8c872d5d283.o139220450
     25     190    3282 crew-8998d30364ccd35cc762e84c-46-73f041a0c7cce3c66ac151a2.o139177104
     25     190    3284 crew-8998d30364ccd35cc762e84c-47-03230613b97fc8aee4becf61.o139177107
     25     190    3282 crew-8998d30364ccd35cc762e84c-47-08bdc23cbda9df41c914119d.o139220452
  25146  166368 1032619 crew-8998d30364ccd35cc762e84c-4-7d506decd40c9e13dce4dd16.o139177051
     25     190    3282 crew-8998d30364ccd35cc762e84c-48-c6c6ea14cd0b6a7bfa66e6a5.o139177105
     25     190    3282 crew-8998d30364ccd35cc762e84c-48-d499acd5cb258458b3f5a476.o139220454
     25     190    3285 crew-8998d30364ccd35cc762e84c-49-3c2ba8c4c83a4847652d5cf0.o139220455
     25     190    3281 crew-8998d30364ccd35cc762e84c-49-c9aeafed5459fa8189a910de.o139177106
  28024  185325 1148491 crew-8998d30364ccd35cc762e84c-50-763375b0786efa103f2267ce.o139177108
     25     190    3283 crew-8998d30364ccd35cc762e84c-5-2c1a5ba67276cad280014417.o139177054
     25     190    3281 crew-8998d30364ccd35cc762e84c-5-625ffaeeac63bdd91c35fba6.o139220411
     25     190    3279 crew-8998d30364ccd35cc762e84c-6-3574140edf62e1123126b4dd.o139177053
     25     190    3282 crew-8998d30364ccd35cc762e84c-6-9215cdf414f5968db97072c1.o139220412
     25     190    3281 crew-8998d30364ccd35cc762e84c-7-2e7d3e8521786c1f96c23494.o139220413
     25     190    3283 crew-8998d30364ccd35cc762e84c-7-7f1f7b579124194a2946792b.o139177055
     25     190    3282 crew-8998d30364ccd35cc762e84c-8-00c0afdd491f0dbfb7628ddc.o139177067
     25     190    3279 crew-8998d30364ccd35cc762e84c-8-1a8cd9ce9818f638310e760a.o139220415
     25     190    3281 crew-8998d30364ccd35cc762e84c-9-37d14db408ca301e1190b8fe.o139220416
     25     190    3282 crew-8998d30364ccd35cc762e84c-9-e478e8ecd0ee89b8f3b502fd.o139177066
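
A quick way to split a listing like this into the two groups, rather than scanning it by eye, is to filter on line count. This is only a minimal sketch: the function name `short_logs` and the 100-line cutoff are illustrative assumptions, chosen because the short logs here have 25 lines and the long ones have roughly 25,000.

```shell
# List log files that have fewer than a given number of lines.
# Usage: short_logs THRESHOLD FILE...
short_logs() {
  threshold="$1"
  shift
  for file in "$@"; do
    # Skip anything that is not a regular file (e.g. an unmatched glob).
    [ -f "$file" ] || continue
    # Arithmetic expansion strips any padding that wc adds on some systems.
    lines=$(($(wc -l < "$file")))
    if [ "$lines" -lt "$threshold" ]; then
      printf '%s\n' "$file"
    fi
  done
}
```

Run from the log directory, `short_logs 100 *.o* | wc -l` would then give the number of short logs.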

The logs with ~25000 lines are successfully running Stan models:

$ tail crew-8998d30364ccd35cc762e84c-50-763375b0786efa103f2267ce.o139177108
Chain 2: Iteration: 2400 / 8000 [ 30%]  (Warmup)
Chain 1: Iteration: 3200 / 8000 [ 40%]  (Warmup)
Chain 3: Iteration: 3200 / 8000 [ 40%]  (Warmup)
Chain 2: Iteration: 3200 / 8000 [ 40%]  (Warmup)
Chain 1: Iteration: 4000 / 8000 [ 50%]  (Warmup)
Chain 1: Iteration: 4001 / 8000 [ 50%]  (Sampling)
Chain 3: Iteration: 4000 / 8000 [ 50%]  (Warmup)
Chain 3: Iteration: 4001 / 8000 [ 50%]  (Sampling)
Chain 2: Iteration: 4000 / 8000 [ 50%]  (Warmup)
Chain 2: Iteration: 4001 / 8000 [ 50%]  (Sampling)

But the ones with few lines are having trouble dialing in:

$ cat crew-8998d30364ccd35cc762e84c-9-e478e8ecd0ee89b8f3b502fd.o139177066
crew::crew_worker(settings = list(url = "wss://172.17.0.41:37348/9/e478e8ecd0ee89b8f3b502fd", autoexit = 15L, cleanup = 1L, output = TRUE, maxtasks = Inf, idletime = 120000, walltime = Inf, timerstart = 0L, tls = c("-----BEGIN CERTIFICATE-----...CENSORED...\n-----END CERTIFICATE-----\n", ""), rs = c(10407L, -558776626L, 379657431L, -2083588772L, 1234723516L, 1102356427L, -1260765952L)), launcher = "8998d30364ccd35cc762e84c", worker = 9L, instance = "e478e8ecd0ee89b8f3b502fd")
Error in dial(sock, url = url, autostart = asyncdial || NA, tls = tls,  : 
  5 | Timed out
Calls: <Anonymous> ... do.call -> <Anonymous> -> dial_and_sync_socket -> dial
Execution halted
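
To check how many worker instances failed this way, one option is to search every log for the dial error. The sketch below assumes the failure always shows up as an `Error in dial(` line together with the literal text `Timed out`, as in the log above; the function name `count_dial_timeouts` is hypothetical.

```shell
# Count log files that contain a dial timeout like the one shown above.
# Usage: count_dial_timeouts FILE...
count_dial_timeouts() {
  count=0
  for file in "$@"; do
    [ -f "$file" ] || continue
    # Require both the dial() error and the timeout message in the same file.
    # -F treats the patterns as fixed strings, so "(" needs no escaping.
    if grep -qF 'Error in dial(' "$file" && grep -qF 'Timed out' "$file"; then
      count=$((count + 1))
    fi
  done
  printf '%s\n' "$count"
}
```

For example, `count_dial_timeouts *.o*` in the log directory prints a single number that can be compared against the count of short logs.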

I do need to implement #178, but I find it hard to believe this would be a memory issue. Maybe it's because I'm running nanonext_1.1.1.9015 and mirai_1.1.1.9012? I will upgrade and try again. Hopefully the resources on the cluster will still be available.

Session info:

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /lrlhps/apps/intel/intel-2020/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so;  LAPACK version 3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Indiana/Indianapolis
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] tarchetypes_0.9.1.9000 targets_1.7.1.9005    

loaded via a namespace (and not attached):
 [1] base64url_1.4           dplyr_1.1.4             compiler_4.3.2         
 [4] renv_1.0.3              promises_1.2.1          tidyselect_1.2.0       
 [7] Rcpp_1.0.12             xml2_1.3.5              parallel_4.3.2         
[10] callr_3.7.3             later_1.3.1             yaml_2.3.7             
[13] R6_2.5.1                generics_0.1.3          igraph_2.0.2           
[16] nanonext_1.1.1.9015     knitr_1.45              backports_1.4.1        
[19] tibble_3.2.1            mirai_1.1.1.9012        pillar_1.9.0           
[22] rlang_1.1.2             utf8_1.2.4              xfun_0.41              
[25] fs_1.6.3                cli_3.6.1               withr_2.5.2            
[28] magrittr_2.0.3          ps_1.7.5                processx_3.8.2         
[31] crew_0.9.5.9007         secretbase_0.5.0        lifecycle_1.0.4        
[34] vctrs_0.6.4             glue_1.6.2              data.table_1.14.8      
[37] codetools_0.2-19        getip_0.1-3             fansi_1.0.5            
[40] crew.cluster_0.3.2.9004 tools_4.3.2             pkgconfig_2.0.3

FYI @shikokuchuo, @multimeric

wlandau · 2024-08-04T16:54:43Z · Maintainer Author

Looks like upgrading to mirai 1.1.1.9014 and nanonext version 1.1.1.9016 solved the original worker timeout problem, at least initially. Let's see if the pipeline can complete over the next few days.

This might not be a perfect test case of #176 because each target only saves a few KB of data, but maybe a next test could be a version that saves entire brms models.

4 replies

multimeric Aug 4, 2024

Does 1.1.1.9X have substantial changes from 1.1.1?

shikokuchuo Aug 5, 2024

Yes, it will be a minor point release when it happens. Nothing I can see that changes memory handling though.

wlandau Aug 5, 2024
Maintainer Author

FYI that pipeline of mine completed successfully. I am rerunning a different version which saves and loads brms model objects on parallel workers. Each of these objects is about 200 MB in memory and 45 MB on disk (compressed with qs). I am about 870 targets into the pipeline, far enough that I would have seen a memory issue if it was going to happen. So far, dispatcher memory never exceeded 230 MB, and client memory never exceeded 463 MB. Median dispatcher memory is 200 MB. So I don't think I can reproduce #176.

wlandau Aug 7, 2024
Maintainer Author

> I am rerunning a different version which saves and loads brms model objects on parallel workers.

That pipeline is just like https://github.com/openpharma/brms.mmrm/tree/main/vignettes/sbc, but with:

list(
  tar_map(
    values = scenarios,
    names = tidyselect::any_of("name"),
    tar_target(prior, setup_prior(scenario)),
    tar_target(rep, seq_len(1000L)),
    tar_target(
      model1,
      run_simulation(
        scenario = scenario,
        prior = prior,
        chains = 3L,
        warmup = 4000L,
        iter = 8000L
      ),
      pattern = map(rep)
    ),
    tar_target(model2, model1, pattern = map(model1)),
    tar_target(model3, model2, pattern = map(model2))
  )
)

Each of 6 scenarios looks like this:

The whole pipeline has 18012 targets, and 18000 of those targets are about 200 MB in memory each. Here is what the memory over time looked like:

Notably, the mirai dispatcher memory stayed constant, so it seems like the workers were saving and loading the brms objects. The isolated callr process memory may have been increasing because of all the dynamic branches added to the pipeline. That could explain the increase even if no brms objects were loaded into memory.
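
For reference, the 18012 figure is consistent with the pipeline definition above if each of the 6 scenarios contributes 2 ordinary targets (`prior` and `rep`) plus 1000 branches for each of `model1`, `model2` and `model3`. The helper below is just a sketch of that arithmetic; the name `target_count` is hypothetical, and the assumption that the total counts stems plus branches is inferred from the numbers rather than stated in the thread.

```shell
# Total targets = scenarios * (stem targets + dynamic targets * branches each).
# Usage: target_count SCENARIOS REPS STEMS DYNAMIC
target_count() {
  scenarios="$1"
  reps="$2"
  stems="$3"
  dynamic="$4"
  echo $((scenarios * (stems + dynamic * reps)))
}

# 6 scenarios, 1000 reps, 2 stems (prior, rep), 3 dynamic targets (model1-3).
target_count 6 1000 2 3  # prints 18012
```

Of those, 6 × 3 × 1000 = 18000 are the model branches.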
