Replies: 4 comments 5 replies
-
That's a strange error for a local controller. What do the log files in … My first recommendation is to update …
-
OK, so I did this and pushed it to one worker, and it worked. I still don't know why it doesn't work within a targets pipeline. I updated all the packages.

```r
collateNodeCounts <- function() {
  target_names <- tar_objects()
  target_names <- target_names[str_detect(target_names, "summariseTripTuples")]
  batch_size <- 10
  result_summariseTripTuples <- NULL

  # Combine a list of data frames and re-aggregate the trip counts
  process_combine <- function(data_frames) {
    bind_rows(data_frames) %>%
      group_by(node_id1, node_id2, weight_category) %>%
      summarise(vehicle_count = 1,
                trips_n = sum(trips_n, na.rm = TRUE),
                .groups = "drop")
  }

  # Split target names into batches
  target_batches <- split(target_names, ceiling(seq_along(target_names) / batch_size))

  # Iterate over each batch
  for (batch in target_batches) {
    # Load all targets in the current batch
    batch_data <- lapply(batch, function(target_name) {
      file_location <- paste0("_targets/objects/", target_name)
      qs::qread(file_location)
    })
    # Combine all data frames in the current batch
    batch_combined <- process_combine(batch_data)
    # Fold the batch into the running result
    if (is.null(result_summariseTripTuples)) {
      result_summariseTripTuples <- batch_combined
    } else {
      result_summariseTripTuples <- process_combine(list(result_summariseTripTuples, batch_combined))
    }
    print(nrow(result_summariseTripTuples))
    rm(batch_data, batch_combined)
    gc()
  }
  return(result_summariseTripTuples)
}
```
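The batch-then-recombine approach works because summing trip counts is associative: summarising each batch and then summarising the partial results gives the same totals as one big summarise. A toy check of that idiom (hypothetical data; assumes dplyr and tibble are installed):

```r
library(dplyr)

# Hypothetical trip tuples spread across two "batches"
df <- tibble::tibble(
  node_id1 = c(1, 1, 2, 2, 1),
  node_id2 = c(9, 9, 8, 8, 9),
  weight_category = c("a", "a", "b", "b", "a"),
  trips_n = c(10, 5, 2, 3, 1)
)

summarise_trips <- function(d) {
  d %>%
    group_by(node_id1, node_id2, weight_category) %>%
    summarise(trips_n = sum(trips_n, na.rm = TRUE), .groups = "drop")
}

one_pass <- summarise_trips(df)
# Reduce each batch separately, then reduce the partial results:
two_pass <- summarise_trips(bind_rows(summarise_trips(df[1:3, ]),
                                      summarise_trips(df[4:5, ])))
all.equal(one_pass, two_pass)  # TRUE: same totals either way
```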
```r
controller <- crew::crew_controller_local(
  name = "snapping_test",
  workers = 1,
  launch_max = 5L,
  local_log_directory = "/tmp/crew",
  seconds_idle = 3,
  garbage_collection = TRUE
)
controller$start()
controller$push(packages = c("tidyverse", "targets"),
                name = "summariseTripTuples", command = {
  collateNodeCounts()
})
controller$wait(mode = "all")
task <- controller$pop()
controller$terminate()
crew_clean()
task
```

The number of rows slowly increases towards the number of unique combinations of node1, node2, and weight category. This is better than the 371M rows! I got a data frame with the expected number of results:

```
# A tibble: 1 × 12
  name                command result                   seconds  seed algorithm error trace warnings launcher      worker instance
  <chr>               <chr>   <list>                     <dbl> <int> <chr>     <chr> <chr> <chr>    <chr>          <int> <chr>
1 summariseTripTuples NA      <tibble [2,295,481 × 5]>   1557.    NA NA        NA    NA    NA       snapping_test      1 7edd89e3…
```
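As the output above shows, `pop()` returns a one-row tibble whose `result` column is a list column; the computed data frame is its first element. A small fragment for extracting it (not standalone; assumes the `task` object from the run above):

```r
# The `result` list column holds the value the task computed;
# index into it to get the actual data frame back.
result_df <- task$result[[1]]
nrow(result_df)
```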
-
The default behavior of … You can also use file targets for big data cases like this, in which case your … I am not aware that it is possible to do a "partial gather" in …
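For reference, a file target in {targets} looks roughly like this: the target's command writes the data to disk and returns the path, and `format = "file"` tells {targets} to track the file by hash instead of storing the object in `_targets/objects/`. A sketch for a `_targets.R` script (the path and the `summarise_trips()` helper are hypothetical):

```r
library(targets)

list(
  tar_target(
    trip_counts_file,
    {
      out <- "output/trip_counts.qs"     # hypothetical output path
      qs::qsave(summarise_trips(), out)  # hypothetical summarising helper
      out                                # return the path for tracking
    },
    format = "file"
  )
)
```

This keeps the large aggregate out of the default target store, which can help in big-data cases like the one in this thread.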
-
Pipeline ran successfully. Thanks for all the help.
-
My pipeline does the following:
Last error message:

```
{crew} worker 1 launched 5 times in a row without completing any tasks. Either troubleshoot or raise launch_max above 5.
```
The error message suggests that the memory requirements can be altered, but I don't see how. I'm running this on an AWS EC2 c5a.8xlarge with 32 vCPUs and 64 GB RAM.
This is the controller I'm using:

```r
controller <- crew::crew_controller_local(
  name = "snapping",
  workers = 1,
  launch_max = 10L,
  local_log_directory = "/tmp/crew",
  garbage_collection = TRUE
)
```

Any help to overcome this would be appreciated.
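One way to get more signal when a worker keeps dying is to read the per-worker logs that `local_log_directory` writes (a sketch; the `/tmp/crew` path matches the controller above):

```r
# Print the newest crew worker log, which often shows why the worker
# exited (e.g. an out-of-memory kill) before it completed any task.
logs <- list.files("/tmp/crew", full.names = TRUE)
if (length(logs) > 0) {
  latest <- logs[which.max(file.mtime(logs))]
  writeLines(readLines(latest))
}
```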