Commit 4590e35

use the better functional approach
1 parent fddd050 commit 4590e35

File tree: 1 file changed, +15 −19 lines


vignettes/HPC-computing.Rmd

Lines changed: 15 additions & 19 deletions
@@ -347,7 +347,7 @@ You should now consider moving this `"final_sim.rds"` off the Slurm landing node
 
 # Array jobs and multicore computing simultaneously
 
-Of course, nothing really stops you from mixing and matching the above ideas related to multicore computing and array jobs on Slurm and other HPC clusters. For example, if you wanted to take the original `design` object and submit batches of these instead (e.g., submit one or more rows of the `design` object as an array job), where within each batch multicore processing is requested, then something like the following would work just fine:
+Of course, nothing really stops you from mixing and matching the above ideas related to multicore computing and array jobs on Slurm and other HPC clusters. For example, if you wanted to take the original `design` object and submit batches of these instead (e.g., submit one or more rows of the `design` object as an array job), where within each array multicore processing is requested, then something like the following would work just fine:
 
 ```
 #!/bin/bash
@@ -400,30 +400,26 @@ multirow <- FALSE # submit multiple rows of Design object to array?
 if(multirow){
     # If selecting multiple design rows per array, such as the first 3 rows,
     # then next 3 rows, and so on, something like the following would work
-    s <- c(seq(from=1, to=nrow(Design), by=3), nrow(Design)+1L)
 
-    ## For arrayID=1, rows2pick is c(1,2,3); for arrayID2, rows2pick is c(4,5,6)
-    rows2pick <- s[arrayID]:(s[arrayID + 1] - 1)
-    filename <- paste0('mysim-', paste0(rows2pick, collapse=''))
+    ## For arrayID=1, rows 1 through 3 are evaluated
+    ## For arrayID=2, rows 4 through 6 are evaluated
+    ## For arrayID=3, rows 7 through 9 are evaluated
+    array2row <- function(arrayID) 1:3 + 3 * (arrayID-1)
 } else {
-    # otherwise, submit each row independently across array
-    rows2pick <- arrayID
-    filename <- paste0('mysim-', rows2pick)
+    # otherwise, use one row per respective arrayID
+    array2row <- function(arrayID) arrayID
 }
 
-# Make sure parallel=TRUE flag is on! Also, it's important to change the computer
-# name to something unique to the array job to avoid overwriting files (even temporary ones)
-runSimulation(design=Design[rows2pick, ], replications=10000,
-              generate=Generate, analyse=Analyse, summarise=Summarise,
-              parallel=TRUE, filename=filename,
-              save_details=list(compname=paste0('array-', arrayID)))
+# Make sure the parallel=TRUE flag is on to use all available cores!
+runArraySimulation(design=Design, replications=10000,
+                   generate=Generate, analyse=Analyse, summarise=Summarise,
+                   iseed=iseed, dirname='mysimfiles', filename='mysim',
+                   parallel=TRUE, arrayID=arrayID, array2row=array2row)
 ```
 
-When complete, the function `SimCollect()` can again be used to put the simulation results together given the nine saved files (or three, if `multirow` is `TRUE` and `#SBATCH --array=1-3` were used instead).
+When complete, the function `SimCollect()` can again be used to put the simulation results together given the nine saved files (nine files would also be saved were `multirow` set to `TRUE` and `#SBATCH --array=1-3` used instead, as results are stored on a per-row basis).
 
-This type of hybrid approach is a middle ground between submitting the complete job (top of this vignette) and the `condition` + `replication` distributed load in the previous section, though has similar overhead + inefficiency issues as before (though less so, as the `array` jobs are evaluated independently). Moreover, if the row's take very different amounts of time to evaluate then this strategy can prove inefficient (e.g., the first two rows may take 2 hours to complete, while the third row may take 12 hours to complete; hence, the complete simulation results would not be available until the most demanding simulation conditions are returned!). Nevertheless, for moderate intensity simulations the above approach can be sufficient as each (batch) of simulation conditions can be evaluated independently across each `array` on the HPC cluster.
-
-For more intense simulations, particularly those prone to time outs or other exhausted resources, the `runArraySimulation()` approach remains the recommended approach as the `max_RAM` and `max_time` fail-safes are more naturally accommodated within the replications, the jobs can be explicitly distributed given the anticipated intensity of each simulation condition, and the quality and reproducibility of multiple job submissions is easier to manage (see the FAQ section below).
+This type of hybrid approach is a middle ground between submitting the complete job (top of this vignette) and the `condition` + `replication` distributed load in the previous section, though it has similar overhead + inefficiency issues as before (though less so, as the `array` jobs are evaluated independently). Note that if the rows take very different amounts of time to evaluate then this strategy can prove less efficient (e.g., the first two rows may take 2 hours to complete, while the third row may take 12 hours), in which case a more nuanced `array2row()` function should be defined to help explicitly balance the load on the computing cluster.
 
 # Extra information (FAQs)
 
@@ -443,7 +439,7 @@ scancel -u <username> # cancel all queued and running jobs for a specific user
 
 This issue is important whenever the HPC cluster has mandatory time/RAM limits for the job submissions, where the array job may not complete within the assigned resources --- hence, if not properly managed, will discard any valid replication information when abruptly terminated. Unfortunately, this is a very likely occurrence, and is largely a function of being unsure about how long each simulation condition/replication will take to complete when distributed across the arrays (some conditions/replications will take longer than others, and it is difficult to be perfectly knowledgeable about this information beforehand) or how large the final objects will grow as the simulation progresses.
 
-To avoid this time/resource waste it is **strongly recommended** to add a `max_time` and/or `max_RAM` argument to the `control` list (see `help(runArraySimulation)` for supported specifications), which are less than the Slurm specifications. These control flags will halt the `runArraySimulation()`/`runSimulation()` executions early and return only the complete simulation results up to this point. However, this will only work if these arguments are *non-trivially less than the allocated Slurm resources*; otherwise, you'll run the risk that the job terminates before the `SimDesign` functions have the chance to store the successfully completed replications. Setting these to around 90-95% of the respective `#SBATCH --time=` and `#SBATCH --mem=` inputs should, however, be sufficient in most cases.
+To avoid this time/resource waste it is **strongly recommended** to add a `max_time` and/or `max_RAM` argument to the `control` list (see `help(runArraySimulation)` for supported specifications), which are less than the Slurm specifications. These control flags will halt the `runArraySimulation()` executions early and return only the complete simulation results up to this point. However, this will only work if these arguments are *non-trivially less than the allocated Slurm resources*; otherwise, you'll run the risk that the job terminates before the `SimDesign` functions have the chance to store the successfully completed replications. Setting these to around 90-95% of the respective `#SBATCH --time=` and `#SBATCH --mem=` inputs should, however, be sufficient in most cases.
 
 ```{r eval=FALSE}
 # Return successful results up to the 11 hour mark, and terminate early
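
The functional approach this commit adopts is easy to check in isolation. The sketch below (plain R, no SimDesign dependency) reproduces the `array2row()` batching rule from the patch, plus the kind of hand-tuned variant the closing paragraph alludes to for unbalanced designs; the `batches` list and `array2row_balanced()` name are hypothetical illustrations, not part of the vignette:

```r
# Batching rule from the patch: array n evaluates design rows 3n-2, 3n-1, 3n
array2row <- function(arrayID) 1:3 + 3 * (arrayID - 1)

array2row(1)  # rows 1 2 3
array2row(2)  # rows 4 5 6

# Hypothetical load-balanced variant: suppose row 3 is known to be far more
# expensive than rows 1-2, so it is given an array to itself
batches <- list(1:2, 3)
array2row_balanced <- function(arrayID) batches[[arrayID]]
array2row_balanced(2)  # row 3
```

Whatever mapping is used, the union of `array2row(arrayID)` across all submitted array IDs should cover every row of `Design` exactly once.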
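
The 90-95% guideline in the final hunk can be computed directly from the Slurm request. A minimal sketch in plain R; the helper name `slurm_margin()` is ours for illustration (not part of SimDesign), and it assumes an `HH:MM:SS`-formatted `--time` string:

```r
# Convert an #SBATCH --time=HH:MM:SS request into a slightly smaller
# limit (in seconds) suitable for the max_time control fail-safe
slurm_margin <- function(hms, frac = 0.90){
    parts <- as.numeric(strsplit(hms, ":")[[1]])
    floor(sum(parts * c(3600, 60, 1)) * frac)
}

slurm_margin("12:00:00")        # 38880 seconds (i.e., 10.8 hours)
slurm_margin("12:00:00", 0.95)  # 41040 seconds
```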
