vignettes/HPC-computing.Rmd
+15 -19 (15 additions & 19 deletions)
@@ -347,7 +347,7 @@ You should now consider moving this `"final_sim.rds"` off the Slurm landing node
# Array jobs and multicore computing simultaneously
-Of course, nothing really stops you from mixing and matching the above ideas related to multicore computing and array jobs on Slurm and other HPC clusters. For example, if you wanted to take the original `design` object and submit batches of these instead (e.g., submit one or more rows of the `design` object as an array job), where within each batch multicore processing is requested, then something like the following would work just fine:
+Of course, nothing really stops you from mixing and matching the above ideas related to multicore computing and array jobs on Slurm and other HPC clusters. For example, if you wanted to take the original `design` object and submit batches of these instead (e.g., submit one or more rows of the `design` object as an array job), where within each array multicore processing is requested, then something like the following would work just fine:
```
#!/bin/bash
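# (The diff view collapses the remainder of this script; the directives the
# surrounding text alludes to would resemble the following sketch, with all
# values purely illustrative rather than taken from the vignette)
# #SBATCH --array=1-9        # one array element per design row
# #SBATCH --cpus-per-task=8  # cores available for multicore processing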
@@ -400,30 +400,26 @@ multirow <- FALSE # submit multiple rows of Design object to array?
if(multirow){
    # If selecting multiple design rows per array, such as the first 3 rows,
    # then next 3 rows, and so on, something like the following would work
-
    s <- c(seq(from=1, to=nrow(Design), by=3), nrow(Design)+1L)
-
    ## For arrayID=1, rows2pick is c(1,2,3); for arrayID=2, rows2pick is c(4,5,6)
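    # (The diff omits the lines that follow; a plausible sketch of the elided
    # selection step, assuming `arrayID` was obtained earlier via getArrayID(),
    # would be:)
    # rows2pick <- s[arrayID]:(s[arrayID + 1L] - 1L)
    # Design <- Design[rows2pick, , drop=FALSE]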
-When complete, the function `SimCollect()` can again be used to put the simulation results together given the nine saved files (or three, if `multirow` is `TRUE` and `#SBATCH --array=1-3` were used instead).
+When complete, the function `SimCollect()` can again be used to put the simulation results together given the nine saved files (nine files would also be saved were `multirow` set to `TRUE` and `#SBATCH --array=1-3` used instead, as these are stored on a per-row basis).
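As a sketch of what this collection step might look like (the `sim_results/` directory name and the `.rds` file pattern below are assumptions for illustration, not taken from the vignette):

```
library(SimDesign)

# gather the per-array .rds files and stack them into a single result object
files <- dir('sim_results', pattern = '\\.rds$', full.names = TRUE)
Final <- SimCollect(files = files)
Final
```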
-This type of hybrid approach is a middle ground between submitting the complete job (top of this vignette) and the `condition` + `replication` distributed load in the previous section, though has similar overhead + inefficiency issues as before (though less so, as the `array` jobs are evaluated independently). Moreover, if the rows take very different amounts of time to evaluate then this strategy can prove inefficient (e.g., the first two rows may take 2 hours to complete, while the third row may take 12 hours to complete; hence, the complete simulation results would not be available until the most demanding simulation conditions are returned!). Nevertheless, for moderate intensity simulations the above approach can be sufficient as each batch of simulation conditions can be evaluated independently across each `array` on the HPC cluster.
-
-For more intense simulations, particularly those prone to time outs or other exhausted resources, the `runArraySimulation()` approach remains the recommended approach as the `max_RAM` and `max_time` fail-safes are more naturally accommodated within the replications, the jobs can be explicitly distributed given the anticipated intensity of each simulation condition, and the quality and reproducibility of multiple job submissions is easier to manage (see the FAQ section below).
+This type of hybrid approach is a middle ground between submitting the complete job (top of this vignette) and the `condition` + `replication` distributed load in the previous section, though it has similar overhead + inefficiency issues as before (though less so, as the `array` jobs are evaluated independently). Note that if the rows take very different amounts of time to evaluate then this strategy can prove less efficient (e.g., the first two rows may take 2 hours to complete, while the third row may take 12 hours to complete), in which case a more nuanced `array2row()` function should be defined to help explicitly balance the load on the computing cluster.
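To make the `array2row()` idea concrete, a minimal load-balancing sketch might look like the following; the two-fast-rows/one-slow-row split mirrors the hypothetical timings above, and passing the function via an `array2row` argument to `runArraySimulation()` is an assumption to verify against `help(runArraySimulation)`:

```
# give the demanding third row its own array job and evaluate the two
# fast rows together, so both arrays finish in comparable wall time
array2row <- function(arrayID) {
    pick <- list(c(1, 2),  # arrayID = 1 -> the two ~2 hour rows
                 3)        # arrayID = 2 -> the ~12 hour row alone
    pick[[arrayID]]
}

# used with #SBATCH --array=1-2, e.g.
# runArraySimulation(design=Design, ..., arrayID=getArrayID(type='slurm'),
#                    array2row=array2row)
```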
# Extra information (FAQs)
@@ -443,7 +439,7 @@ scancel -u <username> # cancel all queued and running jobs for a specific user
This issue is important whenever the HPC cluster has mandatory time/RAM limits for the job submissions, where the array job may not complete within the assigned resources --- hence, if not properly managed, the job will discard any valid replication information when abruptly terminated. Unfortunately, this is a very likely occurrence, and is largely a function of being unsure about how long each simulation condition/replication will take to complete when distributed across the arrays (some conditions/replications will take longer than others, and it is difficult to be perfectly knowledgeable about this information beforehand) or how large the final objects will grow as the simulation progresses.
-To avoid this time/resource waste it is **strongly recommended** to add a `max_time` and/or `max_RAM` argument to the `control` list (see `help(runArraySimulation)` for supported specifications), which are less than the Slurm specifications. These control flags will halt the `runArraySimulation()`/`runSimulation()` executions early and return only the complete simulation results up to this point. However, this will only work if these arguments are *non-trivially less than the allocated Slurm resources*; otherwise, you'll run the risk that the job terminates before the `SimDesign` functions have the chance to store the successfully completed replications. Setting these to around 90-95% of the respective `#SBATCH --time=` and `#SBATCH --mem=` inputs should, however, be sufficient in most cases.
+To avoid this time/resource waste it is **strongly recommended** to add a `max_time` and/or `max_RAM` argument to the `control` list (see `help(runArraySimulation)` for supported specifications), which are less than the Slurm specifications. These control flags will halt the `runArraySimulation()` executions early and return only the complete simulation results up to this point. However, this will only work if these arguments are *non-trivially less than the allocated Slurm resources*; otherwise, you'll run the risk that the job terminates before the `SimDesign` functions have the chance to store the successfully completed replications. Setting these to around 90-95% of the respective `#SBATCH --time=` and `#SBATCH --mem=` inputs should, however, be sufficient in most cases.
```{r eval=FALSE}
# Return successful results up to the 11 hour mark, and terminate early
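# (The page truncates the remainder of this block; a rough sketch of the call
# it leads into, assuming the Design/Generate/Analyse/Summarise objects,
# iseed, and arrayID defined earlier in the vignette, a 12-hour
# #SBATCH --time allocation, and an illustrative replication count)
res <- runArraySimulation(design=Design, replications=10000,
          generate=Generate, analyse=Analyse, summarise=Summarise,
          iseed=iseed, arrayID=arrayID, filename='mysim',
          control=list(max_time="11:00:00"))
```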