Commit 18d7d0c

Batch files per computer (#240 & #287)
1 parent f28faa7 commit 18d7d0c

19 files changed

+457
-863
lines changed

docs/documentation/running.md

Lines changed: 32 additions & 41 deletions
Original file line number | Diff line number | Diff line change
@@ -2,7 +2,31 @@
22

33
MFC can be run using `mfc.sh`'s `run` command.
44
It supports both interactive and batch execution, the latter being designed for multi-socket systems, namely supercomputers, equipped with a scheduler such as PBS, SLURM, and LSF.
5-
A full (and updated) list of available arguments can be acquired with `./mfc.sh run -h`.
5+
A full (and updated) list of available arguments can be acquired with `./mfc.sh run -h`.
6+
7+
MFC supports running simulations locally (Linux, macOS, and Windows) as well as
8+
on several supercomputer clusters, both interactively and through batch submission.
9+
10+
> [!IMPORTANT]
11+
> Running simulations locally should work out of the box. On supported clusters,
12+
> you can append `-c <computer name>` on the command line to instruct the MFC toolchain
13+
> to make use of the template file `toolchain/templates/<computer name>.mako`. You can
14+
> browse that directory and contribute your own files. Since systems and their schedulers
15+
> do not have a standardized syntax to request certain resources, MFC can only provide
16+
> support for a restricted subset of common or user-contributed configuration options.
17+
>
18+
> Adding a new template file or modifying an existing one will most likely be required if:
19+
> - You are on a cluster that does not have a template yet.
20+
> - Your cluster is configured with SLURM but fails when interactive jobs are
21+
> launched with `mpirun`.
22+
> - Something in the existing default or computer template file is incompatible with
23+
> your system or does not provide a feature you need.
24+
>
25+
> If `-c <computer name>` is left unspecified, it defaults to `-c default`.
26+
27+
Additional flags can be appended to the MPI executable call using the `-f` (i.e., `--flags`) option.
28+
29+
Please refer to `./mfc.sh run -h` for a complete list of arguments and options, along with their defaults.
630

731
## Interactive Execution
832

@@ -32,24 +56,16 @@ using 4 cores:
3256
$ ./mfc.sh run examples/2D_shockbubble/case.py -t simulation post_process -n 4
3357
```
3458

35-
On some computer clusters, MFC might select the wrong MPI program to execute your application
36-
because it uses a general heuristic for its selection. Notably, `srun` is known to fail on some SLURM
37-
systems when using GPUs or MPI implementations from different vendors, whereas `mpirun` functions properly. To override and manually specify which
38-
MPI program you wish to run your application with, please use the `-b <program name>` option (i.e `--binary`).
39-
40-
Additional flags can be appended to the MPI executable call using the `-f` (i.e `--flags`) option.
41-
42-
Please refer to `./mfc.sh run -h` for a complete list of arguments and options, along with their defaults.
43-
4459
## Batch Execution
4560

4661
The MFC detects which scheduler your system is using and handles the creation and execution of batch scripts.
47-
The batch engine is requested with the `-e batch` option.
48-
Whereas the interactive engine can execute all of MFC's codes in succession, the batch engine requires you to only specify one target with the `-t` option.
49-
The number of nodes and GPUs can, respectively be specified with the `-N` (i.e `--nodes`) and `-g` (i.e `--gpus-per-node`) options.
62+
The batch engine is requested via the `-e batch` option.
63+
The number of nodes can be specified with the `-N` (i.e., `--nodes`) option.
64+
65+
We provide a list of (baked-in) submission batch scripts in the `toolchain/templates` folder.
5066

5167
```console
52-
$ ./mfc.sh run examples/2D_shockbubble/case.py -e batch -N 2 -n 4 -g 4 -t simulation
68+
$ ./mfc.sh run examples/2D_shockbubble/case.py -e batch -N 2 -n 4 -t simulation -c <computer name>
5369
```
5470

5571
Other useful arguments include:
@@ -60,26 +76,8 @@ Other useful arguments include:
6076
- `-a <account name>` to identify the account to be charged for the job. (i.e `--account`)
6177
- `-p <partition name>` to select the job's partition. (i.e `--partition`)
6278

63-
Since some schedulers don't have a standardized syntax to request certain resources, MFC can only provide support for a restricted subset of common configuration options.
64-
If MFC fails to execute on your system, or if you wish to adjust how the program runs and resources are requested to be allocated, you are invited to modify the template batch script for your queue system.
65-
Upon execution of `./mfc.sh run`, MFC fills in the template with runtime parameters, to generate the batch file it will submit.
66-
These files are located in the [templates](https://github.com/MFlowCode/MFC/tree/master/toolchain/templates/) directory.
67-
To request GPUs, modification of the template will be required on most systems.
68-
69-
- Lines that begin with `#>` are ignored and won't figure in the final batch script, not even as a comment.
70-
71-
- Statements of the form `${expression}` are string-replaced to provide runtime parameters, most notably execution options.
72-
You can perform therein any Python operation recognized by the built-in `expr()` function.
73-
7479
As an example, one might request GPUs on a SLURM system using the following:
7580

76-
```
77-
#SBATCH --gpus=v100-32:{gpus_per_node*nodes}
78-
```
79-
80-
- Statements of the form `{MFC::expression}` tell MFC where to place the common code, across all batch files, that is required for proper execution.
81-
They are not intended to be modified by users.
82-
8381
**Disclaimer**: IBM's JSRUN on LSF-managed computers does not use the traditional node-based approach to
8482
allocate resources. Therefore, the MFC constructs equivalent resource-sets in task and GPU count.
8583

@@ -173,13 +171,6 @@ $ ./mfc.sh run examples/1D_vacuum_restart/restart_case.py -t post_process
173171
- Oak Ridge National Laboratory's [Summit](https://www.olcf.ornl.gov/summit/):
174172

175173
```console
176-
$ ./mfc.sh run examples/2D_shockbubble/case.py -e batch \
177-
-N 2 -n 4 -g 4 -t simulation -a <redacted>
178-
```
179-
180-
- University of California, San Diego's [Expanse](https://www.sdsc.edu/services/hpc/expanse/):
181-
182-
```console
183-
$ ./mfc.sh run examples/2D_shockbubble/case.py -e batch -p GPU -t simulation \
184-
-N 2 -n 8 -g 8 -f="--gpus=v100-32:16" -b mpirun –w 00:30:00
174+
$ ./mfc.sh run examples/2D_shockbubble/case.py -e batch \
175+
-N 2 -n 4 -t simulation -a <redacted> -c summit
185176
```
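
Reviewer note: the documentation change above says that `-c <computer name>` selects `toolchain/templates/<computer name>.mako`, and the new help text in `args.py` says the value may also be a path to a custom template. A minimal sketch of how such a lookup could work is given below. The helper name `resolve_template` and its fallback order are assumptions made for illustration; this commit view does not show the toolchain's actual lookup code.

```python
import os


def resolve_template(computer: str, template_dir: str) -> str:
    """Return the template file selected by a `-c <computer>` value.

    Hypothetical helper (not taken from the MFC toolchain): a baked-in name
    maps to <template_dir>/<name>.mako, and any other value is treated as a
    path to a custom template file.
    """
    baked = os.path.join(template_dir, f"{computer}.mako")
    if os.path.isfile(baked):
        return baked

    if os.path.isfile(computer):
        return os.path.abspath(computer)

    raise FileNotFoundError(f"No template found for '{computer}'.")
```

With this layout, `resolve_template("default", "toolchain/templates")` would pick the default template when `-c` is left unspecified, which matches the documented default of `-c default`.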

toolchain/mfc/args.py

Lines changed: 28 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,11 @@
11
import re, os.path, argparse, dataclasses
22

3-
from .build import TARGETS, DEFAULT_TARGETS, DEPENDENCY_TARGETS
4-
from .common import MFCException, format_list_to_string
5-
from .test.cases import generate_cases
6-
from .run.engines import ENGINES
7-
from .run.mpi_bins import BINARIES
3+
from .run.run import get_baked_templates
4+
from .build import TARGETS, DEFAULT_TARGETS, DEPENDENCY_TARGETS
5+
from .common import MFCException, format_list_to_string
6+
from .test.cases import generate_cases
87

9-
# pylint: disable=too-many-locals, too-many-statements
8+
# pylint: disable=too-many-locals, too-many-branches, too-many-statements
109
def parse(config):
1110
parser = argparse.ArgumentParser(
1211
prog="./mfc.sh",
@@ -75,8 +74,6 @@ def add_common_arguments(p, mask = None):
7574
# === CLEAN ===
7675
add_common_arguments(clean, "jg")
7776

78-
binaries = [ b.bin for b in BINARIES ]
79-
8077
# === TEST ===
8178
test_cases = generate_cases()
8279

@@ -100,29 +97,28 @@ def add_common_arguments(p, mask = None):
10097
test.add_argument(metavar="FORWARDED", default=[], dest="--", nargs="*", help="Arguments to forward to the ./mfc.sh run invocations.")
10198

10299
# === RUN ===
103-
engines = [ e.slug for e in ENGINES ]
104-
105100
add_common_arguments(run)
106-
run.add_argument("input", metavar="INPUT", type=str, help="Input file to run.")
107-
run.add_argument("arguments", metavar="ARGUMENTS", nargs="*", type=str, default=[], help="Additional arguments to pass to the case file.")
108-
run.add_argument("-e", "--engine", choices=engines, type=str, default=engines[0], help="Job execution/submission engine choice.")
101+
run.add_argument("input", metavar="INPUT", type=str, help="Input file to run.")
102+
run.add_argument("arguments", metavar="ARGUMENTS", nargs="*", type=str, default=[], help="Additional positional arguments to pass to the case file.")
103+
run.add_argument("-e", "--engine", choices=["interactive", "batch"], type=str, default="interactive", help="Job execution/submission engine choice.")
109104
run.add_argument("--output-summary", type=str, default=None, help="(Interactive) Output a YAML summary file.")
110-
run.add_argument("-p", "--partition", metavar="PARTITION", type=str, default="", help="(Batch) Partition for job submission.")
111-
run.add_argument("-N", "--nodes", metavar="NODES", type=int, default=1, help="(Batch) Number of nodes.")
112-
run.add_argument("-n", "--tasks-per-node", metavar="TASKS", type=int, default=1, help="Number of tasks per node.")
113-
run.add_argument("-w", "--walltime", metavar="WALLTIME", type=str, default="01:00:00", help="(Batch) Walltime.")
114-
run.add_argument("-a", "--account", metavar="ACCOUNT", type=str, default="", help="(Batch) Account to charge.")
115-
run.add_argument("-@", "--email", metavar="EMAIL", type=str, default="", help="(Batch) Email for job notification.")
116-
run.add_argument("-#", "--name", metavar="NAME", type=str, default="MFC", help="(Batch) Job name.")
117-
run.add_argument("-b", "--binary", choices=binaries, type=str, default=None, help="(Interactive) Override MPI execution binary")
118-
run.add_argument("-s", "--scratch", action="store_true", default=False, help="Build from scratch.")
119-
run.add_argument("--ncu", nargs=argparse.REMAINDER, type=str, help="Profile with NVIDIA Nsight Compute.")
120-
run.add_argument("--nsys", nargs=argparse.REMAINDER, type=str, help="Profile with NVIDIA Nsight Systems.")
121-
run.add_argument( "--dry-run", action="store_true", default=False, help="(Batch) Run without submitting batch file.")
122-
run.add_argument("--case-optimization", action="store_true", default=False, help="(GPU Optimization) Compile MFC targets with some case parameters hard-coded.")
123-
run.add_argument( "--no-build", action="store_true", default=False, help="(Testing) Do not rebuild MFC.")
124-
run.add_argument("--wait", action="store_true", default=False, help="(Batch) Wait for the job to finish.")
125-
run.add_argument("-f", "--flags", metavar="FLAGS", dest="--", nargs=argparse.REMAINDER, type=str, default=[], help="(Interactive) Arguments to forward to the MPI invocation.")
105+
run.add_argument("-p", "--partition", metavar="PARTITION", type=str, default="", help="(Batch) Partition for job submission.")
106+
run.add_argument("-q", "--quality_of_service", metavar="QOS", type=str, default="", help="(Batch) Quality of Service for job submission.")
107+
run.add_argument("-N", "--nodes", metavar="NODES", type=int, default=1, help="(Batch) Number of nodes.")
108+
run.add_argument("-n", "--tasks-per-node", metavar="TASKS", type=int, default=1, help="Number of tasks per node.")
109+
run.add_argument("-w", "--walltime", metavar="WALLTIME", type=str, default="01:00:00", help="(Batch) Walltime.")
110+
run.add_argument("-a", "--account", metavar="ACCOUNT", type=str, default="", help="(Batch) Account to charge.")
111+
run.add_argument("-@", "--email", metavar="EMAIL", type=str, default="", help="(Batch) Email for job notification.")
112+
run.add_argument("-#", "--name", metavar="NAME", type=str, default="MFC", help="(Batch) Job name.")
113+
run.add_argument("-s", "--scratch", action="store_true", default=False, help="Build from scratch.")
114+
run.add_argument("--ncu", nargs=argparse.REMAINDER, type=str, help="Profile with NVIDIA Nsight Compute.")
115+
run.add_argument("--nsys", nargs=argparse.REMAINDER, type=str, help="Profile with NVIDIA Nsight Systems.")
116+
run.add_argument( "--dry-run", action="store_true", default=False, help="(Batch) Run without submitting batch file.")
117+
run.add_argument("--case-optimization", action="store_true", default=False, help="(GPU Optimization) Compile MFC targets with some case parameters hard-coded.")
118+
run.add_argument( "--no-build", action="store_true", default=False, help="(Testing) Do not rebuild MFC.")
119+
run.add_argument("--wait", action="store_true", default=False, help="(Batch) Wait for the job to finish.")
120+
run.add_argument("-f", "--flags", metavar="FLAGS", dest="--", nargs=argparse.REMAINDER, type=str, default=[], help="(Interactive) Arguments to forward to the MPI invocation.")
121+
run.add_argument("-c", "--computer", metavar="COMPUTER", type=str, default="default", help=f"(Batch) Path to a custom submission file template or one of {format_list_to_string(list(get_baked_templates().keys()))}.")
126122

127123
# === BENCH ===
128124
add_common_arguments(bench, "t")
@@ -153,10 +149,11 @@ def add_common_arguments(p, mask = None):
153149
# "Slugify" the name of the job
154150
args["name"] = re.sub(r'[\W_]+', '-', args["name"])
155151

156-
# build's --case-optimization and --input depend on each other
152+
# We need to check for some invalid combinations of arguments because of
153+
# the limitations of argparse.
157154
if args["command"] == "build":
158155
if (args["input"] is not None) ^ args["case_optimization"] :
159-
raise MFCException("./mfc.sh build's --case-optimization requires --input")
156+
raise MFCException("./mfc.sh build's --case-optimization and --input must be used together.")
160157

161158
# Input files to absolute paths
162159
for e in ["input", "input1", "input2"]:
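
Reviewer note: to make the effect of the `args.py` changes easier to check, the sketch below rebuilds a small subset of the `run` options from the diff above (same flags and defaults), then applies the job-name slug substitution and the XOR check on `--input` and `--case-optimization`. The function names `build_run_parser`, `slugify`, and `check_build_args` are illustrative and do not exist in the toolchain, and `ValueError` stands in for `MFCException`.

```python
import argparse
import re


def build_run_parser() -> argparse.ArgumentParser:
    # Subset of the `run` options from the diff; flags and defaults are copied from it.
    parser = argparse.ArgumentParser(prog="./mfc.sh run")
    parser.add_argument("input", metavar="INPUT", type=str)
    parser.add_argument("-e", "--engine", choices=["interactive", "batch"],
                        type=str, default="interactive")
    parser.add_argument("-N", "--nodes", metavar="NODES", type=int, default=1)
    parser.add_argument("-n", "--tasks-per-node", metavar="TASKS", type=int, default=1)
    parser.add_argument("-#", "--name", metavar="NAME", type=str, default="MFC")
    parser.add_argument("-c", "--computer", metavar="COMPUTER", type=str, default="default")
    return parser


def slugify(name: str) -> str:
    # Same substitution as in parse(): runs of non-word characters
    # and underscores collapse into a single '-'.
    return re.sub(r'[\W_]+', '-', name)


def check_build_args(input_path, case_optimization: bool) -> None:
    # XOR is true when exactly one of the two is set, which is the invalid case.
    if (input_path is not None) ^ case_optimization:
        raise ValueError("--case-optimization and --input must be used together.")


# Parse the batch command line shown in the updated documentation.
args = vars(build_run_parser().parse_args(
    ["examples/2D_shockbubble/case.py", "-e", "batch", "-N", "2", "-n", "4", "-c", "summit"]
))
args["name"] = slugify(args["name"])
```

Because the engine choices are now a literal list, `-e` no longer depends on the removed `ENGINES` import, and any `-c` string is accepted at parse time since the option has no `choices` restriction.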

toolchain/mfc/build.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -101,6 +101,8 @@ def is_buildable(self) -> bool:
101101
return True
102102

103103
def configure(self):
104+
input.load({}).generate_fpp(self)
105+
104106
build_dirpath = self.get_build_dirpath()
105107
cmake_dirpath = self.get_cmake_dirpath()
106108
install_dirpath = self.get_install_dirpath()

toolchain/mfc/common.py

Lines changed: 2 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -5,9 +5,10 @@
55
from .printer import cons
66

77

8-
MFC_ROOTDIR = normpath(f"{dirname(realpath(__file__))}/../..")
8+
MFC_ROOTDIR = abspath(normpath(f"{dirname(realpath(__file__))}/../.."))
99
MFC_TESTDIR = abspath(f"{MFC_ROOTDIR}/tests")
1010
MFC_SUBDIR = abspath(f"{MFC_ROOTDIR}/build")
11+
MFC_TEMPLATEDIR = abspath(f"{MFC_ROOTDIR}/toolchain/templates")
1112
MFC_LOCK_FILEPATH = abspath(f"{MFC_SUBDIR}/lock.yaml")
1213
MFC_BENCH_FILEPATH = abspath(f"{MFC_ROOTDIR}/toolchain/bench.yaml")
1314

@@ -179,14 +180,6 @@ def does_system_use_modules() -> bool:
179180
return does_command_exist("module")
180181

181182

182-
def get_loaded_modules() -> typing.List[str]:
183-
"""
184-
Returns a list of loaded modules.
185-
"""
186-
187-
return [ l for l in subprocess.getoutput("module -t list").splitlines() if ' ' not in l ]
188-
189-
190183
def is_number(x: str) -> bool:
191184
if x is None:
192185
return False
