
Commit

GPU-Aware MPI on OLCF Frontier and Combined weak- & strong-scaling case (#448)
henryleberre authored Jun 4, 2024
1 parent daa8e85 commit 31aed98
Showing 30 changed files with 398 additions and 169 deletions.
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -53,6 +53,8 @@ examples/*/viz/
examples/*.jpg
examples/*.png
examples/*/workloads/
examples/*/run-*/
examples/*/logs/
workloads/

benchmarks/*batch/*/
33 changes: 18 additions & 15 deletions docs/documentation/case.md
@@ -30,40 +30,39 @@ This is particularly useful when computations are done in Python to generate the

## (Optional) Accepting command line arguments

Input files can accept **positional** command line arguments, forwarded by `mfc.sh run`.
Consider this example from the 3D_weak_scaling case:
Input files can accept command line arguments, forwarded by `mfc.sh run`.
Consider this example from the `scaling` case:

```python
import argparse

parser = argparse.ArgumentParser(
prog="3D_weak_scaling",
description="This MFC case was created for the purposes of weak scaling.",
prog="scaling",
description="Weak- and strong-scaling benchmark case.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter)

parser.add_argument("dict", type=str, metavar="DICT", help=argparse.SUPPRESS)
parser.add_argument("gbpp", type=int, metavar="MEM", default=16, help="Adjusts the problem size per rank to fit into [MEM] GB of GPU memory")
parser.add_argument("dict", type=str, metavar="DICT")
parser.add_argument("-s", "--scaling", type=str, metavar="SCALING", choices=["weak", "strong"], help="Whether weak- or strong-scaling is being exercised.")

# Your parsed arguments are here
ARGS = vars(parser.parse_args())
args = parser.parse_args()
```

The first argument is always a JSON string representing `mfc.sh run`'s internal
state.
It contains all the runtime information you might want from the build/run system.
We hide it from the help menu with `help=argparse.SUPPRESS` since it is not meant to be passed in by users.
You can add as many additional positional arguments as you may need.
You can add as many additional arguments as you may need.
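As a quick sketch of consuming that forwarded state, a case file can recover the run-system dictionary from the first positional argument like so (the JSON string is mocked here; `nodes` and `tasks_per_node` are keys the `scaling` case relies on, other keys are not enumerated in this document):

```python
import json, argparse

parser = argparse.ArgumentParser(prog="scaling")
parser.add_argument("dict", type=str, metavar="DICT")

# In a real run, `mfc.sh run` supplies this JSON string; it is mocked here.
args = parser.parse_args(['{"nodes": 2, "tasks_per_node": 8}'])

DICT = json.loads(args.dict)
nranks = DICT["nodes"] * DICT["tasks_per_node"]
print(nranks)
```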

To run such a case, use the following format:

```shell
./mfc.sh run <path/to/case.py> <positional arguments> <regular mfc.sh run arguments>
./mfc.sh run <path/to/case.py> <mfc.sh run arguments> -- <case arguments>
```

For example, to run the 3D_weak_scaling case with `gbpp=2`:
For example, to run the `scaling` case in "weak-scaling" mode:

```shell
./mfc.sh run examples/3D_weak_scaling/case.py 2 -t pre_process -j 8
./mfc.sh run examples/scaling/case.py -t pre_process -j 8 -- --scaling weak
```

## Parameters
@@ -87,11 +86,15 @@ Definition of the parameters is described in the following subsections.

### 1. Runtime

| Parameter | Type | Description |
| ---: | :----: | :--- |
| `run_time_info` | Logical | Output run-time information |
| Parameter | Type | Description |
| ---: | :----: | :--- |
| `run_time_info` | Logical | Output run-time information |
| `rdma_mpi` | Logical | (GPUs) Enable RDMA for MPI communication. |

- `run_time_info` generates a text file that includes run-time information including the CFL number(s) at each time-step.
- `rdma_mpi` optimizes data transfers between GPUs using Remote Direct Memory Access (RDMA).
The underlying MPI implementation and communication infrastructure must support this
feature by detecting GPU pointers and performing RDMA accordingly.
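As a minimal sketch, a case file opts in by emitting the flag alongside its other parameters (only the keys shown are relevant here; a real case emits the full parameter set, and whether RDMA is actually used still depends on the MPI library being GPU-aware):

```python
import json

# Fragment of a case file's output dictionary (sketch, not a complete case).
params = {
    'run_time_info' : 'T',
    'rdma_mpi'      : 'T',  # 'F' disables the GPU-aware transfer optimization
}
print(json.dumps(params))
```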

### 2. Computational Domain

2 changes: 0 additions & 2 deletions docs/documentation/running.md
@@ -24,8 +24,6 @@ several supercomputer clusters, both interactively and through batch submission.
>
> If `-c <computer name>` is left unspecified, it defaults to `-c default`.
Additional flags can be appended to the MPI executable call using the `-f` (i.e `--flags`) option.

Please refer to `./mfc.sh run -h` for a complete list of arguments and options, along with their defaults.

## Interactive Execution
3 changes: 1 addition & 2 deletions examples/2D_whale_bubble_annulus/case.py
@@ -193,6 +193,5 @@
'Mono(1)%pulse' : 1,
'Mono(1)%mag' : 1.,
'Mono(1)%length' : 0.2,
'cu_mpi' : 'F',

'rdma_mpi' : 'F',
}))
24 changes: 0 additions & 24 deletions examples/3D_weak_scaling/README.md

This file was deleted.

5 changes: 0 additions & 5 deletions examples/3D_weak_scaling/analyze.sh

This file was deleted.

33 changes: 33 additions & 0 deletions examples/scaling/README.md
@@ -0,0 +1,33 @@
# Strong- & Weak-scaling

The [**Scaling**](case.py) case can exercise both weak- and strong-scaling. It
adjusts itself depending on the number of requested ranks.

This directory also contains a collection of scripts used to test strong-scaling
on OLCF Frontier. They required modifying MFC to collect certain metrics, but are
meant to serve as a reference for users wishing to run similar experiments.

## Weak Scaling

Pass `--scaling weak`. The `--memory` option controls (approximately) how much
memory each rank should use, in gigabytes. The number of cells in each dimension
is then adjusted according to the number of requested ranks and an approximate
relation between cell count and memory usage. The problem size increases
linearly with the number of ranks.

## Strong Scaling

Pass `--scaling strong`. The `--memory` option controls (approximately) how much
memory should be used in total during simulation, across all ranks, in gigabytes.
The problem size remains constant as the number of ranks increases.
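The sizing logic can be sketched as follows, using the same cells-per-GB estimate and 2s×s×s grid shape as `case.py` (the `cpg` constant is only a rough approximation; exact memory usage depends on the case):

```python
import math

cpg = 8_000_000 / 16.0  # approximate cells per GB of GPU memory (from case.py)

def nxyz(scaling: str, nranks: int, memory_gb: int):
    # Weak scaling: memory_gb is per rank, so the cell count grows with nranks.
    # Strong scaling: memory_gb is the global budget; nranks is ignored and the
    # grid stays fixed as more ranks are added.
    ncells = cpg * nranks * memory_gb if scaling == "weak" else cpg * memory_gb
    s = math.floor((ncells / 2.0) ** (1 / 3))
    return 2 * s, s, s  # the domain is twice as long in x

print(nxyz("weak", 8, 4))    # grows with rank count
print(nxyz("strong", 8, 4))  # independent of rank count
```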

## Example

For example, to run a weak-scaling test that uses ~4 GB of GPU memory per rank
on eight nodes with two ranks each, with case optimization enabled, one could run:

```shell
./mfc.sh run examples/scaling/case.py -t pre_process simulation \
-e batch -p mypartition -N 8 -n 2 -w "01:00:00" -# "MFC Weak Scaling" \
--case-optimization -j 32 -- --scaling weak --memory 4
```
4 changes: 4 additions & 0 deletions examples/scaling/build.sh
@@ -0,0 +1,4 @@
#!/bin/bash

./mfc.sh build -t pre_process simulation --case-optimization -i examples/scaling/case.py \
-j 8 --gpu --mpi --no-debug -- -s strong -m 512
62 changes: 39 additions & 23 deletions examples/3D_weak_scaling/case.py → examples/scaling/case.py
@@ -1,28 +1,45 @@
#!/usr/bin/env python3

# Case file contributed by Anand Radhakrishnan and modified by Henry Le Berre
# for integration as a weak scaling benchmark for MFC.

import json, math, argparse
import sys, json, math, typing, argparse

parser = argparse.ArgumentParser(
prog="3D_weak_scaling",
description="This MFC case was created for the purposes of weak scaling.",
prog="scaling",
description="Weak- and strong-scaling benchmark case.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter)

parser.add_argument("dict", type=str, metavar="DICT", help=argparse.SUPPRESS)
parser.add_argument("gbpp", type=int, metavar="MEM", default=16, help="Adjusts the problem size per rank to fit into [MEM] GB of GPU memory per GPU.")
parser.add_argument("dict", type=str, metavar="DICT")
parser.add_argument("-s", "--scaling", type=str, metavar="SCALING", choices=["weak", "strong"], help="Whether weak- or strong-scaling is being exercised.")
parser.add_argument("-m", "--memory", type=int, metavar="MEMORY", help="Weak scaling: memory per rank in GB. Strong scaling: global memory in GB. Used to determine cell count.")
parser.add_argument("-f", "--fidelity", type=str, metavar="FIDELITY", choices=["ideal", "exact"], default="ideal")
parser.add_argument("--rdma_mpi", type=str, metavar="RDMA_MPI", choices=["T", "F"], default="F")
parser.add_argument("--n-steps", type=int, metavar="N", default=None)

args = parser.parse_args()

if args.scaling is None:
parser.print_help()
sys.exit(1)

DICT = json.loads(args.dict)

ARGS = vars(parser.parse_args())
DICT = json.loads(ARGS["dict"])
# Approximate number of cells per GB of memory. The exact value is not important.
cpg = 8000000 / 16.0
# Number of ranks.
nranks = DICT["nodes"] * DICT["tasks_per_node"]

ppg = 8000000 / 16.0
procs = DICT["nodes"] * DICT["tasks_per_node"]
ncells = math.floor(ppg * procs * ARGS["gbpp"])
s = math.floor((ncells / 2.0) ** (1/3))
Nx, Ny, Nz = 2*s, s, s
def nxyz_from_ncells(ncells: float) -> typing.Tuple[int, int, int]:
s = math.floor((ncells / 2.0) ** (1/3))
return 2*s, s, s

# athmospheric pressure - Pa (used as reference value)
if args.scaling == "weak":
if args.fidelity == "ideal":
raise RuntimeError("ask ben")
else:
Nx, Ny, Nz = nxyz_from_ncells(cpg * nranks * args.memory)
else:
Nx, Ny, Nz = nxyz_from_ncells(cpg * args.memory)

# Atmospheric pressure - Pa (used as reference value)
patm = 101325

# Initial Droplet Diameter / Reference length - m
@@ -162,7 +179,8 @@
AS = int( NtA // SF + 1 )

# Nt = total number of steps. Note that Nt >= NtA (so at least tendA is completely simulated)
Nt = AS * SF
Nt = args.n_steps or (AS * SF)
SF = min( SF, Nt )

# total simulation time - s. Note that tend >= tendA
tend = Nt * dt
@@ -171,6 +189,7 @@
print(json.dumps({
# Logistics ================================================
'run_time_info' : 'T',
'rdma_mpi' : args.rdma_mpi,
# ==========================================================

# Computational Domain Parameters ==========================
@@ -186,8 +205,8 @@
'cyl_coord' : 'F',
'dt' : dt,
't_step_start' : 0,
't_step_stop' : int(5000*16.0/ARGS["gbpp"]),
't_step_save' : int(1000*16.0/ARGS["gbpp"]),
't_step_stop' : Nt,
't_step_save' : SF,
# ==========================================================

# Simulation Algorithm Parameters ==========================
@@ -201,7 +220,7 @@
'time_stepper' : 3,
'weno_order' : 3,
'weno_eps' : 1.0E-16,
'weno_Re_flux' : 'F',
'weno_Re_flux' : 'F',
'weno_avg' : 'F',
'mapped_weno' : 'T',
'riemann_solver' : 2,
@@ -283,6 +302,3 @@
'fluid_pp(2)%pi_inf' : gama*pia/(gama-1),
# ==========================================================
}))

# ==============================================================================

90 changes: 90 additions & 0 deletions examples/scaling/export.py
@@ -0,0 +1,90 @@
import re, os, csv, glob, statistics

from dataclasses import dataclass, fields

CDIR=os.path.abspath(os.path.join("examples", "scaling"))
LDIR=os.path.join(CDIR, "logs")

def get_num(s: str) -> float:
try:
return float(re.findall(r"[0-9]+\.[0-9]+(?:E[-+][0-9]+)?", s, re.MULTILINE)[0])
except:
return None

def get_nums(arr):
return {get_num(_) for _ in arr if get_num(_)}

@dataclass(frozen=True, order=True)
class Configuration:
nodes: int
mem: int
rdma_mpi: bool

@dataclass
class Result:
ts_avg: float
mpi_avg: float
init_t: float
sim_t: float

runs = {}

for logpath in glob.glob(os.path.join(LDIR, "run-*-sim*")):
logdata = open(logpath, "r").read()

tss = get_nums(re.findall(r'^ TS .+', logdata, re.MULTILINE))
mpis = get_nums(re.findall(r'^ MPI .+', logdata, re.MULTILINE))
try:
perf = get_num(re.findall(r"^ Performance: .+", logdata, re.MULTILINE)[0])
except:
perf = 'N/A'

if len(tss) == 0: tss = [-1.0]
if len(mpis) == 0: mpis = [-1.0]

pathels = os.path.relpath(logpath, LDIR).split('-')

runs[Configuration(
nodes=int(pathels[1]),
mem=int(pathels[2]),
rdma_mpi=pathels[3] == 'T'
)] = Result(
ts_avg=statistics.mean(tss),
mpi_avg=statistics.mean(mpis),
init_t=get_num(re.findall(r"Init took .+", logdata, re.MULTILINE)[0]),
sim_t=get_num(re.findall(r"sim_duration .+", logdata, re.MULTILINE)[0]),
)

with open(os.path.join(CDIR, "export.csv"), "w") as f:
writer = csv.writer(f, delimiter=',')
writer.writerow([
_.name for _ in fields(Configuration) + fields(Result)
])

for cfg in sorted(runs.keys()):
writer.writerow(
[ getattr(cfg, _.name) for _ in fields(Configuration) ] +
[ getattr(runs[cfg], _.name) for _ in fields(Result) ]
)

for rdma_mpi in (False, True):
with open(
os.path.join(CDIR, f"strong_scaling{'-rdma_mpi' if rdma_mpi else ''}.csv"),
"w"
) as f:
writer = csv.writer(f, delimiter=',')

for nodes in sorted({
_.nodes for _ in runs.keys() if _.rdma_mpi == rdma_mpi
}):
row = (nodes*8,)
for mem in sorted({
_.mem for _ in runs.keys() if _.nodes == nodes and _.rdma_mpi == rdma_mpi
}, reverse=True):
ref = runs[Configuration(nodes=sorted({
_.nodes for _ in runs.keys() if _.rdma_mpi == rdma_mpi
})[0], mem=mem, rdma_mpi=rdma_mpi)]
run = runs[Configuration(nodes=nodes, mem=mem, rdma_mpi=rdma_mpi)]
row = (*row,run.sim_t,ref.sim_t/nodes)

writer.writerow(row)
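As an aside, the float-extraction idiom used by `export.py` above can be exercised in isolation (a sketch mirroring `get_num`, not part of the commit):

```python
import re

def get_num(s):
    # First decimal number in s, optionally in Fortran-style scientific
    # notation such as 1.2345E-02; None when no match is found.
    matches = re.findall(r"[0-9]+\.[0-9]+(?:E[-+][0-9]+)?", s)
    return float(matches[0]) if matches else None

print(get_num(" TS      1.2345E-02 s"))  # 0.012345
print(get_num("no timing here"))         # None
```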
