
Commit

GPU-Aware MPI on OLCF Frontier and Combined weak- & strong-scaling case (#448)
henryleberre authored Jun 4, 2024
1 parent daa8e85 commit 31aed98
Showing 30 changed files with 398 additions and 169 deletions.
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -53,6 +53,8 @@ examples/*/viz/
examples/*.jpg
examples/*.png
examples/*/workloads/
examples/*/run-*/
examples/*/logs/
workloads/

benchmarks/*batch/*/
33 changes: 18 additions & 15 deletions docs/documentation/case.md
@@ -30,40 +30,39 @@ This is particularly useful when computations are done in Python to generate the

## (Optional) Accepting command line arguments

Input files can accept **positional** command line arguments, forwarded by `mfc.sh run`.
Consider this example from the 3D_weak_scaling case:
Input files can accept command line arguments, forwarded by `mfc.sh run`.
Consider this example from the `scaling` case:

```python
import argparse

parser = argparse.ArgumentParser(
prog="3D_weak_scaling",
description="This MFC case was created for the purposes of weak scaling.",
prog="scaling",
description="Weak- and strong-scaling benchmark case.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter)

parser.add_argument("dict", type=str, metavar="DICT", help=argparse.SUPPRESS)
parser.add_argument("gbpp", type=int, metavar="MEM", default=16, help="Adjusts the problem size per rank to fit into [MEM] GB of GPU memory")
parser.add_argument("dict", type=str, metavar="DICT")
parser.add_argument("-s", "--scaling", type=str, metavar="SCALING", choices=["weak", "strong"], help="Whether weak- or strong-scaling is being exercised.")

# Your parsed arguments are here
ARGS = vars(parser.parse_args())
args = parser.parse_args()
```

The first argument is always a JSON string representing `mfc.sh run`'s internal
state.
It contains all the runtime information you might want from the build/run system.
We hide it from the help menu with `help=argparse.SUPPRESS` since it is not meant to be passed in by users.
You can add as many additional positional arguments as you may need.
You can add as many additional arguments as you may need.
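As a quick sketch of consuming that forwarded state, a case file can recover the run-system dictionary from the first positional argument like so (the JSON string is mocked here; `nodes` and `tasks_per_node` are keys the `scaling` case relies on, other keys are not enumerated in this document):

```python
import json, argparse

parser = argparse.ArgumentParser(prog="scaling")
parser.add_argument("dict", type=str, metavar="DICT")

# In a real run, `mfc.sh run` supplies this JSON string; it is mocked here.
args = parser.parse_args(['{"nodes": 2, "tasks_per_node": 8}'])

DICT = json.loads(args.dict)
nranks = DICT["nodes"] * DICT["tasks_per_node"]
print(nranks)
```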

To run such a case, use the following format:

```shell
./mfc.sh run <path/to/case.py> <positional arguments> <regular mfc.sh run arguments>
./mfc.sh run <path/to/case.py> <mfc.sh run arguments> -- <case arguments>
```

For example, to run the 3D_weak_scaling case with `gbpp=2`:
For example, to run the `scaling` case in "weak-scaling" mode:

```shell
./mfc.sh run examples/3D_weak_scaling/case.py 2 -t pre_process -j 8
./mfc.sh run examples/scaling/case.py -t pre_process -j 8 -- --scaling weak
```

## Parameters
@@ -87,11 +86,15 @@ Definition of the parameters is described in the following subsections.

### 1. Runtime

| Parameter | Type | Description |
| ---: | :----: | :--- |
| `run_time_info` | Logical | Output run-time information |
| Parameter | Type | Description |
| ---: | :----: | :--- |
| `run_time_info` | Logical | Output run-time information |
| `rdma_mpi` | Logical | (GPUs) Enable RDMA for MPI communication. |

- `run_time_info` generates a text file that includes run-time information including the CFL number(s) at each time-step.
- `rdma_mpi` optimizes data transfers between GPUs using Remote Direct Memory Access (RDMA).
The underlying MPI implementation and communication infrastructure must support this
feature by detecting GPU pointers and performing RDMA accordingly.
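As a minimal sketch, a case file opts in by emitting the flag alongside its other parameters (only the keys shown are relevant here; a real case emits the full parameter set, and whether RDMA is actually used still depends on the MPI library being GPU-aware):

```python
import json

# Fragment of a case file's output dictionary (sketch, not a complete case).
params = {
    'run_time_info' : 'T',
    'rdma_mpi'      : 'T',  # 'F' disables the GPU-aware transfer optimization
}
print(json.dumps(params))
```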

### 2. Computational Domain

2 changes: 0 additions & 2 deletions docs/documentation/running.md
@@ -24,8 +24,6 @@ several supercomputer clusters, both interactively and through batch submission.
>
> If `-c <computer name>` is left unspecified, it defaults to `-c default`.
Additional flags can be appended to the MPI executable call using the `-f` (i.e `--flags`) option.

Please refer to `./mfc.sh run -h` for a complete list of arguments and options, along with their defaults.

## Interactive Execution
3 changes: 1 addition & 2 deletions examples/2D_whale_bubble_annulus/case.py
@@ -193,6 +193,5 @@
'Mono(1)%pulse' : 1,
'Mono(1)%mag' : 1.,
'Mono(1)%length' : 0.2,
'cu_mpi' : 'F',

'rdma_mpi' : 'F',
}))
24 changes: 0 additions & 24 deletions examples/3D_weak_scaling/README.md

This file was deleted.

5 changes: 0 additions & 5 deletions examples/3D_weak_scaling/analyze.sh

This file was deleted.

33 changes: 33 additions & 0 deletions examples/scaling/README.md
@@ -0,0 +1,33 @@
# Strong- & Weak-scaling

The [**Scaling**](case.py) case can exercise both weak- and strong-scaling. It
adjusts itself depending on the number of requested ranks.

This directory also contains a collection of scripts used to test strong-scaling
on OLCF Frontier. They required modifying MFC to collect certain metrics, but are
meant to serve as a reference for users wishing to run similar experiments.

## Weak Scaling

Pass `--scaling weak`. The `--memory` option controls (approximately) how much
memory each rank should use, in gigabytes. The number of cells in each dimension
is then adjusted according to the number of requested ranks and an approximate
relation between cell count and memory usage. The problem size increases
linearly with the number of ranks.

## Strong Scaling

Pass `--scaling strong`. The `--memory` option controls (approximately) how much
memory should be used in total during simulation, across all ranks, in gigabytes.
The problem size remains constant as the number of ranks increases.
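The sizing logic can be sketched as follows, using the same cells-per-GB estimate and 2s×s×s grid shape as `case.py` (the `cpg` constant is only a rough approximation; exact memory usage depends on the case):

```python
import math

cpg = 8_000_000 / 16.0  # approximate cells per GB of GPU memory (from case.py)

def nxyz(scaling: str, nranks: int, memory_gb: int):
    # Weak scaling: memory_gb is per rank, so the cell count grows with nranks.
    # Strong scaling: memory_gb is the global budget; nranks is ignored and the
    # grid stays fixed as more ranks are added.
    ncells = cpg * nranks * memory_gb if scaling == "weak" else cpg * memory_gb
    s = math.floor((ncells / 2.0) ** (1 / 3))
    return 2 * s, s, s  # the domain is twice as long in x

print(nxyz("weak", 8, 4))    # grows with rank count
print(nxyz("strong", 8, 4))  # independent of rank count
```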

## Example

For example, to run a weak-scaling test that uses ~4 GB of GPU memory per rank
on eight nodes with two ranks each, with case optimization enabled, one could run:

```shell
./mfc.sh run examples/scaling/case.py -t pre_process simulation \
-e batch -p mypartition -N 8 -n 2 -w "01:00:00" -# "MFC Weak Scaling" \
--case-optimization -j 32 -- --scaling weak --memory 4
```
4 changes: 4 additions & 0 deletions examples/scaling/build.sh
@@ -0,0 +1,4 @@
#!/bin/bash

./mfc.sh build -t pre_process simulation --case-optimization -i examples/scaling/case.py \
-j 8 --gpu --mpi --no-debug -- -s strong -m 512
62 changes: 39 additions & 23 deletions examples/3D_weak_scaling/case.py → examples/scaling/case.py
@@ -1,28 +1,45 @@
#!/usr/bin/env python3

# Case file contributed by Anand Radhakrishnan and modified by Henry Le Berre
# for integration as a weak scaling benchmark for MFC.

import json, math, argparse
import sys, json, math, typing, argparse

parser = argparse.ArgumentParser(
prog="3D_weak_scaling",
description="This MFC case was created for the purposes of weak scaling.",
prog="scaling",
description="Weak- and strong-scaling benchmark case.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter)

parser.add_argument("dict", type=str, metavar="DICT", help=argparse.SUPPRESS)
parser.add_argument("gbpp", type=int, metavar="MEM", default=16, help="Adjusts the problem size per rank to fit into [MEM] GB of GPU memory per GPU.")
parser.add_argument("dict", type=str, metavar="DICT")
parser.add_argument("-s", "--scaling", type=str, metavar="SCALING", choices=["weak", "strong"], help="Whether weak- or strong-scaling is being exercised.")
parser.add_argument("-m", "--memory", type=int, metavar="MEMORY", help="Weak scaling: memory per rank in GB. Strong scaling: global memory in GB. Used to determine cell count.")
parser.add_argument("-f", "--fidelity", type=str, metavar="FIDELITY", choices=["ideal", "exact"], default="ideal")
parser.add_argument("--rdma_mpi", type=str, metavar="RDMA_MPI", choices=["T", "F"], default="F")
parser.add_argument("--n-steps", type=int, metavar="N", default=None)

args = parser.parse_args()

if args.scaling is None:
parser.print_help()
sys.exit(1)

DICT = json.loads(args.dict)

ARGS = vars(parser.parse_args())
DICT = json.loads(ARGS["dict"])
# Approximate number of cells per GB of memory. The exact value is not important.
cpg = 8000000 / 16.0
# Number of ranks.
nranks = DICT["nodes"] * DICT["tasks_per_node"]

ppg = 8000000 / 16.0
procs = DICT["nodes"] * DICT["tasks_per_node"]
ncells = math.floor(ppg * procs * ARGS["gbpp"])
s = math.floor((ncells / 2.0) ** (1/3))
Nx, Ny, Nz = 2*s, s, s
def nxyz_from_ncells(ncells: float) -> typing.Tuple[int, int, int]:
s = math.floor((ncells / 2.0) ** (1/3))
return 2*s, s, s

# athmospheric pressure - Pa (used as reference value)
if args.scaling == "weak":
if args.fidelity == "ideal":
raise RuntimeError("ask ben")
else:
Nx, Ny, Nz = nxyz_from_ncells(cpg * nranks * args.memory)
else:
Nx, Ny, Nz = nxyz_from_ncells(cpg * args.memory)

# Atmospheric pressure - Pa (used as reference value)
patm = 101325

# Initial Droplet Diameter / Reference length - m
@@ -162,7 +179,8 @@
AS = int( NtA // SF + 1 )

# Nt = total number of steps. Note that Nt >= NtA (so at least tendA is completely simulated)
Nt = AS * SF
Nt = args.n_steps or (AS * SF)
SF = min( SF, Nt )

# total simulation time - s. Note that tend >= tendA
tend = Nt * dt
@@ -171,6 +189,7 @@
print(json.dumps({
# Logistics ================================================
'run_time_info' : 'T',
'rdma_mpi' : args.rdma_mpi,
# ==========================================================

# Computational Domain Parameters ==========================
@@ -186,8 +205,8 @@
'cyl_coord' : 'F',
'dt' : dt,
't_step_start' : 0,
't_step_stop' : int(5000*16.0/ARGS["gbpp"]),
't_step_save' : int(1000*16.0/ARGS["gbpp"]),
't_step_stop' : Nt,
't_step_save' : SF,
# ==========================================================

# Simulation Algorithm Parameters ==========================
@@ -201,7 +220,7 @@
'time_stepper' : 3,
'weno_order' : 3,
'weno_eps' : 1.0E-16,
'weno_Re_flux' : 'F',
'weno_Re_flux' : 'F',
'weno_avg' : 'F',
'mapped_weno' : 'T',
'riemann_solver' : 2,
@@ -283,6 +302,3 @@
'fluid_pp(2)%pi_inf' : gama*pia/(gama-1),
# ==========================================================
}))

# ==============================================================================

90 changes: 90 additions & 0 deletions examples/scaling/export.py
@@ -0,0 +1,90 @@
import re, os, csv, glob, statistics

from dataclasses import dataclass, fields

CDIR=os.path.abspath(os.path.join("examples", "scaling"))
LDIR=os.path.join(CDIR, "logs")

def get_num(s: str) -> float:
try:
return float(re.findall(r"[0-9]+\.[0-9]+(?:E[-+][0-9]+)?", s, re.MULTILINE)[0])
except:
return None

def get_nums(arr):
return {get_num(_) for _ in arr if get_num(_)}

@dataclass(frozen=True, order=True)
class Configuration:
nodes: int
mem: int
rdma_mpi: bool

@dataclass
class Result:
ts_avg: float
mpi_avg: float
init_t: float
sim_t: float

runs = {}

for logpath in glob.glob(os.path.join(LDIR, "run-*-sim*")):
logdata = open(logpath, "r").read()

tss = get_nums(re.findall(r'^ TS .+', logdata, re.MULTILINE))
mpis = get_nums(re.findall(r'^ MPI .+', logdata, re.MULTILINE))
try:
perf = get_num(re.findall(r"^ Performance: .+", logdata, re.MULTILINE)[0])
except:
perf = 'N/A'

if len(tss) == 0: tss = [-1.0]
if len(mpis) == 0: mpis = [-1.0]

pathels = os.path.relpath(logpath, LDIR).split('-')

runs[Configuration(
nodes=int(pathels[1]),
mem=int(pathels[2]),
rdma_mpi=pathels[3] == 'T'
)] = Result(
ts_avg=statistics.mean(tss),
mpi_avg=statistics.mean(mpis),
init_t=get_num(re.findall(r"Init took .+", logdata, re.MULTILINE)[0]),
sim_t=get_num(re.findall(r"sim_duration .+", logdata, re.MULTILINE)[0]),
)

with open(os.path.join(CDIR, "export.csv"), "w") as f:
writer = csv.writer(f, delimiter=',')
writer.writerow([
_.name for _ in fields(Configuration) + fields(Result)
])

for cfg in sorted(runs.keys()):
writer.writerow(
[ getattr(cfg, _.name) for _ in fields(Configuration) ] +
[ getattr(runs[cfg], _.name) for _ in fields(Result) ]
)

for rdma_mpi in (False, True):
with open(
os.path.join(CDIR, f"strong_scaling{'-rdma_mpi' if rdma_mpi else ''}.csv"),
"w"
) as f:
writer = csv.writer(f, delimiter=',')

for nodes in sorted({
_.nodes for _ in runs.keys() if _.rdma_mpi == rdma_mpi
}):
row = (nodes*8,)
for mem in sorted({
_.mem for _ in runs.keys() if _.nodes == nodes and _.rdma_mpi == rdma_mpi
}, reverse=True):
ref = runs[Configuration(nodes=sorted({
_.nodes for _ in runs.keys() if _.rdma_mpi == rdma_mpi
})[0], mem=mem, rdma_mpi=rdma_mpi)]
run = runs[Configuration(nodes=nodes, mem=mem, rdma_mpi=rdma_mpi)]
row = (*row,run.sim_t,ref.sim_t/nodes)

writer.writerow(row)
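As an aside, the float-extraction idiom used by `export.py` above can be exercised in isolation (a sketch mirroring `get_num`, not part of the commit):

```python
import re

def get_num(s):
    # First decimal number in s, optionally in Fortran-style scientific
    # notation such as 1.2345E-02; None when no match is found.
    matches = re.findall(r"[0-9]+\.[0-9]+(?:E[-+][0-9]+)?", s)
    return float(matches[0]) if matches else None

print(get_num(" TS      1.2345E-02 s"))  # 0.012345
print(get_num("no timing here"))         # None
```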
