New Runner Group #746

Status: Closed
43 commits in total; this view shows changes from 38 commits.
Commits
f40b344
ice runner
sbryngelson Nov 25, 2024
6a46d76
bump
sbryngelson Nov 25, 2024
f62bd6e
fix
sbryngelson Nov 25, 2024
8623c31
fix!
sbryngelson Nov 25, 2024
1387951
add bench!
sbryngelson Nov 25, 2024
9b82abd
allow for more frontier runners
sbryngelson Nov 25, 2024
1d48baf
make self hosted run?
sbryngelson Nov 25, 2024
88b3478
fix?
sbryngelson Nov 25, 2024
d5361fb
try again
sbryngelson Nov 25, 2024
d584a6f
fix
sbryngelson Nov 25, 2024
5fe721c
make things better
sbryngelson Nov 26, 2024
ca2a3c3
make this thing work and steal from henry while im at it
sbryngelson Nov 26, 2024
7d8f81e
clean
sbryngelson Nov 26, 2024
4064331
Update spelling.yml
sbryngelson Nov 26, 2024
e8bd9ab
Merge branch 'master' into runners
sbryngelson Nov 26, 2024
9af4d3b
use mfc.sh init because it is useful
sbryngelson Nov 26, 2024
cef04fd
Merge branch 'master' into runners
sbryngelson Nov 26, 2024
4c54d39
clean up
sbryngelson Nov 26, 2024
de267d6
Update submit.sh
sbryngelson Nov 26, 2024
2f33f55
add delta runner
sbryngelson Nov 26, 2024
5e1caec
more runner
sbryngelson Nov 26, 2024
f00fceb
fix
sbryngelson Nov 26, 2024
32f6b69
Update submit.sh
sbryngelson Nov 27, 2024
468deae
fix me
sbryngelson Nov 27, 2024
5548162
Update submit.sh
sbryngelson Nov 27, 2024
be6f1ed
fix?
sbryngelson Nov 27, 2024
64c4626
Merge branch 'runners' of https://github.com/sbryngelson/MFC into run…
sbryngelson Nov 27, 2024
e90ffb8
Update submit.sh
sbryngelson Dec 7, 2024
3eb1706
Update test.sh
sbryngelson Dec 7, 2024
b97c295
Update test.sh
sbryngelson Dec 7, 2024
1475de1
Update test.sh
sbryngelson Dec 7, 2024
a771f6a
Update test.yml
sbryngelson Dec 7, 2024
0b40a72
Update test.sh
sbryngelson Dec 9, 2024
cc61d9e
Update build.sh
sbryngelson Dec 9, 2024
66a4faf
Update p_main.f90
sbryngelson Dec 9, 2024
cb3f9f2
Update submit.sh
sbryngelson Dec 9, 2024
86df0c9
Update run.py
sbryngelson Dec 10, 2024
e4ec936
Merge branch 'master' into runners
sbryngelson Dec 10, 2024
c4bf861
Discard changes to .github/workflows/formatting.yml
sbryngelson Dec 16, 2024
bf47549
Discard changes to .github/workflows/frontier/test.sh
sbryngelson Dec 16, 2024
92ea3a8
Discard changes to .github/workflows/line-count.yml
sbryngelson Dec 16, 2024
f7c9f89
Discard changes to src/pre_process/p_main.f90
sbryngelson Dec 16, 2024
6bd4d3f
Merge branch 'master' into runners
sbryngelson Dec 16, 2024
6 changes: 4 additions & 2 deletions .github/workflows/bench.yml
@@ -25,9 +25,10 @@ jobs:
strategy:
matrix:
device: ['cpu', 'gpu']
lbl: ['gt']
runs-on:
group: phoenix
labels: gt
labels: ${{ matrix.lbl }}
timeout-minutes: 1400
env:
ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION: node16
@@ -46,6 +47,7 @@ jobs:
path: master

- name: Bench (Master v. PR)
if: matrix.lbl == 'gt'
run: |
(cd pr && bash .github/workflows/phoenix/submit.sh .github/workflows/phoenix/bench.sh ${{ matrix.device }}) &
(cd master && bash .github/workflows/phoenix/submit.sh .github/workflows/phoenix/bench.sh ${{ matrix.device }}) &
@@ -60,7 +62,7 @@ jobs:
uses: actions/upload-artifact@v4
if: always()
with:
name: logs-${{ matrix.device }}
name: logs-${{ matrix.device }}-${{matrix.lbl}}
path: |
pr/bench-${{ matrix.device }}.*
pr/build/benchmarks/*
63 changes: 63 additions & 0 deletions .github/workflows/delta/submit.sh
@@ -0,0 +1,63 @@
#!/bin/bash

set -e

usage() {
echo "Usage: $0 [script.sh] [cpu|gpu]"
}

if [ ! -z "$1" ]; then
sbatch_script_contents=$(cat "$1")
else
usage
exit 1
fi

sbatch_cpu_opts="\
#SBATCH -p cpu
#SBATCH --account=bdiy-delta-cpu
"

sbatch_gpu_opts="\
#SBATCH -p gpuA100x4,gpuA100x8
#SBATCH --account=bdiy-delta-gpu
#SBATCH --gpus-per-node=4
"

if [ "$2" == "cpu" ]; then
sbatch_device_opts="$sbatch_cpu_opts"
elif [ "$2" == "gpu" ]; then
sbatch_device_opts="$sbatch_gpu_opts"
else
usage
exit 1
fi

job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2"

sbatch <<EOT
#!/bin/bash
#SBATCH -Jshb-$job_slug # Job name
#SBATCH -N1 # Number of nodes required
$sbatch_device_opts
#SBATCH -t 03:00:00 # Duration of the job (Ex: 15 mins)
#SBATCH -n 20
#SBATCH -o$job_slug.out # Combined output and error messages file
#SBATCH --constraint="scratch"
#SBATCH -W # Do not exit until the submitted job terminates.

set -e
set -x

cd "\$SLURM_SUBMIT_DIR"
echo "Running in $(pwd):"

job_slug="$job_slug"
job_device="$2"

. ./mfc.sh load -c d -m $2

$sbatch_script_contents

EOT
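
The `job_slug` pipeline in this script (take the basename, strip a trailing `.sh`, replace every non-alphanumeric character with `-`, then append the device) can be traced with a small Python sketch. The function name `make_job_slug` is hypothetical and not part of MFC; it only mirrors the shell logic so the resulting Slurm job and log names are easy to predict.

```python
import os
import re


def make_job_slug(script_path: str, device: str) -> str:
    # Hypothetical helper mirroring the shell pipeline in submit.sh.
    name = os.path.basename(script_path)        # basename "$1"
    name = re.sub(r"\.sh$", "", name)           # sed: drop a trailing .sh
    name = re.sub(r"[^a-zA-Z0-9]", "-", name)   # sed: non-alphanumerics become '-'
    return f"{name}-{device}"                   # append "-$2"


print(make_job_slug(".github/workflows/delta/test.sh", "cpu"))  # test-cpu
print(make_job_slug(".github/workflows/ice/bench.sh", "gpu"))   # bench-gpu
```

With the first input, the `#SBATCH -Jshb-$job_slug` and `#SBATCH -o$job_slug.out` lines would resolve to a job named `shb-test-cpu` and a log file named `test-cpu.out`.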

20 changes: 20 additions & 0 deletions .github/workflows/delta/test.sh
@@ -0,0 +1,20 @@
#!/bin/bash

build_opts=""
if [ "$job_device" == "gpu" ]; then
build_opts="--gpu"
fi

n_build_threads=20
n_test_threads=20

if [ "$job_device" == "gpu" ]; then
gpu_count=$(nvidia-smi -L | wc -l) # number of GPUs on node
gpu_ids=$(seq -s ' ' 0 $(($gpu_count-1))) # 0,1,2,...,gpu_count-1
device_opts="-g $gpu_ids"
n_test_threads=`expr $gpu_count \* 2`
fi

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/sw/spack/deltas11-2023-03/apps/linux-rhel8-zen3/nvhpc-24.1/openmpi-4.1.5-zkiklxi/lib/
./mfc.sh build -j $n_build_threads $build_opts
./mfc.sh test --max-attempts 3 -a -j $n_test_threads $device_opts $build_opts --no-build -- -c delta
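
For GPU jobs, the script above derives the `-g` device list and the test thread count from the number of GPUs that `nvidia-smi -L` reports. The following Python sketch reproduces that arithmetic; `gpu_test_options` is a hypothetical name, and the GPU count of four is an assumed example value rather than something measured on Delta.

```python
def gpu_test_options(gpu_count: int) -> tuple[str, int]:
    # Hypothetical helper mirroring the GPU branch of test.sh.
    gpu_ids = " ".join(str(i) for i in range(gpu_count))  # seq -s ' ' 0 $((gpu_count-1))
    device_opts = f"-g {gpu_ids}"                         # passed to ./mfc.sh test
    n_test_threads = gpu_count * 2                        # two test threads per GPU
    return device_opts, n_test_threads


device_opts, n_test_threads = gpu_test_options(4)  # assumed: four GPUs on the node
print(device_opts)      # -g 0 1 2 3
print(n_test_threads)   # 8
```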
3 changes: 3 additions & 0 deletions .github/workflows/formatting.yml
@@ -10,6 +10,9 @@ jobs:
steps:
- uses: actions/checkout@v4

- name: MFC Python setup
run: ./mfc.sh init

- name: Check formatting
run: |
./mfc.sh format -j $(nproc)
1 change: 0 additions & 1 deletion .github/workflows/frontier/test.sh
@@ -4,4 +4,3 @@ gpus=`rocm-smi --showid | awk '{print $1}' | grep -Eo '[0-9]+' | uniq | tr '\n'
ngpus=`echo "$gpus" | tr -d '[:space:]' | wc -c`

./mfc.sh test --max-attempts 3 -j $ngpus -- -c frontier

15 changes: 15 additions & 0 deletions .github/workflows/ice/bench.sh
@@ -0,0 +1,15 @@
#!/bin/bash

n_ranks=12

if [ "$job_device" == "gpu" ]; then
n_ranks=$(nvidia-smi -L | wc -l) # number of GPUs on node
gpu_ids=$(seq -s ' ' 0 $(($n_ranks-1))) # 0,1,2,...,gpu_count-1
device_opts="--gpu -g $gpu_ids"
fi

if [ "$job_device" == "gpu" ]; then
./mfc.sh bench --mem 12 -j $(nproc) -o "$job_slug.yaml" -- -c phoenix $device_opts -n $n_ranks
else
./mfc.sh bench --mem 1 -j $(nproc) -o "$job_slug.yaml" -- -c phoenix $device_opts -n $n_ranks
fi
61 changes: 61 additions & 0 deletions .github/workflows/ice/submit.sh
@@ -0,0 +1,61 @@
#!/bin/bash

set -e

usage() {
echo "Usage: $0 [script.sh] [cpu|gpu]"
}

if [ ! -z "$1" ]; then
sbatch_script_contents=$(cat "$1")
else
usage
exit 1
fi

sbatch_cpu_opts="\
#SBATCH --ntasks-per-node=20 # Number of cores per node required
"

sbatch_gpu_opts="\
#SBATCH --ntasks-per-node=20 # Number of cores per node required
#SBATCH -G H100:2\
"

if [ "$2" == "cpu" ]; then
sbatch_device_opts="$sbatch_cpu_opts"
elif [ "$2" == "gpu" ]; then
sbatch_device_opts="$sbatch_gpu_opts"
else
usage
exit 1
fi

job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2"

sbatch <<EOT
#!/bin/bash
#SBATCH -Jshb-$job_slug # Job name
#SBATCH -N1 # Number of nodes required
#SBATCH -n 20 # Number of tasks required
$sbatch_device_opts
#SBATCH -t 03:00:00 # Duration of the job (Ex: 15 mins)
#SBATCH -o$job_slug.out # Combined output and error messages file
#SBATCH -W # Do not exit until the submitted job terminates.
#SBATCH --exclude=atl1-1-02-009-33-0

set -e
set -x

cd "\$SLURM_SUBMIT_DIR"
echo "Running in $(pwd):"

job_slug="$job_slug"
job_device="$2"

. ./mfc.sh load -c p -m $2

$sbatch_script_contents

EOT
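
The way this script picks device-specific `#SBATCH` lines and splices them into the batch header can be sketched in Python. This is an abridged model only: the `--exclude` line and the job body are omitted, and `SBATCH_DEVICE_OPTS` and `render_header` are hypothetical names introduced for illustration.

```python
# Device-specific options, copied from the cpu/gpu blocks of ice/submit.sh.
SBATCH_DEVICE_OPTS = {
    "cpu": ["#SBATCH --ntasks-per-node=20"],
    "gpu": ["#SBATCH --ntasks-per-node=20", "#SBATCH -G H100:2"],
}


def render_header(job_slug: str, device: str) -> str:
    # Hypothetical helper: reject unknown devices, as the shell script does via usage().
    if device not in SBATCH_DEVICE_OPTS:
        raise ValueError("Usage: submit.sh [script.sh] [cpu|gpu]")
    lines = [
        "#!/bin/bash",
        f"#SBATCH -Jshb-{job_slug}",    # job name
        "#SBATCH -N1",                  # one node
        "#SBATCH -n 20",                # twenty tasks
        *SBATCH_DEVICE_OPTS[device],    # spliced in where $sbatch_device_opts appears
        "#SBATCH -t 03:00:00",          # wall-time limit
        f"#SBATCH -o{job_slug}.out",    # combined output file
        "#SBATCH -W",                   # block until the job terminates
    ]
    return "\n".join(lines)


print(render_header("test-gpu", "gpu"))
```

Because of `-W`, `sbatch` does not return until the job finishes, which is what lets the GitHub Actions step wait on the job.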

19 changes: 19 additions & 0 deletions .github/workflows/ice/test.sh
@@ -0,0 +1,19 @@
#!/bin/bash

build_opts=""
if [ "$job_device" == "gpu" ]; then
build_opts="--gpu"
fi

./mfc.sh test --dry-run -j 8 $build_opts

n_test_threads=8

if [ "$job_device" == "gpu" ]; then
gpu_count=$(nvidia-smi -L | wc -l) # number of GPUs on node
gpu_ids=$(seq -s ' ' 0 $(($gpu_count-1))) # 0,1,2,...,gpu_count-1
device_opts="-g $gpu_ids"
n_test_threads=`expr $gpu_count \* 2`
fi

./mfc.sh test --max-attempts 3 -a -j $n_test_threads $device_opts -- -c phoenix
1 change: 1 addition & 0 deletions .github/workflows/line-count.yml
@@ -49,5 +49,6 @@ jobs:
cd $BASE
export MFC_PR=$PR
pwd
./mfc.sh init &> tmp.txt
./mfc.sh count_diff

3 changes: 3 additions & 0 deletions .github/workflows/lint-toolchain.yml
@@ -10,5 +10,8 @@ jobs:
steps:
- uses: actions/checkout@v4

- name: MFC Python setup
run: ./mfc.sh init

- name: Lint the toolchain
run: ./mfc.sh lint
2 changes: 1 addition & 1 deletion .github/workflows/phoenix/test.sh
@@ -5,7 +5,7 @@ if [ "$job_device" == "gpu" ]; then
build_opts="--gpu"
fi

./mfc.sh build -j 8 $build_opts
./mfc.sh test --dry-run -j 8 $build_opts

n_test_threads=8

5 changes: 4 additions & 1 deletion .github/workflows/spelling.yml
@@ -1,6 +1,6 @@
name: Spell Check

on: [push, workflow_dispatch]
on: [push, pull_request, workflow_dispatch]

jobs:
run:
@@ -10,5 +10,8 @@ jobs:
- name: Checkout
uses: actions/checkout@v4

- name: MFC Python setup
run: ./mfc.sh init

- name: Spell Check
run: ./mfc.sh spelling
12 changes: 11 additions & 1 deletion .github/workflows/test.yml
@@ -105,10 +105,12 @@ jobs:
strategy:
matrix:
device: ['cpu', 'gpu']
lbl: ['gt', 'frontier']
lbl: ['gt', 'delta', 'frontier']
exclude:
- device: cpu
lbl: frontier
- device: gpu
lbl: delta
runs-on:
group: phoenix
labels: ${{ matrix.lbl }}
@@ -123,6 +125,14 @@ jobs:
if: matrix.lbl == 'gt'
run: bash .github/workflows/phoenix/submit.sh .github/workflows/phoenix/test.sh ${{ matrix.device }}

# - name: Build & Test
# if: matrix.lbl == 'ice'
# run: bash .github/workflows/ice/submit.sh .github/workflows/ice/test.sh ${{ matrix.device }}

- name: Build & Test
if: matrix.lbl == 'delta'
run: bash .github/workflows/delta/submit.sh .github/workflows/delta/test.sh ${{ matrix.device }}

- name: Build
if: matrix.lbl == 'frontier'
run: bash .github/workflows/frontier/build.sh
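
To see which jobs the updated `matrix` and `exclude` entries in this workflow produce, the expansion can be modeled in a few lines of Python. This is a simplified sketch of GitHub Actions matrix semantics (it ignores `include`), and the `expand` function is a hypothetical name: a combination is dropped when it matches every key of some `exclude` entry.

```python
from itertools import product

# Values copied from the updated test.yml matrix.
matrix = {"device": ["cpu", "gpu"], "lbl": ["gt", "delta", "frontier"]}
exclude = [
    {"device": "cpu", "lbl": "frontier"},
    {"device": "gpu", "lbl": "delta"},
]


def expand(matrix, exclude):
    # Hypothetical helper: Cartesian product of the axes, minus excluded combinations.
    keys = list(matrix)
    combos = [dict(zip(keys, values)) for values in product(*matrix.values())]
    return [
        combo
        for combo in combos
        if not any(all(combo[k] == v for k, v in entry.items()) for entry in exclude)
    ]


jobs = expand(matrix, exclude)
for job in jobs:
    print(job["device"], job["lbl"])
```

Under this model, four jobs remain: `cpu`/`gt`, `cpu`/`delta`, `gpu`/`gt`, and `gpu`/`frontier`. Delta therefore runs CPU tests only and Frontier runs GPU tests only.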
2 changes: 0 additions & 2 deletions src/pre_process/p_main.f90
@@ -25,8 +25,6 @@ program p_main

call s_initialize_mpi_domain()

! Initialization of the MPI environment

call s_initialize_modules()

call s_read_grid()
7 changes: 2 additions & 5 deletions src/simulation/p_main.fpp
@@ -14,16 +14,13 @@
!! are only available in the volume fraction model.
program p_main

! Dependencies =============================================================

use m_global_parameters !< Definitions of the global parameters
use m_global_parameters

use m_start_up

use m_time_steppers

use m_nvtx
! ==========================================================================

implicit none

@@ -71,7 +68,7 @@ program p_main
finaltime = t_step_stop*dt
end if

call nvtxEndRange ! INIT
call nvtxEndRange

call nvtxStartRange("SIMULATION-TIME-MARCH")
! Time-stepping Loop =======================================================
1 change: 1 addition & 0 deletions toolchain/mfc/args.py
@@ -82,6 +82,7 @@ def add_common_arguments(p, mask = None):
test.add_argument("-m", "--max-attempts", type=int, default=1, help="Maximum number of attempts to run a test.")
test.add_argument( "--no-build", action="store_true", default=False, help="(Testing) Do not rebuild MFC.")
test.add_argument("--case-optimization", action="store_true", default=False, help="(GPU Optimization) Compile MFC targets with some case parameters hard-coded.")
test.add_argument( "--dry-run", action="store_true", default=False, help="Build and generate case files but do not run tests.")
test_meg = test.add_mutually_exclusive_group()
test_meg.add_argument("--generate", action="store_true", default=False, help="(Test Generation) Generate golden files.")
test_meg.add_argument("--add-new-variables", action="store_true", default=False, help="(Test Generation) If new variables are found in D/ when running tests, add them to the golden files.")
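
The new `--dry-run` option is a plain `store_true` flag, so it defaults to `False` and is read back as `args.dry_run`. The standalone sketch below shows that behavior with `argparse`. The `plan` function and its step names are hypothetical, since this diff does not show how MFC consumes the flag; they only illustrate the intent stated in the help text.

```python
import argparse

# Standalone parser with three of the options shown in the diff (help text omitted).
parser = argparse.ArgumentParser(prog="test-sketch")
parser.add_argument("-m", "--max-attempts", type=int, default=1)
parser.add_argument("--no-build", action="store_true", default=False)
parser.add_argument("--dry-run", action="store_true", default=False)


def plan(argv):
    # Hypothetical: list the stages a test invocation would go through.
    args = parser.parse_args(argv)
    steps = []
    if not args.no_build:
        steps.append("build")
    steps.append("generate case files")
    if not args.dry_run:
        steps.append("run tests")
    return args, steps


args, steps = plan(["--dry-run"])
print(args.dry_run, steps)  # True ['build', 'generate case files']
```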
3 changes: 3 additions & 0 deletions toolchain/mfc/run/run.py
Review comment (Member, Author): This allowed VERY granular views of what is happening within the toolchain when it is called.

@@ -14,6 +14,8 @@

from . import queues, input

import hunter


def __validate_job_options() -> None:
if not ARG("mpi") and any({ARG("nodes") > 1, ARG("tasks_per_node") > 1}):
@@ -133,6 +135,7 @@ def __execute_job_script(qsystem: queues.QueueSystem):
raise MFCException(f"Submitting batch file for {qsystem.name} failed. It can be found here: {__job_script_filepath()}. Please check the file for errors.")


# @hunter.wrap(local=True)
def run(targets = None, case = None):
targets = get_targets(list(REQUIRED_TARGETS) + (targets or ARG("targets")))
case = case or input.load(ARG("input"), ARG("--"))
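
The commented-out `@hunter.wrap(local=True)` decorator would trace calls made while `run` executes. To illustrate the idea without depending on `hunter`, here is a minimal stand-in built on the standard library's `sys.settrace`. It is not the `hunter` API and does far less. The names `trace_calls`, `load_case`, and `validate` are hypothetical.

```python
import sys


def trace_calls(func):
    # Stand-in decorator: record the name of every Python function entered
    # while func runs (hunter itself offers much richer filtering and output).
    def wrapper(*args, **kwargs):
        wrapper.calls = []

        def tracer(frame, event, arg):
            if event == "call":
                wrapper.calls.append(frame.f_code.co_name)
            return None  # skip per-line tracing inside the frame

        previous = sys.gettrace()
        sys.settrace(tracer)
        try:
            return func(*args, **kwargs)
        finally:
            sys.settrace(previous)  # always restore the earlier tracer

    wrapper.calls = []
    return wrapper


def load_case():
    return {"dt": 0.1}


def validate(case):
    return "dt" in case


@trace_calls
def run():
    case = load_case()
    return validate(case)


ok = run()
print(run.calls)  # ['run', 'load_case', 'validate']
```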
5 changes: 4 additions & 1 deletion toolchain/pyproject.toml
@@ -37,7 +37,10 @@ dependencies = [

# Chemistry
"cantera",
"pyrometheus==1.0.2"
"pyrometheus==1.0.2",

# Logging
"hunter"
]

[tool.hatch.metadata]