Support for DoD Carpenter #475

Merged · 2 commits · Jun 24, 2024

10 changes: 5 additions & 5 deletions mfc.sh
Member

I'm curious about the cause of this. I think it may have to do with the bash version.

Member

I'm also curious about this. Can you share the output of $SHELL --version? $SHELL should expand to the path of your default shell.
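
Note that $SHELL reports the login (default) shell, which is not necessarily the shell interpreting a given script; a quick way to see both (a small sketch, not from this thread):

```sh
echo "$SHELL"          # the default login shell, e.g. /usr/bin/zsh
"$SHELL" --version     # its version string
ps -p $$ -o comm=      # the shell actually running the current session
```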

Contributor Author

@henryleberre The $SHELL --version output is zsh 5.6 (x86_64-suse-linux-gnu).

Member

Even though the default shell is zsh, the script gets invoked as ./mfc.sh, and there is a /bin/bash shebang in all of the relevant scripts.
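
A side note (an inference from the return statement on the load path, not something confirmed in this thread): the shebang only takes effect when a script is executed; when a file is sourced, the current shell parses it and the shebang line is just a comment. Since the load path in mfc.sh ends in return, it has to be sourced, so zsh can still end up parsing it. A minimal sketch with a hypothetical demo.sh:

```sh
#!/bin/bash
# demo.sh (hypothetical): report which shell is actually parsing this file
echo "parsed by: $(ps -p $$ -o comm=)"

# From an interactive zsh session:
#   ./demo.sh    -> parsed by: bash   (executed: the shebang selects bash)
#   . ./demo.sh  -> parsed by: zsh    (sourced: the current shell parses it,
#                                      and the shebang is only a comment)
```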

Member

Astute observation, @sbryngelson. @lee-hyeoksu, can you share why you added the quotes around the == signs?

Contributor Author

@henryleberre Because the shell on Carpenter does not recognize == as a literal ==, which leads to the error ./mfc.sh:25: = not found. It seems like whatever comes after = is treated as a command. For example, if I put a line like echo =load in the script, it fails with ./mfc.sh:25: load not found.

But when I add the quotes ('=='), it works. That's why I added the quotes.
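
For what it's worth, this matches zsh's =command filename expansion: an unquoted word beginning with = is replaced by the full path of the named command, and the expansion aborts with "... not found" when no such command exists. A minimal sketch of the failure, of the quoting workaround used in this PR, and of the POSIX single-= alternative (not part of this PR):

```sh
# In zsh, an unquoted word starting with "=" triggers =command expansion:
#   echo =ls       -> /usr/bin/ls (or wherever ls lives)
#   echo =load     -> "load not found" if no command named "load" exists
#   [ "$1" == x ]  -> "= not found" (zsh looks for a command named "=")

# Workaround used in this PR: quote the operator so no expansion happens
if [ "$1" '==' 'load' ]; then echo "load mode"; fi

# Portable alternative (not part of this PR): POSIX test uses a single "="
if [ "$1" = 'load' ]; then echo "load mode"; fi

# zsh-only alternative: disable =command expansion entirely
# setopt no_equals
```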

@@ -22,19 +22,19 @@ if [ -d "$(pwd)/bootstrap" ] || [ -d "$(pwd)/dependencies" ] || [ -f "$(pwd)/bui
fi

# If the user wishes to run the "load" script
if [ "$1" == 'load' ]; then
if [ "$1" '==' 'load' ]; then
shift; . "$(pwd)/toolchain/bootstrap/modules.sh" $@; return
elif [ "$1" == "lint" ]; then
elif [ "$1" '==' "lint" ]; then
. "$(pwd)/toolchain/bootstrap/python.sh"

shift; . "$(pwd)/toolchain/bootstrap/lint.sh" $@; exit 0
elif [ "$1" == "format" ]; then
elif [ "$1" '==' "format" ]; then
. "$(pwd)/toolchain/bootstrap/python.sh"

shift; . "$(pwd)/toolchain/bootstrap/format.sh" $@; exit 0
elif [ "$1" == "docker" ]; then
elif [ "$1" '==' "docker" ]; then
shift; . "$(pwd)/toolchain/bootstrap/docker.sh" $@; exit 0
elif [ "$1" == "venv" ]; then
elif [ "$1" '==' "venv" ]; then
shift; . "$(pwd)/toolchain/bootstrap/python.sh" $@; return
fi

7 changes: 4 additions & 3 deletions toolchain/bootstrap/modules.sh
@@ -23,7 +23,8 @@ if [ -v $u_c ]; then
log "$C""ACCESS$W: Bridges2 (b) | Expanse (e) | Delta (d)"
log "$Y""Gatech$W: Phoenix (p)"
log "$R""Caltech$W: Richardson (r)"
log_n "($G""a$W/$G""f$W/$G""s$W/$G""w$W/$C""b$W/$C""e$CR/$C""d$CR/$Y""p$CR/$R""r$CR): "
log "$B""DoD$W: Carpenter (c)"
log_n "($G""a$W/$G""f$W/$G""s$W/$G""w$W/$C""b$W/$C""e$CR/$C""d$CR/$Y""p$CR/$R""r$CR/$B""c$CR): "
read u_c
log
fi
@@ -42,9 +43,9 @@ fi
u_c=$(echo "$u_c" | tr '[:upper:]' '[:lower:]')
u_cg=$(echo "$u_cg" | tr '[:upper:]' '[:lower:]')

if [ "$u_cg" == 'c' ] || [ "$u_cg" == 'cpu' ]; then
if [ "$u_cg" '==' 'c' ] || [ "$u_cg" '==' 'cpu' ]; then
CG='CPU'; cg='cpu'
elif [ "$u_cg" == "g" ] || [ "$u_cg" == 'gpu' ]; then
elif [ "$u_cg" '==' "g" ] || [ "$u_cg" '==' 'gpu' ]; then
CG='GPU'; cg='gpu'
fi

5 changes: 4 additions & 1 deletion toolchain/modules
@@ -54,9 +54,12 @@ f-all cray-fftw cray-hdf5 cray-mpich/8.1.26 cce/16.0.1
f-all rocm/5.5.1 cray-python omniperf
f-cpu


d NCSA Delta
d-all python/3.11.6
d-cpu gcc/11.4.0 openmpi
d-gpu nvhpc/22.11 openmpi+cuda/4.1.5+cuda cmake
d-gpu CC=nvc CXX=nvc++ FC=nvfortran

c DoD Carpenter
c-all python
c-cpu gcc/12.2.0 cmake/3.28.1-gcc-12.2.0 openmpi/4.1.6
49 changes: 49 additions & 0 deletions toolchain/templates/carpenter.mako
Member

Did you test that this works in both interactive and batch mode?

Contributor Author (@hyeoksu-lee, Jun 20, 2024)

Batch mode works great, but interactive mode does not: in interactive mode I get some warning messages.

Fortunately, the simulation results from the two modes are the same, but I still don't understand why this happens.

+-----------------------------------------------------------------------------------------------------------+
| MFC case # MFC @ /p/home/hyeoksu/MFC/MFC-Caltech/examples/1D_bubblescreen/case.py:                        |
+-----------------------------------------------------------------------------------------------------------+
| * Start-time     12:48:51                            * Start-date     12:48:51                            |
| * Partition      N/A                                 * Walltime       01:00:00                            |
| * Account        N/A                                 * Nodes          1                                   |
| * Job Name       MFC                                 * Engine         interactive                         |
| * QoS            N/A                                 * Binary         N/A                                 |
| * Queue System   Interactive                         * Email          N/A                                 |
+-----------------------------------------------------------------------------------------------------------+

mfc: OK > :) Loading modules:

mfc: Loading modules (& env variables) for DoD Carpenter on CPUs:
mfc:  $ module load python
mfc:  $ module load gcc/12.2.0
mfc:  $ module load cmake/3.28.1-gcc-12.2.0
mfc:  $ module load openmpi/4.1.6
mfc: OK > Found $CRAY_LD_LIBRARY_PATH. Prepending to $LD_LIBRARY_PATH.
mfc: OK > All modules and environment variables have been loaded.

mfc: OK > :) Running syscheck:

+ mpirun -np 10 /p/home/hyeoksu/MFC/MFC-Caltech/build/install/7e71bfa6d4/bin/syscheck
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: carpenter04
  Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           carpenter04
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: carpenter04
  Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:936
  Error: Function not implemented (38)
--------------------------------------------------------------------------
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_barrier(MPI_COMM_WORLD, ierr)
 [TEST] MPI: call assert(rank >= 0)
 [TEST] MPI: call mpi_comm_size(MPI_COMM_WORLD, nRanks, ierr)
 [TEST] MPI: call assert(nRanks > 0 .and. rank < nRanks)
 [SKIP] ACC: devtype = acc_get_device_type()
 [SKIP] ACC: num_devices = acc_get_num_devices(devtype)
 [SKIP] ACC: call assert(num_devices > 0)
 [SKIP] ACC: call acc_set_device_num(mod(rank, nRanks), devtype)
 [SKIP] ACC: allocate(arr(1:N))
 [SKIP] ACC: !$acc enter data create(arr(1:N))
 [SKIP] ACC: !$acc parallel loop
 [SKIP] ACC: !$acc update host(arr(1:N))
 [SKIP] ACC: !$acc exit data delete(arr)
 [TEST] MPI: call mpi_barrier(MPI_COMM_WORLD, ierr)
 [TEST] MPI: call mpi_finalize(ierr)

 Syscheck: PASSED.
[carpenter04:80352] 9 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[carpenter04:80352] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[carpenter04:80352] 29 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[carpenter04:80352] 9 more processes have sent help message help-mtl-ofi.txt / OFI call fail

mfc: OK > :) Running pre_process:

+ mpirun -np 10 /p/home/hyeoksu/MFC/MFC-Caltech/build/install/f3387f1277/bin/pre_process
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: carpenter04
  Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           carpenter04
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: carpenter04
  Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:936
  Error: Function not implemented (38)
--------------------------------------------------------------------------
 Pre-processing a 100x0x0 case on 10 rank(s)
 In convert, nbub:   2.3873241463784300E-013
 In convert, nbub:   9.5492965855137212E-006
 In convert, nbub:   2.3873241463784300E-013
 In convert, nbub:   9.5492965855137212E-006
 In convert, nbub:   2.3873241463784300E-013
 In convert, nbub:   2.3873241463784300E-013
 In convert, nbub:   9.5492965855137212E-006
 In convert, nbub:   2.3873241463784300E-013
 In convert, nbub:   2.3873241463784300E-013
 Processing patch           1
 Processing patch           2
 In convert, nbub:   2.3873241463784300E-013
 Elapsed Time   8.7239999999999540E-003
[carpenter04:80428] 9 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[carpenter04:80428] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[carpenter04:80428] 29 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[carpenter04:80428] 9 more processes have sent help message help-mtl-ofi.txt / OFI call fail

mfc: OK > :) Running simulation:

+ mpirun -np 10 /p/home/hyeoksu/MFC/MFC-Caltech/build/install/e80345507a/bin/simulation
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: carpenter04
  Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           carpenter04
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: carpenter04
  Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:936
  Error: Function not implemented (38)
--------------------------------------------------------------------------
 Simulating a regular 100x0x0 case on 10 rank(s) on CPUs.
 [  0%]  Time step        1 of 761 @ t_step = 0
 [  1%]  Time step        2 of 761 @ t_step = 1
 [  1%]  Time step        3 of 761 @ t_step = 2
...
 [100%]  Time step      758 of 761 @ t_step = 757
 [100%]  Time step      759 of 761 @ t_step = 758
 [100%]  Time step      760 of 761 @ t_step = 759
 Performance:                        NaN  ns/gp/eq/rhs
[carpenter04:80542] 9 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[carpenter04:80542] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[carpenter04:80542] 29 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[carpenter04:80542] 9 more processes have sent help message help-mtl-ofi.txt / OFI call fail

+-----------------------------------------------------------------------------------------------------------+
| Finished MFC:                                                                                             |
| * Total-time:    17s                                 * Exit Code:     0                                   |
| * End-time:      12:49:08                            * End-date:      12:49:08                            |
+-----------------------------------------------------------------------------------------------------------+
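
The messages above are Open MPI probing transports (xpmem/vader shared memory, openib, OFI/libfabric) that are not usable on this node; the components fall back, which is consistent with the two modes producing the same results. If the noise becomes a problem, it can usually be silenced by disabling the offending components through MCA parameters. A sketch, offered as a suggestion rather than part of this PR, with component names that are assumptions about this Open MPI build:

```sh
# Disable the transports that fail to initialize on the interactive node
export OMPI_MCA_btl_vader_single_copy_mechanism=none   # skip the xpmem path
export OMPI_MCA_btl="^openib"                          # drop the openib BTL
export OMPI_MCA_mtl="^ofi"                             # drop the OFI (libfabric) MTL

# MFC_BIN_DIR is a placeholder for the install path printed in the log above
mpirun -np 10 "$MFC_BIN_DIR/syscheck"
```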

Member

I'm not entirely sure

Member

Please reference the other .mako files

Contributor Author

Carpenter is down right now for unknown reasons. I will try to fix this when Carpenter comes back online.

Member

If it doesn't work in interactive mode, then you won't be able to use it to run ./mfc.sh test.

Contributor Author

./mfc.sh test works well without such warning/error messages. It just fails on the 2-MPI-rank tests, but I think this is a different issue.

Contributor Author

Oh, actually I do not know whether those test cases show warnings. The results look fine, though.

@@ -0,0 +1,49 @@
#!/usr/bin/env bash

<%namespace name="helpers" file="helpers.mako"/>

% if engine == 'batch':
#PBS -l select=${nodes}:ncpus=192:mpiprocs=${tasks_per_node}
#PBS -N "${name}"
#PBS -l walltime=${walltime}
% if partition:
#PBS -q ${partition}
% endif
% if account:
#PBS -A ${account}
% endif
% if email:
#PBS -M ${email}
#PBS -m abe
% endif
#PBS -o "${name}.out"
#PBS -e "${name}.err"
#PBS -V
% endif

${helpers.template_prologue()}

ok ":) Loading modules:\n"
cd "${MFC_ROOTDIR}"
. ./mfc.sh load -c c -m ${'g' if gpu else 'c'}
cd - > /dev/null
echo


% for target in targets:
${helpers.run_prologue(target)}

% if not mpi:
(set -x; ${profiler} "${target.get_install_binpath(case)}")
% else:
(set -x; ${profiler} \
mpirun -np ${nodes*tasks_per_node} \
"${target.get_install_binpath(case)}")
% endif

${helpers.run_epilogue(target)}

echo
% endfor

${helpers.template_epilogue()}
4 changes: 2 additions & 2 deletions toolchain/util.sh
@@ -2,10 +2,10 @@

if [ -t 1 ]; then
RED="\x1B[31m"; CYAN="\x1B[36m"; GREEN="\x1B[32m"
YELLOW="\x1B[33m"; MAGENTA="\x1B[35m"; COLOR_RESET="\033[m"
YELLOW="\x1B[33m"; MAGENTA="\x1B[35m"; BLUE="\x1B[34m"; COLOR_RESET="\033[m"

R=$RED; C=$CYAN; G=$GREEN
Y=$YELLOW; M=$MAGENTA; CR=$COLOR_RESET; W=$CR
Y=$YELLOW; M=$MAGENTA; B=$BLUE; CR=$COLOR_RESET; W=$CR
fi

log() { echo -e "$CYAN"mfc"$COLOR_RESET: $1$COLOR_RESET"; }