Support for DoD Carpenter #475
I think the unchecked checklist items are irrelevant to this PR. Please let me know if you want any further action.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #475   +/-   ##
=======================================
  Coverage   57.91%   57.91%
=======================================
  Files          55       55
  Lines       14230    14230
  Branches     1854     1854
=======================================
  Hits         8242     8242
  Misses       5452     5452
  Partials      536      536
Did you test that this works in both interactive and batch mode?
Batch mode works well, but interactive mode does not run cleanly: it prints several warning messages (full log below).
Fortunately, the simulation results from the two modes are the same, but I still don't understand why these warnings arise.
+-----------------------------------------------------------------------------------------------------------+
| MFC case # MFC @ /p/home/hyeoksu/MFC/MFC-Caltech/examples/1D_bubblescreen/case.py: |
+-----------------------------------------------------------------------------------------------------------+
| * Start-time 12:48:51 * Start-date 12:48:51 |
| * Partition N/A * Walltime 01:00:00 |
| * Account N/A * Nodes 1 |
| * Job Name MFC * Engine interactive |
| * QoS N/A * Binary N/A |
| * Queue System Interactive * Email N/A |
+-----------------------------------------------------------------------------------------------------------+
mfc: OK > :) Loading modules:
mfc: Loading modules (& env variables) for DoD Carpenter on CPUs:
mfc: $ module load python
mfc: $ module load gcc/12.2.0
mfc: $ module load cmake/3.28.1-gcc-12.2.0
mfc: $ module load openmpi/4.1.6
mfc: OK > Found $CRAY_LD_LIBRARY_PATH. Prepending to $LD_LIBRARY_PATH.
mfc: OK > All modules and environment variables have been loaded.
mfc: OK > :) Running syscheck:
+ mpirun -np 10 /p/home/hyeoksu/MFC/MFC-Caltech/build/install/7e71bfa6d4/bin/syscheck
[TEST] MPI: call mpi_init(ierr)
[TEST] MPI: call mpi_init(ierr)
[TEST] MPI: call mpi_init(ierr)
[TEST] MPI: call mpi_init(ierr)
[TEST] MPI: call mpi_init(ierr)
[TEST] MPI: call mpi_init(ierr)
[TEST] MPI: call mpi_init(ierr)
[TEST] MPI: call mpi_init(ierr)
[TEST] MPI: call mpi_init(ierr)
[TEST] MPI: call mpi_init(ierr)
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: carpenter04
Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: carpenter04
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: carpenter04
Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:936
Error: Function not implemented (38)
--------------------------------------------------------------------------
[TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
[TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
[TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
[TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
[TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
[TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
[TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
[TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
[TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
[TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
[TEST] MPI: call mpi_barrier(MPI_COMM_WORLD, ierr)
[TEST] MPI: call assert(rank >= 0)
[TEST] MPI: call mpi_comm_size(MPI_COMM_WORLD, nRanks, ierr)
[TEST] MPI: call assert(nRanks > 0 .and. rank < nRanks)
[SKIP] ACC: devtype = acc_get_device_type()
[SKIP] ACC: num_devices = acc_get_num_devices(devtype)
[SKIP] ACC: call assert(num_devices > 0)
[SKIP] ACC: call acc_set_device_num(mod(rank, nRanks), devtype)
[SKIP] ACC: allocate(arr(1:N))
[SKIP] ACC: !$acc enter data create(arr(1:N))
[SKIP] ACC: !$acc parallel loop
[SKIP] ACC: !$acc update host(arr(1:N))
[SKIP] ACC: !$acc exit data delete(arr)
[TEST] MPI: call mpi_barrier(MPI_COMM_WORLD, ierr)
[TEST] MPI: call mpi_finalize(ierr)
Syscheck: PASSED.
[carpenter04:80352] 9 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[carpenter04:80352] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[carpenter04:80352] 29 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[carpenter04:80352] 9 more processes have sent help message help-mtl-ofi.txt / OFI call fail
mfc: OK > :) Running pre_process:
+ mpirun -np 10 /p/home/hyeoksu/MFC/MFC-Caltech/build/install/f3387f1277/bin/pre_process
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: carpenter04
Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: carpenter04
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: carpenter04
Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:936
Error: Function not implemented (38)
--------------------------------------------------------------------------
Pre-processing a 100x0x0 case on 10 rank(s)
In convert, nbub: 2.3873241463784300E-013
In convert, nbub: 9.5492965855137212E-006
In convert, nbub: 2.3873241463784300E-013
In convert, nbub: 9.5492965855137212E-006
In convert, nbub: 2.3873241463784300E-013
In convert, nbub: 2.3873241463784300E-013
In convert, nbub: 9.5492965855137212E-006
In convert, nbub: 2.3873241463784300E-013
In convert, nbub: 2.3873241463784300E-013
Processing patch 1
Processing patch 2
In convert, nbub: 2.3873241463784300E-013
Elapsed Time 8.7239999999999540E-003
[carpenter04:80428] 9 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[carpenter04:80428] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[carpenter04:80428] 29 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[carpenter04:80428] 9 more processes have sent help message help-mtl-ofi.txt / OFI call fail
mfc: OK > :) Running simulation:
+ mpirun -np 10 /p/home/hyeoksu/MFC/MFC-Caltech/build/install/e80345507a/bin/simulation
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: carpenter04
Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: carpenter04
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: carpenter04
Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:936
Error: Function not implemented (38)
--------------------------------------------------------------------------
Simulating a regular 100x0x0 case on 10 rank(s) on CPUs.
[ 0%] Time step 1 of 761 @ t_step = 0
[ 1%] Time step 2 of 761 @ t_step = 1
[ 1%] Time step 3 of 761 @ t_step = 2
...
[100%] Time step 758 of 761 @ t_step = 757
[100%] Time step 759 of 761 @ t_step = 758
[100%] Time step 760 of 761 @ t_step = 759
Performance: NaN ns/gp/eq/rhs
[carpenter04:80542] 9 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[carpenter04:80542] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[carpenter04:80542] 29 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[carpenter04:80542] 9 more processes have sent help message help-mtl-ofi.txt / OFI call fail
+-----------------------------------------------------------------------------------------------------------+
| Finished MFC: |
| * Total-time: 17s * Exit Code: 0 |
| * End-time: 12:49:08 * End-date: 12:49:08 |
+-----------------------------------------------------------------------------------------------------------+
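The three warnings in this log are emitted by Open MPI's transport components (vader, openib, and the OFI MTL), not by MFC, and the run still finishes with exit code 0. One possible way to narrow them down is sketched below. It assumes this Open MPI 4.1.6 build honors `OMPI_MCA_*` environment variables. It un-aggregates the help messages, as the log itself suggests, and excludes the two components that report failures. Whether excluding them is appropriate for Carpenter's interconnect has not been verified here, so treat this as a diagnostic step rather than a fix.

```shell
# Diagnostic sketch only. Assumes Open MPI reads MCA parameters from
# OMPI_MCA_<name> environment variables.

# Print every help/error message instead of "N more processes have sent ...".
export OMPI_MCA_orte_base_help_aggregate=0

# Exclude the openib BTL, which reported "no cpcs for port" on mlx5_0.
export OMPI_MCA_btl='^openib'

# Exclude the OFI MTL, whose fi_domain call failed with "Function not implemented".
export OMPI_MCA_mtl='^ofi'

# Then rerun the same interactive case from this shell and compare the output.
```

If the warnings disappear and the results are unchanged, that would suggest the interactive node simply lacks access to those transports; if they persist, the un-aggregated messages should show which ranks report them.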
I'm not entirely sure
Please reference the other .mako files.
Carpenter is down right now for unknown reasons. I will try to fix this when Carpenter comes back online.
If it doesn't work in interactive mode, then you won't be able to use it to run ./mfc.sh test.
./mfc.sh test works well without such warning/error messages. It only fails on the 2 MPI Ranks tests, but I think that is a different issue.
Oh, actually, I do not know whether those test cases show warnings. The results look fine, though.
I'm curious about the cause of this. I think it may have to do with the bash version.
I'm also curious about this. Can you share the output of $SHELL --version? $SHELL should expand to the path to your default shell.
@henryleberre The output of $SHELL --version is zsh 5.6 (x86_64-suse-linux-gnu).
Even though it is zsh, it gets invoked as ./mfc.sh, and there is a /bin/bash shebang in all of the relevant scripts.
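One caveat on the shebang: it only takes effect when the script is executed (`./mfc.sh ...`). If the script is sourced instead (`. ./mfc.sh ...` or `source ./mfc.sh ...`), the current shell interprets it and the shebang line is just a comment, so a zsh login shell would parse the script as zsh. Whether that is what happened here is an assumption. Note also that `$SHELL` reports the login shell, not necessarily the interpreter that is running. A small sketch for checking this from inside a script follows; the function name `running_shell` is made up for illustration and is not part of MFC.

```shell
# Hypothetical helper (not part of mfc.sh): report which shell is
# actually interpreting this code, regardless of $SHELL or the shebang.
running_shell() {
    if [ -n "${ZSH_VERSION:-}" ]; then
        echo "zsh"
    elif [ -n "${BASH_VERSION:-}" ]; then
        echo "bash"
    else
        echo "other"
    fi
}

echo "login shell (\$SHELL): ${SHELL:-unset}"
echo "running shell:        $(running_shell)"
```

Placing the two `echo` lines near the top of the script and running it both ways (executed and sourced) would show whether the interpreter differs between the two invocations.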
Astute observation, @sbryngelson. @lee-hyeoksu, can you share why you added the quotes around the == signs?
@henryleberre Because Carpenter does not recognize == as ==, which leads to the error ./mfc.sh:25: = not found. It seems that Carpenter treats whatever comes after = as a command. For example, if I add the line echo =load, it shows the error ./mfc.sh:25: load not found.
But when I add the quotes ('=='), it works. That's why I added the quotes.
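The error format (`./mfc.sh:25: = not found`) matches zsh's `=` expansion: with the default `EQUALS` option, an unquoted word that starts with `=` is replaced by the path of the command named by the rest of the word. So `=load` looks up a command called `load`, and `==` looks up a command called `=`. That would explain why quoting helps, assuming line 25 was being parsed by zsh rather than bash. Below is a sketch of comparisons that behave the same in bash and zsh; the functions `is_load` and `is_load_case` are illustrative and are not the actual code in mfc.sh.

```shell
# Illustrative only; not the actual contents of mfc.sh.

# A single '=' is the POSIX string-equality operator for '[' and is
# not affected by zsh's =command expansion, unlike an unquoted '=='.
is_load() {
    [ "$1" = "load" ]
}

# 'case' avoids the test operator entirely.
is_load_case() {
    case "$1" in
        load) return 0 ;;
        *)    return 1 ;;
    esac
}

if is_load "load" && is_load_case "load"; then
    echo "matched"
fi
```

Quoting the operator (`'=='`), as done in this PR, also works in both bash and zsh, as does `[[ "$1" == "load" ]]`. The single `=` form has the added benefit of working in plain POSIX `sh`.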
Co-authored-by: Hyeoksu Lee <hyeoksu@carpenter04.hsn.ex4000.erdc.hpc.mil>
Co-authored-by: Hyeoksu Lee <hyeoksu@carpenter03.hsn.ex4000.erdc.hpc.mil>
Description
This PR adds a .mako file for DoD Carpenter and makes the relevant modifications to the toolchain and to mfc.sh.
Type of change
Please delete options that are not relevant.
Scope
If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.
How Has This Been Tested?
examples/2D_mixing_artificial_Ma
Checklist
(docs/)
examples/ that demonstrate my new feature performing as expected. They run to completion and demonstrate "interesting physics"
./mfc.sh format before committing my code