
Support for DoD Carpenter #475

Merged 2 commits into MFlowCode:master on Jun 24, 2024

Conversation

@hyeoksu-lee (Contributor) commented Jun 20, 2024

Description

This PR adds a .mako file for DoD Carpenter and makes the corresponding modifications to the toolchain and the mfc.sh file.

Type of change

Please delete options that are not relevant.

  • Something else

Scope

  • This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

  • Test suites
  • examples/2D_mixing_artificial_Ma
  • What computers and compilers did you use to test this: Carpenter

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics"
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

@hyeoksu-lee (Contributor, Author) commented Jun 20, 2024

I believe the unchecked checklist items are not relevant to this PR. Please let me know if any further action is needed.

codecov bot commented Jun 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 57.91%. Comparing base (37dd0f4) to head (1e15d2a).

Current head 1e15d2a differs from pull request most recent head e35017f

Please upload reports for the commit e35017f to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #475   +/-   ##
=======================================
  Coverage   57.91%   57.91%           
=======================================
  Files          55       55           
  Lines       14230    14230           
  Branches     1854     1854           
=======================================
  Hits         8242     8242           
  Misses       5452     5452           
  Partials      536      536           


Member:

Did you test that this works in both interactive and batch mode?

@hyeoksu-lee (Contributor, Author) Jun 20, 2024

Batch mode works well, but interactive mode does not: in interactive mode I get several warning messages.

Fortunately, the simulation results from the two modes are identical, but I still don't understand why these warnings arise.

+-----------------------------------------------------------------------------------------------------------+
| MFC case # MFC @ /p/home/hyeoksu/MFC/MFC-Caltech/examples/1D_bubblescreen/case.py:                        |
+-----------------------------------------------------------------------------------------------------------+
| * Start-time     12:48:51                            * Start-date     12:48:51                            |
| * Partition      N/A                                 * Walltime       01:00:00                            |
| * Account        N/A                                 * Nodes          1                                   |
| * Job Name       MFC                                 * Engine         interactive                         |
| * QoS            N/A                                 * Binary         N/A                                 |
| * Queue System   Interactive                         * Email          N/A                                 |
+-----------------------------------------------------------------------------------------------------------+

mfc: OK > :) Loading modules:

mfc: Loading modules (& env variables) for DoD Carpenter on CPUs:
mfc:  $ module load python
mfc:  $ module load gcc/12.2.0
mfc:  $ module load cmake/3.28.1-gcc-12.2.0
mfc:  $ module load openmpi/4.1.6
mfc: OK > Found $CRAY_LD_LIBRARY_PATH. Prepending to $LD_LIBRARY_PATH.
mfc: OK > All modules and environment variables have been loaded.

mfc: OK > :) Running syscheck:

+ mpirun -np 10 /p/home/hyeoksu/MFC/MFC-Caltech/build/install/7e71bfa6d4/bin/syscheck
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
 [TEST] MPI: call mpi_init(ierr)
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: carpenter04
  Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           carpenter04
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: carpenter04
  Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:936
  Error: Function not implemented (38)
--------------------------------------------------------------------------
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
 [TEST] MPI: call mpi_barrier(MPI_COMM_WORLD, ierr)
 [TEST] MPI: call assert(rank >= 0)
 [TEST] MPI: call mpi_comm_size(MPI_COMM_WORLD, nRanks, ierr)
 [TEST] MPI: call assert(nRanks > 0 .and. rank < nRanks)
 [SKIP] ACC: devtype = acc_get_device_type()
 [SKIP] ACC: num_devices = acc_get_num_devices(devtype)
 [SKIP] ACC: call assert(num_devices > 0)
 [SKIP] ACC: call acc_set_device_num(mod(rank, nRanks), devtype)
 [SKIP] ACC: allocate(arr(1:N))
 [SKIP] ACC: !$acc enter data create(arr(1:N))
 [SKIP] ACC: !$acc parallel loop
 [SKIP] ACC: !$acc update host(arr(1:N))
 [SKIP] ACC: !$acc exit data delete(arr)
 [TEST] MPI: call mpi_barrier(MPI_COMM_WORLD, ierr)
 [TEST] MPI: call mpi_finalize(ierr)

 Syscheck: PASSED.
[carpenter04:80352] 9 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[carpenter04:80352] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[carpenter04:80352] 29 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[carpenter04:80352] 9 more processes have sent help message help-mtl-ofi.txt / OFI call fail

mfc: OK > :) Running pre_process:

+ mpirun -np 10 /p/home/hyeoksu/MFC/MFC-Caltech/build/install/f3387f1277/bin/pre_process
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: carpenter04
  Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           carpenter04
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: carpenter04
  Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:936
  Error: Function not implemented (38)
--------------------------------------------------------------------------
 Pre-processing a 100x0x0 case on 10 rank(s)
 In convert, nbub:   2.3873241463784300E-013
 In convert, nbub:   9.5492965855137212E-006
 In convert, nbub:   2.3873241463784300E-013
 In convert, nbub:   9.5492965855137212E-006
 In convert, nbub:   2.3873241463784300E-013
 In convert, nbub:   2.3873241463784300E-013
 In convert, nbub:   9.5492965855137212E-006
 In convert, nbub:   2.3873241463784300E-013
 In convert, nbub:   2.3873241463784300E-013
 Processing patch           1
 Processing patch           2
 In convert, nbub:   2.3873241463784300E-013
 Elapsed Time   8.7239999999999540E-003
[carpenter04:80428] 9 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[carpenter04:80428] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[carpenter04:80428] 29 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[carpenter04:80428] 9 more processes have sent help message help-mtl-ofi.txt / OFI call fail

mfc: OK > :) Running simulation:

+ mpirun -np 10 /p/home/hyeoksu/MFC/MFC-Caltech/build/install/e80345507a/bin/simulation
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: carpenter04
  Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           carpenter04
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: carpenter04
  Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:936
  Error: Function not implemented (38)
--------------------------------------------------------------------------
 Simulating a regular 100x0x0 case on 10 rank(s) on CPUs.
 [  0%]  Time step        1 of 761 @ t_step = 0
 [  1%]  Time step        2 of 761 @ t_step = 1
 [  1%]  Time step        3 of 761 @ t_step = 2
...
 [100%]  Time step      758 of 761 @ t_step = 757
 [100%]  Time step      759 of 761 @ t_step = 758
 [100%]  Time step      760 of 761 @ t_step = 759
 Performance:                        NaN  ns/gp/eq/rhs
[carpenter04:80542] 9 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[carpenter04:80542] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[carpenter04:80542] 29 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[carpenter04:80542] 9 more processes have sent help message help-mtl-ofi.txt / OFI call fail

+-----------------------------------------------------------------------------------------------------------+
| Finished MFC:                                                                                             |
| * Total-time:    17s                                 * Exit Code:     0                                   |
| * End-time:      12:49:08                            * End-date:      12:49:08                            |
+-----------------------------------------------------------------------------------------------------------+
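As an aside, the three Open MPI warnings in the log above (vader/xpmem, openib, and OFI) are typically benign on nodes that lack the matching fabric support, and they can often be silenced by disabling the offending components through MCA environment variables before calling mpirun. A minimal sketch, assuming Open MPI 4.x; this has not been verified on Carpenter:

```shell
# Hypothetical workaround (not part of this PR), assuming Open MPI 4.x.
# Disable the openib BTL (no usable OpenFabrics CPCs were found) and the
# OFI MTL (fi_domain fails), and stop the vader BTL from trying xpmem.
export OMPI_MCA_btl="^openib"                          # skip the openib BTL
export OMPI_MCA_mtl="^ofi"                             # skip the OFI MTL
export OMPI_MCA_btl_vader_single_copy_mechanism="none" # no xpmem segment
echo "btl=$OMPI_MCA_btl mtl=$OMPI_MCA_mtl"
```

With these set, mpirun should fall back to shared-memory/TCP transports; whether that performance is acceptable for production runs on Carpenter is a separate question.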

Member:

I'm not entirely sure.

Member:

Please reference the other .mako files.

Contributor Author:

Carpenter is down right now for unknown reasons. I will try to fix this when Carpenter comes back online.

Member:

If it doesn't work in interactive mode, then you won't be able to use it to run ./mfc.sh test.

Contributor Author:

./mfc.sh test works well, without those warning/error messages. It just fails on the 2-MPI-rank tests, but I think that is a different issue.

Contributor Author:

Actually, I do not know whether those test cases show warnings. The results look fine, though.

Member:

I'm curious about the cause of this. I think it may have to do with the bash version.

Member:

I'm also curious about this. Can you share the output of $SHELL --version? $SHELL should expand to the path of your default shell.

Contributor Author:

@henryleberre The $SHELL --version output is zsh 5.6 (x86_64-suse-linux-gnu).

Member:

Even though it is zsh, the script gets invoked as ./mfc.sh, and there is a /bin/bash shebang in all of the relevant scripts.
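The point above can be sketched with a hypothetical two-line script: the kernel dispatches on the shebang line, so a script invoked as ./script.sh runs under bash regardless of the user's login shell being zsh.

```shell
# Hypothetical demo (not part of the PR): the shebang, not $SHELL,
# selects the interpreter that runs a script invoked by path.
cat > /tmp/shebang_demo.sh <<'EOF'
#!/bin/bash
# BASH_VERSION is only set when bash itself is the interpreter.
echo "interpreter: bash ${BASH_VERSION%%.*}"
EOF
chmod +x /tmp/shebang_demo.sh
/tmp/shebang_demo.sh   # prints "interpreter: bash <major>"
```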

Member:

Astute observation, @sbryngelson. @lee-hyeoksu, can you share why you added the quotes around the == signs?

Contributor Author:

@henryleberre Because Carpenter's shell does not recognize a bare == as a literal ==, which leads to the error ./mfc.sh:25: = not found. It seems the shell treats whatever comes after = as a command: for example, the line echo =load produces the error ./mfc.sh:25: load not found.

But when I add quotes ('=='), it works. That's why I added the quotes.
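The behavior described here matches zsh's =command filename expansion: an unquoted word that begins with = is replaced by the full path of the named command, and the expansion fails loudly when no such command exists. A small bash-compatible sketch of the quoting fix (the names below are illustrative, not taken from mfc.sh):

```shell
# Sketch of the quoting fix adopted in this PR. Under zsh, an unquoted
# word starting with "=" triggers =command expansion (so "echo =load"
# fails with "load not found"); quoting the word disables it, and a
# quoted '==' still works as the equality operator in [ ... ].
engine="interactive"
if [ "$engine" "==" "interactive" ]; then
    echo "engine matched"
fi
```

The quoted form behaves identically under bash, so it is safe regardless of which shell ends up interpreting the line.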

@sbryngelson sbryngelson merged commit 5122101 into MFlowCode:master Jun 24, 2024
19 checks passed
AiredaleDev pushed a commit to AiredaleDev/MFC that referenced this pull request Jun 28, 2024
Co-authored-by: Hyeoksu Lee <hyeoksu@carpenter04.hsn.ex4000.erdc.hpc.mil>
Co-authored-by: Hyeoksu Lee <hyeoksu@carpenter03.hsn.ex4000.erdc.hpc.mil>
@hyeoksu-lee hyeoksu-lee deleted the carpenter branch July 13, 2024 23:24