Skip to content

Conversation

@t00sa
Copy link
Collaborator

@t00sa t00sa commented Nov 26, 2025

Description

The get_handle() method should return an MPI_Comm and not an int, because this causes failures when building with OpenMPI. This changes the prototype in the header file and the method in the source file to address the problem.

Linked issues

NA

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Bug fix

How has this been tested?

Prior to the fix, code would not compile with OpenMPI:

/var/tmp/persistent/Development/vernier/src/c++/writer/single.cpp: In member function ‘virtual void meto::SingleFile::write(meto::hashvec_t)’:
/var/tmp/persistent/Development/vernier/src/c++/writer/single.cpp:59:40: error: invalid conversion from ‘int’ to ‘MPI_Comm’ {aka ‘ompi_communicator_t*’} [-fpermissive]
   59 |                 mpi_context_.get_handle());
      |                 ~~~~~~~~~~~~~~~~~~~~~~~^~
      |                                        |
      |                                        int
In file included from /var/tmp/persistent/Development/vernier/src/c++/writer/../vernier_mpi.h:20,
                 from /var/tmp/persistent/Development/vernier/src/c++/writer/../mpi_context.h:26,
                 from /var/tmp/persistent/Development/vernier/src/c++/writer/single.h:20,
                 from /var/tmp/persistent/Development/vernier/src/c++/writer/single.cpp:8:
/home/users/sam.clarke/spack/opt/linux-rhel9-zen2/gcc-12.2.0/openmpi-5.0.5-z3frthbl74nsqm6vxyd5xlbcbd6yv4qv/include/mpi.h:1416:54: note:   initializing argument 6 of ‘int MPI_Allreduce(const void*, void*, int, MPI_Datatype, MPI_Op, MPI_Comm)’
 1416 |                                  MPI_Op op, MPI_Comm comm);
      |                                             ~~~~~~~~~^~~~
/var/tmp/persistent/Development/vernier/src/c++/writer/single.cpp:70:40: error: invalid conversion from ‘int’ to ‘MPI_Comm’ {aka ‘ompi_communicator_t*’} [-fpermissive]
   70 |   MPI_File_open(mpi_context_.get_handle(), filename.c_str(),
      |                 ~~~~~~~~~~~~~~~~~~~~~~~^~
      |                                        |
      |                                        int
In file included from /var/tmp/persistent/Development/vernier/src/c++/writer/../vernier_mpi.h:20,
                 from /var/tmp/persistent/Development/vernier/src/c++/writer/../mpi_context.h:26,
                 from /var/tmp/persistent/Development/vernier/src/c++/writer/single.h:20,
                 from /var/tmp/persistent/Development/vernier/src/c++/writer/single.cpp:8:
/home/users/sam.clarke/spack/opt/linux-rhel9-zen2/gcc-12.2.0/openmpi-5.0.5-z3frthbl74nsqm6vxyd5xlbcbd6yv4qv/include/mpi.h:1574:43: note:   initializing argument 1 of ‘int MPI_File_open(MPI_Comm, const char*, int, MPI_Info, ompi_file_t**)’
 1574 | OMPI_DECLSPEC  int MPI_File_open(MPI_Comm comm, const char *filename, int amode,
      |

Following the change, the code now compiles successfully with OpenMPI. Change was also validated by building with MPICH.

Checklist:

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes, for both debug and optimised builds

The get_handle() method should return an MPI_Comm and not an int,
because this causes failures when building with OpenMPI.
Copy link
Collaborator

@andrewcoughtrie andrewcoughtrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change itself looks good, I've tested it with an OpenMPI based stack and it built successfully, however one of the unit-tests fails with the following error and will need sorting out.

[ RUN      ] MPINotInitTest.MpiNotInitialised
*******************/Vernier/tests/unit_tests/c++/test_mpi_not_init.cpp:20: Failure
Death test: { meto::vernier.init(); }
    Result: died but not with expected exit code:
            Exited with exit status 14
Actual msg:
[  DEATH   ] MPIContext::init. MPI not initialized.
[  DEATH   ] *** The MPI_Abort() function was called before MPI_INIT was invoked.
[  DEATH   ] *** This is disallowed by the MPI standard.
[  DEATH   ] *** Your MPI job will now abort.
[  DEATH   ] [************] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[  DEATH   ] 

[  FAILED  ] MPINotInitTest.MpiNotInitialised (2 ms)
[----------] 1 test from MPINotInitTest (2 ms total)

@t00sa
Copy link
Collaborator Author

t00sa commented Jan 8, 2026

The test output is correct: Vernier does not comply with the MPI standard because it calls MPI_Abort in error_handler.cpp even where MPI has not been initialised. From standard:

The only MPI functions that may be invoked before the MPI initialization routines are called are MPI_GET_VERSION, MPI_GET_LIBRARY_VERSION, MPI_INITIALIZED, MPI_FINALIZED, and any function with the prefix MPI_T_ (within the constraints for functions with this prefix listed in Section Initialization and Finalization).

MPICH is clearly willing to tolerate this but OpenMPI is not. It should be possible to resolve the problem by protecting the MPI_Abort call by first calling MPI_Initialised. Given that the intent of this ticket was resolve a portability problem found when OpenMPI testing was introduced, I think it's reasonable to fix this bug as part of the same ticket.

Check whether MPI is initialised before trying to call MPI_Abort.
This improves compliance with the standard and prevents test failures
when building with OpenMPI.
@t00sa
Copy link
Collaborator Author

t00sa commented Jan 8, 2026

Clean run with OpenMPI:

Running tests...
Test project /var/tmp/sam.clarke/Vernier/build
      Start  1: SystemTest.TimingTest
 1/21 Test  #1: SystemTest.TimingTest ...................   Passed    5.05 sec
      Start  2: HashEntryTest.TimingsTest
 2/21 Test  #2: HashEntryTest.TimingsTest ...............   Passed    2.05 sec
      Start  3: RegionNameTest.NamesMatchTest
 3/21 Test  #3: RegionNameTest.NamesMatchTest ...........   Passed    0.05 sec
      Start  4: HashTableTest.HashFunctionTest
 4/21 Test  #4: HashTableTest.HashFunctionTest ..........   Passed    0.07 sec
      Start  5: HashTableTest.UpdateTimesTest
 5/21 Test  #5: HashTableTest.UpdateTimesTest ...........   Passed    2.05 sec
      Start  6: HashTableTest.NameLengthTest
 6/21 Test  #6: HashTableTest.NameLengthTest ............   Passed    0.05 sec
      Start  7: DeathTest.WrongHashTest
 7/21 Test  #7: DeathTest.WrongHashTest .................   Passed    0.06 sec
      Start  8: DeathTest.StopBeforeStartTest
 8/21 Test  #8: DeathTest.StopBeforeStartTest ...........   Passed    0.05 sec
      Start  9: DeathTest.StartBeforeInit
 9/21 Test  #9: DeathTest.StartBeforeInit ...............   Passed    0.06 sec
      Start 10: DeathTest.NullCommunicatorPassed
10/21 Test #10: DeathTest.NullCommunicatorPassed ........   Passed    0.06 sec
      Start 11: DeathTest.VernierUninitialisedInWrite
11/21 Test #11: DeathTest.VernierUninitialisedInWrite ...   Passed    0.07 sec
      Start 12: DeathTest.TooManyTracebackEntries
12/21 Test #12: DeathTest.TooManyTracebackEntries .......   Passed    0.07 sec
      Start 13: DeathTest.InvalidIOModeTest
13/21 Test #13: DeathTest.InvalidIOModeTest .............   Passed    0.06 sec
      Start 14: HashEntryTest.CallCountTest
14/21 Test #14: HashEntryTest.CallCountTest .............   Passed    0.07 sec
      Start 15: RecursionTest.DirectRecursion
15/21 Test #15: RecursionTest.DirectRecursion ...........   Passed    3.05 sec
      Start 16: RecursionTest.IndirectRecursion
16/21 Test #16: RecursionTest.IndirectRecursion .........   Passed    5.05 sec
      Start 17: MPINotInitTest.MpiNotInitialised
17/21 Test #17: MPINotInitTest.MpiNotInitialised ........   Passed    0.02 sec
      Start 18: SystemTests_TestTags
18/21 Test #18: SystemTests_TestTags ....................   Passed    0.34 sec
      Start 19: SystemTests_TestOutputs
19/21 Test #19: SystemTests_TestOutputs .................   Passed    0.13 sec
      Start 20: SystemTests_TestVernierMPIHeaders
20/21 Test #20: SystemTests_TestVernierMPIHeaders .......   Passed    0.14 sec
      Start 21: SystemTests_TestMPIOutputs
21/21 Test #21: SystemTests_TestMPIOutputs ..............   Passed    0.15 sec

100% tests passed, 0 tests failed out of 21

Total Test time (real) =  18.72 sec

@t00sa t00sa requested a review from andrewcoughtrie January 8, 2026 20:12
Copy link
Collaborator

@andrewcoughtrie andrewcoughtrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good change and fixes the issue. All the tests pass with OpenMPI

@andrewcoughtrie andrewcoughtrie added this pull request to the merge queue Jan 20, 2026
Merged via the queue into MetOffice:main with commit bd58dd1 Jan 20, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants