During the build of OpenMPI-4.1.1 with the foss-2021a toolchain, I get the following error:
[1636483897.276373] [ip-AC125812:109544:0] mm_ep.c:154 UCX ERROR mm ep failed to connect to remote FIFO id 0xc00000084001abe5: Shared memory error
[ip-AC125812:109544] pml_ucx.c:419 Error: ucp_ep_create(proc=0) failed: Shared memory error
[1636483897.280964] [ip-AC125812:109542:0] mm_posix.c:194 UCX ERROR open(file_name=/proc/109541/fd/33 flags=0x0) failed: No such file or directory
[1636483897.281006] [ip-AC125812:109542:0] mm_ep.c:154 UCX ERROR mm ep failed to connect to remote FIFO id 0xc00000084001abe5: Shared memory error
[ip-AC125812:109542] pml_ucx.c:419 Error: ucp_ep_create(proc=0) failed: Shared memory error
[1636483897.281576] [ip-AC125812:109543:0] mm_posix.c:194 UCX ERROR open(file_name=/proc/109541/fd/33 flags=0x0) failed: No such file or directory
[1636483897.281602] [ip-AC125812:109543:0] mm_ep.c:154 UCX ERROR mm ep failed to connect to remote FIFO id 0xc00000084001abe5: Shared memory error
[ip-AC125812:109543] pml_ucx.c:419 Error: ucp_ep_create(proc=0) failed: Shared memory error
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[ip-AC125812:109544] *** An error occurred in MPI_Init
[ip-AC125812:109544] *** reported by process [2187460609,3]
[ip-AC125812:109544] *** on a NULL communicator
[ip-AC125812:109544] *** Unknown error
[ip-AC125812:109544] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-AC125812:109544] *** and potentially your MPI job)
[ip-AC125812:109519] 2 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[ip-AC125812:109519] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-AC125812:109519] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
) (at easybuild/framework/easyblock.py:3311 in _sanity_check_step)
== 2021-11-09 18:51:42,522 build_log.py:265 INFO ... (took 17 secs)
== 2021-11-09 18:51:42,522 filetools.py:1971 INFO Removing lock /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/software/.locks/_scratch_cor22_bin_BUILD_EB_janus_easybuild_instances_hbv2_2021a_software_OpenMPI_4.1.1-GCC-10.3.0.lock...
== 2021-11-09 18:51:42,530 filetools.py:380 INFO Path /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/software/.locks/_scratch_cor22_bin_BUILD_EB_janus_easybuild_instances_hbv2_2021a_software_OpenMPI_4.1.1-GCC-10.3.0.lock successfully removed.
== 2021-11-09 18:51:42,530 filetools.py:1975 INFO Lock removed: /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/software/.locks/_scratch_cor22_bin_BUILD_EB_janus_easybuild_instances_hbv2_2021a_software_OpenMPI_4.1.1-GCC-10.3.0.lock
== 2021-11-09 18:51:42,530 easyblock.py:3915 WARNING build failed (first 300 chars): Sanity check failed: sanity check command mpirun -n 4 /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/build/OpenMPI/4.1.1/GCC-10.3.0/mpi_test_hello_usempi exited with code 1 (output: [1636483897.276312] [ip-AC125812:109544:0] mm_posix.c:194 UCX ERROR open(file_name=/proc/109
== 2021-11-09 18:51:42,531 easyblock.py:307 INFO Closing log for application name OpenMPI version 4.1.1
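For reference, the failing sanity-check command reported in the log can be re-run by hand against the build directory, which makes it easier to experiment with settings (the path below is the one from the log; that the right module environment is still loaded at that point is an assumption):

# Re-run the EasyBuild sanity-check command manually (path taken from the log above)
mpirun -n 4 /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/build/OpenMPI/4.1.1/GCC-10.3.0/mpi_test_hello_usempi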
It looks like it is related to UCX trying to open a file under /proc/<pid>/fd that no longer exists.
Has anyone seen this error and found a fix?
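In case it helps with debugging: the open() failure appears to come from UCX's posix shared-memory transport (mm_posix.c) trying to attach to a peer's segment via a /proc/<pid>/fd link. A quick way to check whether that path is the culprit might be to re-run the test with that mechanism disabled, or with the posix transport avoided. The settings below are standard UCX / Open MPI knobs, but that they actually cure this particular failure is only an assumption:

# Assumption: the /proc/<pid>/fd link mechanism is what fails; tell UCX not to use it
UCX_POSIX_USE_PROC_LINK=n mpirun -n 4 ./mpi_test_hello_usempi

# Or avoid the posix shared-memory transport entirely (explicit transport list)
UCX_TLS=self,sysv,tcp mpirun -n 4 ./mpi_test_hello_usempi

# Or take UCX out of the picture altogether by falling back to Open MPI's ob1 PML
mpirun --mca pml ob1 -n 4 ./mpi_test_hello_usempi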
@boegel I just hit the exact same issue with the foss-2021b toolchain. The strange thing is that it happens on certain machines, while on others it builds smoothly.
@connorourke On which hardware were you trying to build it?
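Since it only fails on some machines, it might also be worth comparing what UCX detects on a failing node versus a working one; the commands below are standard UCX / system utilities, and whether the relevant difference shows up in their output is just a guess:

# UCX version, plus the transports and devices it detects on this node
ucx_info -v
ucx_info -d | grep -iE 'transport|device'

# Basic CPU/NUMA info for comparing failing vs. working machines
lscpu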