Cannot run large-scale MPI jobs with collectives with UCX 1.18 + OpenMPI 5 #10522
Comments
The attached log is truncated. Is it a regression (i.e. did you try to run the same app with older OMPI/UCX)?
The full log is 300MB. All processes printed the same info. I can try to extract it from one. I can run the code at this scale with UCX 1.15 and OpenMPI 4.1.4, but we had to build OpenMPI --without-verbs or it would also hang and/or segfault. The same code also runs without problems with Intel/IntelMPI 2025 (the newer LLVM-based OneAPI). I don't know what transport layer they (IntelMPI) are using, maybe pmi2. But there are no application-based memory issues with either of these other MPI distributions. I have attached the source. It's in Fortran. I have a C++ version, since these are example programs I have used to teach MPI, but I'm using the Fortran one for testing because I'm more comfortable with it. I had to rename it (sorry) since GitHub doesn't seem to allow files ending in .f90.
I have winnowed down the debug log to about 1/10 the size by confining it to one node (the job requested 80 cores on each of 10 nodes). I do not know whether this output is more helpful, but at least it seems manageable.
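For anyone who needs to cut a log like this down the same way, here is a minimal sketch of a per-node filter. It assumes each UCX log line carries a `[host:pid:tid]` tag after the timestamp; that layout, the timestamps, the PIDs, and the `other-node` hostname in the sample are all made up for illustration. Only the hostname `udc-ba38-32c0` and the error messages come from this report. The function name `lines_from_host` is hypothetical.

```python
def lines_from_host(lines, host):
    """Return a generator over the lines tagged with "[host:" (assumed UCX log prefix)."""
    tag = f"[{host}:"
    return (line for line in lines if tag in line)


# Made-up sample lines: timestamps, PIDs and "other-node" are placeholders.
sample_log = [
    "[1740000000.000001] [udc-ba38-32c0:1001:0] rcache.c:247 UCX ERROR mmap(size=151552) failed: Cannot allocate memory",
    "[1740000000.000002] [other-node:2002:0] rcache.c:247 UCX ERROR mmap(size=151552) failed: Cannot allocate memory",
    "[1740000000.000003] [udc-ba38-32c0:1002:0] array.c:44 UCX ERROR failed to grow &worker->ep_config from 0 to 32 elems of 7744 bytes",
]

kept = list(lines_from_host(sample_log, "udc-ba38-32c0"))
print(len(kept), "of", len(sample_log), "lines kept")
```

Because the function only iterates, it can be fed an open file object directly, so a 300MB log does not have to be loaded into memory.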
Some more information: I compiled OpenMPI 5.0.7 with hcoll disabled. This build ran the job, but all processes were scheduled on one core on each node. (We use Slurm, and OpenMPI was built --with-slurm.) I added an environment variable
Describe the bug
Running a simple example code that does a collective (mpi_allgather) with OpenMPI 5.0.7, I get, from each process,
rcache.c:247 UCX ERROR mmap(size=151552) failed: Cannot allocate memory
rcache.c:674 UCX ERROR Failed to allocate invalidation entry for 0x149cb0ca4000..0x149cb4000000, data corruption may occur
eventually reaching
array.c:44 UCX ERROR failed to grow &worker->ep_config from 0 to 32 elems of 7744 bytes
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
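Since every process prints the same messages, a quick tally by source location can show which failure dominates a large log. The following is a minimal sketch, assuming each message contains `<file>.c:<line>  UCX  ERROR <text>` as in the lines quoted above; the helper name `tally_ucx_errors` is hypothetical, and the sample repeats the first quoted line once to mimic output from more than one process.

```python
import re
from collections import Counter

# Assumed message layout: "<file>.c:<line>  UCX  ERROR <text>"
UCX_ERROR = re.compile(r"(?P<where>\w+\.c:\d+)\s+UCX\s+ERROR\s+(?P<msg>.*)")


def tally_ucx_errors(lines):
    """Count UCX ERROR messages per source location (file:line)."""
    counts = Counter()
    for line in lines:
        match = UCX_ERROR.search(line)
        if match:
            counts[match.group("where")] += 1
    return counts


# Lines quoted in this report; the first one is repeated to stand in
# for a second process printing the same error.
sample = [
    "rcache.c:247 UCX ERROR mmap(size=151552) failed: Cannot allocate memory",
    "rcache.c:674 UCX ERROR Failed to allocate invalidation entry for 0x149cb0ca4000..0x149cb4000000, data corruption may occur",
    "array.c:44 UCX ERROR failed to grow &worker->ep_config from 0 to 32 elems of 7744 bytes",
    "rcache.c:247 UCX ERROR mmap(size=151552) failed: Cannot allocate memory",
]

counts = tally_ucx_errors(sample)
for where, n in counts.most_common():
    print(where, n)
```

Lines that do not match the pattern (for example the SIGSEGV message from the Fortran runtime) are simply skipped.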
Steps to Reproduce
In a slurm script:
mpirun -mca btl ^uct ./mpi_ag
ucx_info -v
UCX 1.18.0
export UCX_IB_RCACHE_MAX_REGIONS="262144"
export SLURM_CPU_BIND_TYPE='cores'
export OMPI_MCA_btl='^openib'
export OMP_
Setup and versions
cat /etc/issue or cat /etc/redhat-release + uname -a
Rocky Linux release 8.9 (Green Obsidian)
Linux udc-ba38-32c0 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/mlnx-release (the string identifies software and firmware setup)
Driver version:
rpm -q rdma-core or rpm -q libibverbs
rdma-core-2307mlnx47-1.2310213.x86_64
or: MLNX_OFED version ofed_info -s
MLNX_OFED_LINUX-23.10-2.1.3.1:
HW information from ibstat or ibv_devinfo -vv command
CA 'mlx5_2'
CA type: MT4129
Number of ports: 1
Firmware version: 28.38.1002
Hardware version: 0
Node GUID: 0xa088c20300c835dc
System image GUID: 0xa088c20300c835dc
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 592
LMC: 0
SM lid: 723
Capability mask: 0xa751e848
Port GUID: 0xa088c20300c835dc
Link layer: InfiniBand
CA 'mlx5_bond_0'
CA type: MT4127
Number of ports: 1
Firmware version: 26.38.1002
Hardware version: 0
Node GUID: 0x387c760300948587
System image GUID: 0x387c760300948587
Port 1:
State: Active
Physical state: LinkUp
Rate: 25
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x3a7c76fffe948587
Link layer: Ethernet
For GPU related issues:
lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
Not building with GPU in this case
Additional information (depending on the issue)
OpenMPI version
5.0.7
This only happens for collective communications at large scales (e.g. 800 cores). Smaller jobs with collectives work (e.g. 160 processes). The nodes have NDR InfiniBand hardware.
Output of ucx_info -d to show transports and devices recognized by UCX: attached below
Configure result - config.log
This is the configure part from the Easybuild log. It is from a build without logging enabled; for the logging build, --disable-logging --disable-debug --disable-assertions would be absent and --enable-logging present.
./configure --prefix=/apps/software/standard/compiler/gcc/14.2.0/ucx/1.18.0 --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --without-go --disable-doxygen-doc --disable-logging --disable-debug --disable-assertions --disable-params-check
== 2025-02-20 15:42:00,883 run.py:703 INFO parse_log_for_error (some may be harmless) regExp (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|.?\w) found:
configure: XPMEM - failed to open the requested location (guess), guessing ...
== 2025-02-20 15:42:00,883 run.py:660 WARNING Found 1 errors in command output (output: configure: XPMEM - failed to open the requested location (guess), guessing ...)
== 2025-02-20 15:42:00,884 build_log.py:267 INFO ... (took 20 secs)
== 2025-02-20 15:42:00,884 build_log.py:267 INFO building...
== 2025-02-20 15:42:00,884 easyblock.py:3901 INFO Starting build step
== 2025-02-20 15:42:00,884 easyconfig.py:1690 INFO Generating template values...
== 2025-02-20 15:42:00,884 easyconfig.py:1709 INFO Template values: arch='x86_64', bitbucket_account='ucx', builddir='/tmp/uvacse/UCX/1.18.0/GCC-14.2.0', github_account='ucx', installdir='/apps/software/standard/compiler/gcc/14.2.0/ucx/1.18.0', module_name='ucx/1.18.0', name='UCX', nameletter='U', nameletterlower='u', namelower='ucx', parallel='40', start_dir='/tmp/uvacse/UCX/1.18.0/GCC-14.2.0/ucx-1.18.0/', toolchain_name='GCC', toolchain_version='14.2.0', version='1.18.0', version_major='1', version_major_minor='1.18', version_minor='18', versionprefix='', versionsuffix=''
== 2025-02-20 15:42:00,884 easyblock.py:3909 INFO Running method build_step part of step build
== 2025-02-20 15:42:00,884 configuremake.py:350 INFO Building target ''
== 2025-02-20 15:42:00,884 run.py:236 INFO running cmd: make -j 40 V=1
== 2025-02-20 15:43:20,735 run.py:648 INFO cmd " make -j 40 V=1" exited with exit code 0 and output:
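To make the difference between the two builds concrete, here is a minimal sketch that derives the logging-enabled configure arguments from the non-logging line above, following the description in this section: drop --disable-logging, --disable-debug and --disable-assertions, and add --enable-logging. The helper name `logging_build_args` is hypothetical, and the result is an illustration of the stated flag change rather than a copy of the actual second build log.

```python
import shlex

# Configure line as reported for the build without logging.
REPORTED_CONFIGURE = (
    "./configure --prefix=/apps/software/standard/compiler/gcc/14.2.0/ucx/1.18.0 "
    "--build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --enable-optimizations "
    "--enable-cma --enable-mt --with-verbs --without-java --without-go "
    "--disable-doxygen-doc --disable-logging --disable-debug --disable-assertions "
    "--disable-params-check"
)

# Flags the report says are absent from the logging build.
DROPPED = {"--disable-logging", "--disable-debug", "--disable-assertions"}


def logging_build_args(configure_cmd):
    """Return the argument list with the three flags dropped and --enable-logging added."""
    args = [arg for arg in shlex.split(configure_cmd) if arg not in DROPPED]
    if "--enable-logging" not in args:
        args.append("--enable-logging")
    return args


logging_args = logging_build_args(REPORTED_CONFIGURE)
print(shlex.join(logging_args))
```

All other flags, including --disable-params-check, are passed through unchanged because the report does not say they differ between the two builds.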
Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
abbreviated version attached
ucx.txt
ucx_short.log