
multi-GPU mpirun Segmentation fault issue #863

Open
liuyenfu opened this issue Mar 14, 2025 · 0 comments

Describe the bug

While

mpirun -np 2 /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin

works fine on the eight-H100 server,

mpirun -np 4 /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin

fails with a segmentation fault.

In other words, mpirun succeeds with 2 GPUs, but fails with more than 2 GPUs, as in the following example:

$ mpirun -np 4 /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin


WARNING: Could not generate an xpmem segment id for this process'
address space.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

Local host: hgpn14
Error code: 2 (No such file or directory)


LAMMPS (29 Aug 2024)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Reading data file ...
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
1 by 1 by 4 MPI processor grid
reading atoms ...
836 atoms
read_data CPU = 0.004 seconds
836 atoms in group carbon_atoms
Displacing atoms ...
Changing box ...
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
orthogonal box = (-50 -50 -1000) to (50 50 1000)
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.

  • The torch_float_dtype is: Double
  • The r_max is: 5.
  • The model has: 2 layers.
  • The MACE model atomic numbers are: 6.
  • The pair_coeff atomic numbers are: 6.
  • Mapping LAMMPS type 1 (C) to MACE type 1.
    Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
    Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
  • The torch_float_dtype is: Double
  • The torch_float_dtype is: Double
    Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
  • The r_max is: 5.
  • The r_max is: 5.
  • The model has: 2 layers.
  • The model has: 2 layers.
  • The MACE model atomic numbers are: 6.
  • The pair_coeff atomic numbers are: 6.
  • The MACE model atomic numbers are: 6.
  • The pair_coeff atomic numbers are: 6.
  • Mapping LAMMPS type 1 (C) to MACE type 1.
  • Mapping LAMMPS type 1 (C) to MACE type 1.
  • The torch_float_dtype is: Double
  • The r_max is: 5.
  • The model has: 2 layers.
  • The MACE model atomic numbers are: 6.
  • The pair_coeff atomic numbers are: 6.
  • Mapping LAMMPS type 1 (C) to MACE type 1.
    ------------------stage1------------------

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 12
ghost atom cutoff = 12
binsize = 6, bins = 17 17 334
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair mace, perpetual
attributes: full, newton on, ghost
pair build: full/bin/ghost
stencil: full/ghost/bin/3d
bin: standard
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.001
[hgpn14:2911178:0:2911178] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[hgpn14:2911181:0:2911181] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[hgpn14:2911173] 3 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[hgpn14:2911173] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
==== backtrace (tid:2911181) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000585e92 LAMMPS_NS::PairMACE::compute() tmpxft_001def16_00000000-6_pair_mace.cudafe1.cpp:0
2 0x0000000000b5360a LAMMPS_NS::Verlet::setup() ???:0
3 0x0000000000ac87ab LAMMPS_NS::Run::command() ???:0
4 0x00000000008fb550 LAMMPS_NS::Input::execute_command() ???:0
5 0x00000000008fb8f6 LAMMPS_NS::Input::file() ???:0
6 0x0000000000404cbd main() ???:0
7 0x000000000003ad85 __libc_start_main() ???:0
8 0x0000000000404e6e _start() ???:0

[hgpn14:2911181] *** Process received signal ***
[hgpn14:2911181] Signal: Segmentation fault (11)
[hgpn14:2911181] Signal code: (-6)
[hgpn14:2911181] Failing at address: 0x8da8002c6bcd
[hgpn14:2911181] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x14b60bc47cf0]
[hgpn14:2911181] [ 1] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(+0x585e92)[0x14b60c5e2e92]
[hgpn14:2911181] [ 2] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS6Verlet5setupEi+0x37a)[0x14b60cbb060a]
[hgpn14:2911181] [ 3] ==== backtrace (tid:2911178) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000585e92 LAMMPS_NS::PairMACE::compute() tmpxft_001def16_00000000-6_pair_mace.cudafe1.cpp:0
2 0x0000000000b5360a LAMMPS_NS::Verlet::setup() ???:0
3 0x0000000000ac87ab LAMMPS_NS::Run::command() ???:0
4 0x00000000008fb550 LAMMPS_NS::Input::execute_command() ???:0
5 0x00000000008fb8f6 LAMMPS_NS::Input::file() ???:0
6 0x0000000000404cbd main() ???:0
7 0x000000000003ad85 __libc_start_main() ???:0
8 0x0000000000404e6e _start() ???:0

/home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xf1b)[0x14b60cb257ab]
[hgpn14:2911181] [ 4] [hgpn14:2911178] *** Process received signal ***
[hgpn14:2911178] Signal: Segmentation fault (11)
[hgpn14:2911178] Signal code: (-6)
[hgpn14:2911178] Failing at address: 0x8da8002c6bca
[hgpn14:2911178] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x148eaa6adcf0]
[hgpn14:2911178] [ 1] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0xb00)[0x14b60c958550]
[hgpn14:2911181] [ 5] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(+0x585e92)[0x148eab048e92]
[hgpn14:2911178] [ 2] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x176)[0x14b60c9588f6]
[hgpn14:2911181] [ 6] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404cbd]
[hgpn14:2911181] [ 7] /lib64/libc.so.6(__libc_start_main+0xe5)[0x14b60b110d85]
[hgpn14:2911181] [ 8] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404e6e]
[hgpn14:2911181] *** End of error message ***
/home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS6Verlet5setupEi+0x37a)[0x148eab61660a]
[hgpn14:2911178] [ 3] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xf1b)[0x148eab58b7ab]
[hgpn14:2911178] [ 4] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0xb00)[0x148eab3be550]
[hgpn14:2911178] [ 5] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x176)[0x148eab3be8f6]
[hgpn14:2911178] [ 6] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404cbd]
[hgpn14:2911178] [ 7] /lib64/libc.so.6(__libc_start_main+0xe5)[0x148ea9b76d85]
[hgpn14:2911178] [ 8] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404e6e]
[hgpn14:2911178] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 3 with PID 0 on node hgpn14 exited on signal 11 (Segmentation fault).
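One diagnostic step worth trying (a sketch, not something from the original report): make sure each MPI rank is pinned to its own GPU before LAMMPS starts, so all four ranks do not end up sharing device 0. Open MPI exports `OMPI_COMM_WORLD_LOCAL_RANK` to every launched process, so a small wrapper script (the name `bind-gpu.sh` is hypothetical) can set `CUDA_VISIBLE_DEVICES` per rank:

```shell
#!/bin/bash
# bind-gpu.sh -- hypothetical per-rank GPU binding wrapper.
# Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK for each process it launches;
# use it to give every local rank exactly one visible GPU.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
echo "local rank $LOCAL_RANK -> GPU $CUDA_VISIBLE_DEVICES"
# Replace the current shell with the real command (lmp and its arguments).
exec "$@"
```

Usage would be, for example, `mpirun -np 4 ./bind-gpu.sh /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin`. If the 4-rank run then survives past `Setting up Verlet run`, the crash is likely a device-assignment problem rather than a bug in PairMACE::compute itself.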
