
multi-GPU mpirun Segmentation fault issue #863

Open
liuyenfu opened this issue Mar 14, 2025 · 0 comments

Describe the bug

While

mpirun -np 2 /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin

works fine on the eight-H100 server,

mpirun -np 4 /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin

fails with a segmentation fault.

In other words, mpirun succeeds with 2 GPUs, but fails with more than 2 GPUs, as in the following example:

$ mpirun -np 4 /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin


WARNING: Could not generate an xpmem segment id for this process'
address space.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

Local host: hgpn14
Error code: 2 (No such file or directory)


LAMMPS (29 Aug 2024)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Reading data file ...
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
1 by 1 by 4 MPI processor grid
reading atoms ...
836 atoms
read_data CPU = 0.004 seconds
836 atoms in group carbon_atoms
Displacing atoms ...
Changing box ...
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
orthogonal box = (-50 -50 -1000) to (50 50 1000)
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.

  • The torch_float_dtype is: Double
  • The r_max is: 5.
  • The model has: 2 layers.
  • The MACE model atomic numbers are: 6.
  • The pair_coeff atomic numbers are: 6.
  • Mapping LAMMPS type 1 (C) to MACE type 1.
    Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
    Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
  • The torch_float_dtype is: Double
  • The torch_float_dtype is: Double
    Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
  • The r_max is: 5.
  • The r_max is: 5.
  • The model has: 2 layers.
  • The model has: 2 layers.
  • The MACE model atomic numbers are: 6.
  • The pair_coeff atomic numbers are: 6.
  • The MACE model atomic numbers are: 6.
  • The pair_coeff atomic numbers are: 6.
  • Mapping LAMMPS type 1 (C) to MACE type 1.
  • Mapping LAMMPS type 1 (C) to MACE type 1.
  • The torch_float_dtype is: Double
  • The r_max is: 5.
  • The model has: 2 layers.
  • The MACE model atomic numbers are: 6.
  • The pair_coeff atomic numbers are: 6.
  • Mapping LAMMPS type 1 (C) to MACE type 1.
    ------------------stage1------------------

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 12
ghost atom cutoff = 12
binsize = 6, bins = 17 17 334
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair mace, perpetual
attributes: full, newton on, ghost
pair build: full/bin/ghost
stencil: full/ghost/bin/3d
bin: standard
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.001
[hgpn14:2911178:0:2911178] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[hgpn14:2911181:0:2911181] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[hgpn14:2911173] 3 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[hgpn14:2911173] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
==== backtrace (tid:2911181) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000585e92 LAMMPS_NS::PairMACE::compute() tmpxft_001def16_00000000-6_pair_mace.cudafe1.cpp:0
2 0x0000000000b5360a LAMMPS_NS::Verlet::setup() ???:0
3 0x0000000000ac87ab LAMMPS_NS::Run::command() ???:0
4 0x00000000008fb550 LAMMPS_NS::Input::execute_command() ???:0
5 0x00000000008fb8f6 LAMMPS_NS::Input::file() ???:0
6 0x0000000000404cbd main() ???:0
7 0x000000000003ad85 __libc_start_main() ???:0
8 0x0000000000404e6e _start() ???:0

[hgpn14:2911181] *** Process received signal ***
[hgpn14:2911181] Signal: Segmentation fault (11)
[hgpn14:2911181] Signal code: (-6)
[hgpn14:2911181] Failing at address: 0x8da8002c6bcd
[hgpn14:2911181] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x14b60bc47cf0]
[hgpn14:2911181] [ 1] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(+0x585e92)[0x14b60c5e2e92]
[hgpn14:2911181] [ 2] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS6Verlet5setupEi+0x37a)[0x14b60cbb060a]
[hgpn14:2911181] [ 3] ==== backtrace (tid:2911178) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000585e92 LAMMPS_NS::PairMACE::compute() tmpxft_001def16_00000000-6_pair_mace.cudafe1.cpp:0
2 0x0000000000b5360a LAMMPS_NS::Verlet::setup() ???:0
3 0x0000000000ac87ab LAMMPS_NS::Run::command() ???:0
4 0x00000000008fb550 LAMMPS_NS::Input::execute_command() ???:0
5 0x00000000008fb8f6 LAMMPS_NS::Input::file() ???:0
6 0x0000000000404cbd main() ???:0
7 0x000000000003ad85 __libc_start_main() ???:0
8 0x0000000000404e6e _start() ???:0

/home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xf1b)[0x14b60cb257ab]
[hgpn14:2911181] [ 4] [hgpn14:2911178] *** Process received signal ***
[hgpn14:2911178] Signal: Segmentation fault (11)
[hgpn14:2911178] Signal code: (-6)
[hgpn14:2911178] Failing at address: 0x8da8002c6bca
[hgpn14:2911178] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x148eaa6adcf0]
[hgpn14:2911178] [ 1] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0xb00)[0x14b60c958550]
[hgpn14:2911181] [ 5] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(+0x585e92)[0x148eab048e92]
[hgpn14:2911178] [ 2] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x176)[0x14b60c9588f6]
[hgpn14:2911181] [ 6] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404cbd]
[hgpn14:2911181] [ 7] /lib64/libc.so.6(__libc_start_main+0xe5)[0x14b60b110d85]
[hgpn14:2911181] [ 8] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404e6e]
[hgpn14:2911181] *** End of error message ***
/home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS6Verlet5setupEi+0x37a)[0x148eab61660a]
[hgpn14:2911178] [ 3] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xf1b)[0x148eab58b7ab]
[hgpn14:2911178] [ 4] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0xb00)[0x148eab3be550]
[hgpn14:2911178] [ 5] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x176)[0x148eab3be8f6]
[hgpn14:2911178] [ 6] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404cbd]
[hgpn14:2911178] [ 7] /lib64/libc.so.6(__libc_start_main+0xe5)[0x148ea9b76d85]
[hgpn14:2911178] [ 8] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404e6e]
[hgpn14:2911178] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 3 with PID 0 on node hgpn14 exited on signal 11 (Segmentation fault).
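One diagnostic step worth trying (a sketch, not something from the original report): make sure each MPI rank is pinned to its own GPU before LAMMPS starts, so all four ranks do not end up sharing device 0. Open MPI exports `OMPI_COMM_WORLD_LOCAL_RANK` to every launched process, so a small wrapper script (the name `bind-gpu.sh` is hypothetical) can set `CUDA_VISIBLE_DEVICES` per rank:

```shell
#!/bin/bash
# bind-gpu.sh -- hypothetical per-rank GPU binding wrapper.
# Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK for each process it launches;
# use it to give every local rank exactly one visible GPU.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
echo "local rank $LOCAL_RANK -> GPU $CUDA_VISIBLE_DEVICES"
# Replace the current shell with the real command (lmp and its arguments).
exec "$@"
```

Usage would be, for example, `mpirun -np 4 ./bind-gpu.sh /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin`. If the 4-rank run then survives past `Setting up Verlet run`, the crash is likely a device-assignment problem rather than a bug in PairMACE::compute itself.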
