WARNING: Could not generate an xpmem segment id for this process'
address space.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: hgpn14
Error code: 2 (No such file or directory)
LAMMPS (29 Aug 2024)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Reading data file ...
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
1 by 1 by 4 MPI processor grid
reading atoms ...
836 atoms
read_data CPU = 0.004 seconds
836 atoms in group carbon_atoms
Displacing atoms ...
Changing box ...
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
orthogonal box = (-50 -50 -1000) to (50 50 1000)
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
The torch_float_dtype is: Double
The r_max is: 5.
The model has: 2 layers.
The MACE model atomic numbers are: 6.
The pair_coeff atomic numbers are: 6.
Mapping LAMMPS type 1 (C) to MACE type 1.
Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
The torch_float_dtype is: Double
The torch_float_dtype is: Double
Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
The r_max is: 5.
The r_max is: 5.
The model has: 2 layers.
The model has: 2 layers.
The MACE model atomic numbers are: 6.
The pair_coeff atomic numbers are: 6.
The MACE model atomic numbers are: 6.
The pair_coeff atomic numbers are: 6.
Mapping LAMMPS type 1 (C) to MACE type 1.
Mapping LAMMPS type 1 (C) to MACE type 1.
The torch_float_dtype is: Double
The r_max is: 5.
The model has: 2 layers.
The MACE model atomic numbers are: 6.
The pair_coeff atomic numbers are: 6.
Mapping LAMMPS type 1 (C) to MACE type 1.
------------------stage1------------------
Describe the bug
On a server with eight H100 GPUs, running with 2 MPI ranks works:
mpirun -np 2 /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin
Running with 4 MPI ranks fails with a segmentation fault:
mpirun -np 4 /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin
In short, mpirun succeeds with 2 GPUs but fails with more than 2 GPUs, as in the following example:
$ mpirun -np 4 /home/liuyenfu1022/lammps-mace/build-hopper/lmp -in in.langevin
WARNING: Could not generate an xpmem segment id for this process'
address space.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: hgpn14
Error code: 2 (No such file or directory)
LAMMPS (29 Aug 2024)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Reading data file ...
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
1 by 1 by 4 MPI processor grid
reading atoms ...
836 atoms
read_data CPU = 0.004 seconds
836 atoms in group carbon_atoms
Displacing atoms ...
Changing box ...
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
orthogonal box = (-50 -50 -0.16266) to (50 50 203.80766)
orthogonal box = (-50 -50 -1000) to (50 50 1000)
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
CUDA found, setting device type to torch::kCUDA.
Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
Loading MACE model from "./MACE-Carbon-L2-5.0A-f128-bch2-processed_data-5.0A-train-CA-9-val-frac0.6-test_images-E0-average_stagetwo.model-lammps.pt" ... finished.
------------------stage1------------------
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Your simulation uses code contributions which should be cited:
The log file lists these citations in BibTeX format.
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 12
ghost atom cutoff = 12
binsize = 6, bins = 17 17 334
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair mace, perpetual
attributes: full, newton on, ghost
pair build: full/bin/ghost
stencil: full/ghost/bin/3d
bin: standard
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.001
[hgpn14:2911178:0:2911178] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[hgpn14:2911181:0:2911181] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[hgpn14:2911173] 3 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[hgpn14:2911173] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
==== backtrace (tid:2911181) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000585e92 LAMMPS_NS::PairMACE::compute() tmpxft_001def16_00000000-6_pair_mace.cudafe1.cpp:0
2 0x0000000000b5360a LAMMPS_NS::Verlet::setup() ???:0
3 0x0000000000ac87ab LAMMPS_NS::Run::command() ???:0
4 0x00000000008fb550 LAMMPS_NS::Input::execute_command() ???:0
5 0x00000000008fb8f6 LAMMPS_NS::Input::file() ???:0
6 0x0000000000404cbd main() ???:0
7 0x000000000003ad85 __libc_start_main() ???:0
8 0x0000000000404e6e _start() ???:0
[hgpn14:2911181] *** Process received signal ***
[hgpn14:2911181] Signal: Segmentation fault (11)
[hgpn14:2911181] Signal code: (-6)
[hgpn14:2911181] Failing at address: 0x8da8002c6bcd
[hgpn14:2911181] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x14b60bc47cf0]
[hgpn14:2911181] [ 1] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(+0x585e92)[0x14b60c5e2e92]
[hgpn14:2911181] [ 2] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS6Verlet5setupEi+0x37a)[0x14b60cbb060a]
[hgpn14:2911181] [ 3] ==== backtrace (tid:2911178) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000585e92 LAMMPS_NS::PairMACE::compute() tmpxft_001def16_00000000-6_pair_mace.cudafe1.cpp:0
2 0x0000000000b5360a LAMMPS_NS::Verlet::setup() ???:0
3 0x0000000000ac87ab LAMMPS_NS::Run::command() ???:0
4 0x00000000008fb550 LAMMPS_NS::Input::execute_command() ???:0
5 0x00000000008fb8f6 LAMMPS_NS::Input::file() ???:0
6 0x0000000000404cbd main() ???:0
7 0x000000000003ad85 __libc_start_main() ???:0
8 0x0000000000404e6e _start() ???:0
/home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xf1b)[0x14b60cb257ab]
[hgpn14:2911181] [ 4] [hgpn14:2911178] *** Process received signal ***
[hgpn14:2911178] Signal: Segmentation fault (11)
[hgpn14:2911178] Signal code: (-6)
[hgpn14:2911178] Failing at address: 0x8da8002c6bca
[hgpn14:2911178] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x148eaa6adcf0]
[hgpn14:2911178] [ 1] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0xb00)[0x14b60c958550]
[hgpn14:2911181] [ 5] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(+0x585e92)[0x148eab048e92]
[hgpn14:2911178] [ 2] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x176)[0x14b60c9588f6]
[hgpn14:2911181] [ 6] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404cbd]
[hgpn14:2911181] [ 7] /lib64/libc.so.6(__libc_start_main+0xe5)[0x14b60b110d85]
[hgpn14:2911181] [ 8] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404e6e]
[hgpn14:2911181] *** End of error message ***
/home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS6Verlet5setupEi+0x37a)[0x148eab61660a]
[hgpn14:2911178] [ 3] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xf1b)[0x148eab58b7ab]
[hgpn14:2911178] [ 4] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0xb00)[0x148eab3be550]
[hgpn14:2911178] [ 5] /home/liuyenfu1022/lammps-mace/build-hopper/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x176)[0x148eab3be8f6]
[hgpn14:2911178] [ 6] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404cbd]
[hgpn14:2911178] [ 7] /lib64/libc.so.6(__libc_start_main+0xe5)[0x148ea9b76d85]
[hgpn14:2911178] [ 8] /home/liuyenfu1022/lammps-mace/build-hopper/lmp[0x404e6e]
[hgpn14:2911178] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 3 with PID 0 on node hgpn14 exited on signal 11 (Segmentation fault).
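
One thing that might be worth checking, given the log above: the box is stretched to z = -1000 .. 1000 and split into a 1 by 1 by 4 processor grid, while the data file box only spans z = -0.16266 .. 203.80766. The sketch below is a rough diagnostic, not a confirmed explanation of the crash. It estimates which z slabs would contain no atoms. It assumes the atoms still lie within the original data-file z range after the displacement step (the log does not show this), that the slabs are uniform, and that the 2-rank run is also split along z. The helper names are hypothetical and not part of LAMMPS or MACE.

```python
def z_slabs(zlo, zhi, nprocs_z):
    """Split [zlo, zhi] into nprocs_z equal slabs (uniform 1 x 1 x N grid assumed)."""
    width = (zhi - zlo) / nprocs_z
    return [(zlo + i * width, zlo + (i + 1) * width) for i in range(nprocs_z)]


def slabs_without_atoms(slabs, atom_zmin, atom_zmax):
    """Return indices of slabs that do not overlap the atoms' z extent."""
    return [
        i for i, (lo, hi) in enumerate(slabs)
        if hi <= atom_zmin or lo >= atom_zmax
    ]


# Box bounds after "Changing box ..." in the log.
box_zlo, box_zhi = -1000.0, 1000.0

# ASSUMPTION: atoms remain inside the original data-file z range.
# The log does not report positions after "Displacing atoms ...".
atom_zmin, atom_zmax = -0.16266, 203.80766

for nprocs_z in (2, 4):
    empty = slabs_without_atoms(
        z_slabs(box_zlo, box_zhi, nprocs_z), atom_zmin, atom_zmax
    )
    print(f"{nprocs_z} slabs along z: slabs without atoms = {empty}")
```

Under these assumptions, no slab is empty with 2 ranks, while slabs 0 and 3 are empty with 4 ranks. If that matches the real atom positions, a useful test would be to rerun with a processor layout in which every rank owns some atoms and see whether the segmentation fault persists.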