This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
test_run_allreduce_mpi_selected_ucx_tl.log
85 lines (84 loc) · 5.97 KB
/
test_run_allreduce_mpi_selected_ucx_tl.log
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
/# mpirun --allow-run-as-root -np 2 -H <rocm_ip>,<cuda_ip> -mca pml ucx -mca coll_ucc_enable 1 -mca coll_ucc_priority 100 -mca coll_ucc_verbose 3 -mca pml_ucx_verbose 3 -mca pml_ucx_tls tcp -mca pml_ucx_devices ens4 /test_allreduce
[ip-cuda:00498] common_ucx.c:183 using OPAL memory hooks as external events
[ip-cuda:00498] pml_ucx.c:207 mca_pml_ucx_open: UCX version 1.18.0
[ip-rocm:01577] common_ucx.c:183 using OPAL memory hooks as external events
[ip-rocm:01577] pml_ucx.c:207 mca_pml_ucx_open: UCX version 1.18.0
[ip-rocm:01577] common_ucx.c:342 self/memory: did not match transport list
[ip-rocm:01577] common_ucx.c:227 readlink(/sys/class/infiniband/ens5/device/driver) failed: No such file or directory
[ip-rocm:01577] common_ucx.c:337 tcp/ens5: matched transport list but not device list
[ip-rocm:01577] common_ucx.c:227 readlink(/sys/class/infiniband/lo/device/driver) failed: No such file or directory
[ip-rocm:01577] common_ucx.c:337 tcp/lo: matched transport list but not device list
[ip-rocm:01577] common_ucx.c:342 sysv/memory: did not match transport list
[ip-rocm:01577] common_ucx.c:342 posix/memory: did not match transport list
[ip-rocm:01577] common_ucx.c:342 rocm_copy/rocm_cpy: did not match transport list
[ip-rocm:01577] common_ucx.c:342 rocm_ipc/rocm_ipc: did not match transport list
[ip-rocm:01577] common_ucx.c:342 cma/memory: did not match transport list
[ip-rocm:01577] common_ucx.c:347 support level is transports only
[ip-rocm:01577] pml_ucx.c:293 mca_pml_ucx_init
[ip-rocm:01577] pml_ucx.c:124 Pack remote worker address, size 82
[ip-rocm:01577] pml_ucx.c:124 Pack local worker address, size 302
[ip-rocm:01577] pml_ucx.c:358 created ucp context 0x1d40060, worker 0x1d7cd90
[ip-rocm:01577] pml_ucx_component.c:147 returning priority 19
[ip-cuda:00498] common_ucx.c:342 self/memory: did not match transport list
[ip-cuda:00498] common_ucx.c:227 readlink(/sys/class/infiniband/ens5/device/driver) failed: No such file or directory
[ip-cuda:00498] common_ucx.c:337 tcp/ens5: matched transport list but not device list
[ip-cuda:00498] common_ucx.c:227 readlink(/sys/class/infiniband/lo/device/driver) failed: No such file or directory
[ip-cuda:00498] common_ucx.c:337 tcp/lo: matched transport list but not device list
[ip-cuda:00498] common_ucx.c:342 sysv/memory: did not match transport list
[ip-cuda:00498] common_ucx.c:342 posix/memory: did not match transport list
[ip-cuda:00498] common_ucx.c:342 cuda_copy/cuda: did not match transport list
[ip-cuda:00498] common_ucx.c:342 cuda_ipc/cuda: did not match transport list
[ip-cuda:00498] common_ucx.c:342 cma/memory: did not match transport list
[ip-cuda:00498] common_ucx.c:347 support level is transports only
[ip-cuda:00498] pml_ucx.c:293 mca_pml_ucx_init
[ip-cuda:00498] pml_ucx.c:124 Pack remote worker address, size 82
[ip-cuda:00498] pml_ucx.c:124 Pack local worker address, size 300
[ip-cuda:00498] pml_ucx.c:358 created ucp context 0x5dca8e01a9f0, worker 0x5dca8e092860
[ip-cuda:00498] pml_ucx_component.c:147 returning priority 19
[ip-cuda:00498] pml_ucx.c:192 Got proc 1 address, size 300
[ip-cuda:00498] pml_ucx.c:423 connecting to proc. 1
[ip-rocm:01577] pml_ucx.c:192 Got proc 0 address, size 302
[ip-rocm:01577] pml_ucx.c:423 connecting to proc. 0
[ip-rocm:01577] pml_ucx.c:192 Got proc 1 address, size 82
[ip-rocm:01577] pml_ucx.c:423 connecting to proc. 1
[ip-cuda:00498] pml_ucx.c:192 Got proc 0 address, size 82
[ip-cuda:00498] pml_ucx.c:423 connecting to proc. 0
[ip-cuda:00498] coll_ucc_module.c:383 - mca_coll_ucc_init_ctx() initialized ucc context
[ip-cuda:00498] coll_ucc_module.c:469 - mca_coll_ucc_module_enable() creating ucc_team for comm 0x5dca8d722880, comm_id 0, comm_size 2
[ip-rocm:01577] coll_ucc_module.c:383 - mca_coll_ucc_init_ctx() initialized ucc context
[ip-rocm:01577] coll_ucc_module.c:469 - mca_coll_ucc_module_enable() creating ucc_team for comm 0x2045a0, comm_id 0, comm_size 2
ROCm Rank 0
CUDA Rank 1
[ip-cuda:00498] coll_ucc_allreduce.c:69 - mca_coll_ucc_allreduce() running ucc allreduce
[ip-rocm:01577] coll_ucc_allreduce.c:69 - mca_coll_ucc_allreduce() running ucc allreduce
[ip-rocm:1577 :0:1577] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f175e601020)
==== backtrace (tid: 1577) ====
0 /usr/lib/libucs.so.0(ucs_handle_error+0x2dc) [0x7f18766d4aac]
1 /usr/lib/libucs.so.0(+0x3cc8f) [0x7f18766d4c8f]
2 /usr/lib/libucs.so.0(+0x3cfc4) [0x7f18766d4fc4]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f187696a520]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x1a0977) [0x7f1876ac8977]
5 /usr/lib/ucx/libuct_rocm.so.0(uct_rocm_copy_ep_put_short+0x37) [0x7f186cccd7c7]
6 /usr/lib/libucp.so.0(ucp_mem_type_unpack+0x17a) [0x7f187679e57a]
7 /usr/lib/libucp.so.0(ucp_eager_only_handler+0xfa6) [0x7f187681fee6]
8 /usr/lib/libuct.so.0(+0x29f2e) [0x7f186d212f2e]
9 /usr/lib/libuct.so.0(+0x2ac78) [0x7f186d213c78]
10 /usr/lib/libuct.so.0(+0x2eda4) [0x7f186d217da4]
11 /usr/lib/libucs.so.0(ucs_event_set_wait+0x141) [0x7f18766e6951]
12 /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x90) [0x7f186d217e90]
13 /usr/lib/libucs.so.0(+0x2cf27) [0x7f18766c4f27]
14 /usr/lib/libucs.so.0(+0x2d47b) [0x7f18766c547b]
15 /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7f1876797bda]
16 /usr/lib/ucc/libucc_tl_ucp.so(ucc_tl_ucp_reduce_scatter_knomial_progress+0x225) [0x7f176b49a6c5]
17 /usr/lib/ucc/libucc_tl_ucp.so(ucc_tl_ucp_reduce_scatter_knomial_start+0x190) [0x7f176b49c230]
18 /usr/lib/libucc.so.1(ucc_dependency_handler+0x5c) [0x7f18769026dc]
19 /usr/lib/libucc.so.1(ucc_event_manager_notify+0x63) [0x7f18769013b3]
20 /usr/lib/libucc.so.1(ucc_event_manager_notify+0x63) [0x7f18769013b3]
21 /usr/lib/libucc.so.1(ucc_collective_post+0xab) [0x7f18768fe33b]
22 /usr/lib/libmpi.so.40(mca_coll_ucc_allreduce+0xf2) [0x7f18788948e2]
23 /usr/lib/libmpi.so.40(PMPI_Allreduce+0x104) [0x7f1878807e44]
24 /test_allreduce() [0x201f8c]
25 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f1876951d90]
26 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f1876951e40]
27 /test_allreduce() [0x201d95]
=================================