-
Notifications
You must be signed in to change notification settings - Fork 0
/
bistabcg-np2_old_nccl.txt
304 lines (297 loc) · 13.7 KB
/
bistabcg-np2_old_nccl.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
HOME:/home/kfutfd/qcu
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- The CUDA compiler identification is NVIDIA 12.4.131
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found MPI_C: /home/kfutfd/external-libraries/openmpi-4.1.5/lib/libmpi.so (found version "3.1")
-- Found MPI_CXX: /home/kfutfd/external-libraries/openmpi-4.1.5/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Configuring done
-- Generating done
-- Build files have been written to: /home/kfutfd/qcu
[ 10%] Building CUDA object CMakeFiles/qcu.dir/src/bistabcg.cu.o
[ 30%] Building CUDA object CMakeFiles/qcu.dir/src/draft.cu.o
[ 30%] Building CUDA object CMakeFiles/qcu.dir/src/clover_dslash.cu.o
[ 40%] Building CUDA object CMakeFiles/qcu.dir/src/lattice_cuda.cu.o
[ 50%] Building CUDA object CMakeFiles/qcu.dir/src/multgrid.cu.o
[ 80%] Building CUDA object CMakeFiles/qcu.dir/src/nccl_wilson_bistabcg.cu.o
[ 80%] Building CUDA object CMakeFiles/qcu.dir/src/nccl_wilson_dslash.cu.o
[ 80%] Building CUDA object CMakeFiles/qcu.dir/src/lattice_mpi.cu.o
[ 90%] Building CUDA object CMakeFiles/qcu.dir/src/wilson_dslash.cu.o
[100%] Linking CUDA shared library libqcu.so
[100%] Built target qcu
rm: cannot remove '.': Is a directory
rm: cannot remove '..': Is a directory
rm: cannot remove '.git': Is a directory
fatal: pathspec '"clangd.format.tabSize": 4' did not match any files
fatal: pathspec '.gitmodules' did not match any files
~/qcu/test ~/qcu
==989724== NVPROF is profiling process 989724, command: python ./test.nccl.bistabcg.qcu-np2.py
==989723== NVPROF is profiling process 989723, command: python ./test.nccl.bistabcg.qcu-np2.py
==989723== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
==989724== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
Intra-node (non peer-to-peer) enabled for rank 0 (gpu=0) with neighbor 1 (gpu=1) dir=0, dim=0
Intra-node (non peer-to-peer) enabled for rank 0 (gpu=0) with neighbor 1 (gpu=1) dir=1, dim=0
Intra-node (non peer-to-peer) enabled for rank 1 (gpu=1) with neighbor 0 (gpu=0) dir=0, dim=0
Intra-node (non peer-to-peer) enabled for rank 1 (gpu=1) with neighbor 0 (gpu=0) dir=1, dim=0
QUDA 1.1.0 (git 1.1.0--sm_60)
CUDA Driver version = 12020
CUDA Runtime version = 12040
Found device 0: Tesla P100-PCIE-16GB
Found device 1: Tesla P100-PCIE-16GB
Using device 0: Tesla P100-PCIE-16GB
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
Loaded 16 sets of cached parameters from .cache/tunecache.tsv
Loaded 16 sets of cached parameters from .cache/tunecache.tsv
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
==989724== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
===============round 0 ======================
cublasCreated successfully
==989723== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
===============round 0 ======================
Creating Gaussian distrbuted Lie group field with sigma = 1.000000e-01
node_rank :1
node_size :2
gridDim.x :2048
blockDim.x :256
lat_1dim[_X_] :8
lat_1dim[_Y_] :32
lat_1dim[_Z_] :32
lat_1dim[_T_] :64
grid_1dim[_X_] :2
grid_1dim[_Y_] :1
grid_1dim[_Z_] :1
grid_1dim[_T_] :1
lat_3dim[_YZT_] :65536
lat_3dim[_XZT_] :16384
lat_3dim[_XYT_] :16384
lat_3dim[_XYZ_] :8192
lat_4dim :524288
lat_4dim_SC :6291456
lat_3dim_Half_SC[_YZT_] :393216
lat_3dim_Half_SC[_XZT_] :98304
lat_3dim_Half_SC[_XYT_] :98304
lat_3dim_Half_SC[_XYZ_] :49152
node_rank :0
node_size :2
gridDim.x :2048
blockDim.x :256
lat_1dim[_X_] :8
lat_1dim[_Y_] :32
lat_1dim[_Z_] :32
lat_1dim[_T_] :64
grid_1dim[_X_] :2
grid_1dim[_Y_] :1
grid_1dim[_Z_] :1
grid_1dim[_T_] :1
lat_3dim[_YZT_] :65536
lat_3dim[_XZT_] :16384
lat_3dim[_XYT_] :16384
lat_3dim[_XYZ_] :8192
lat_4dim :524288
lat_4dim_SC :6291456
lat_3dim_Half_SC[_YZT_] :393216
lat_3dim_Half_SC[_XZT_] :98304
lat_3dim_Half_SC[_XYT_] :98304
lat_3dim_Half_SC[_XYZ_] :49152
lat_3dim_SC[_YZT_]:786432
lat_3dim_SC[_XZT_]:196608
lat_3dim_SC[_XYT_]:196608
lat_3dim_SC[_YZT_]:786432
lat_3dim_SC[_XZT_]:196608
lat_3dim_SC[_XYT_]:196608
lat_3dim_SC[_XYZ_]:98304
lat_3dim_SC[_XYZ_]:98304
##RANK:1##LOOP:0##Residual:44449.3
##RANK:0##LOOP:0##Residual:44449.3
##RANK:1##LOOP:1##Residual:8299.01
##RANK:0##LOOP:1##Residual:8299.01
##RANK:1##LOOP:2##Residual:200371
##RANK:0##LOOP:2##Residual:200371
##RANK:1##LOOP:3##Residual:3.58229e+06
##RANK:0##LOOP:3##Residual:3.58229e+06
##RANK:1##LOOP:4##Residual:517.569
##RANK:0##LOOP:4##Residual:517.569
##RANK:1##LOOP:5##Residual:127.916
##RANK:0##LOOP:5##Residual:127.916
##RANK:1##LOOP:6##Residual:93.7209
##RANK:0##LOOP:6##Residual:93.7209
##RANK:1##LOOP:7##Residual:16.9229
##RANK:0##LOOP:7##Residual:16.9229
##RANK:1##LOOP:8##Residual:10.6846
##RANK:0##LOOP:8##Residual:10.6846
##RANK:1##LOOP:9##Residual:5.7094
##RANK:0##LOOP:9##Residual:5.7094
##RANK:1##LOOP:10##Residual:3.25682
##RANK:0##LOOP:10##Residual:3.25682
##RANK:1##LOOP:11##Residual:2.51426
##RANK:0##LOOP:11##Residual:2.51426
##RANK:1##LOOP:12##Residual:1.55529
##RANK:0##LOOP:12##Residual:1.55529
##RANK:1##LOOP:13##Residual:0.809241
##RANK:0##LOOP:13##Residual:0.809241
##RANK:1##LOOP:14##Residual:1.09693
##RANK:0##LOOP:14##Residual:1.09693
##RANK:1##LOOP:15##Residual:4.94594
##RANK:0##LOOP:15##Residual:4.94594
##RANK:1##LOOP:16##Residual:0.78322
##RANK:0##LOOP:16##Residual:0.78322
##RANK:1##LOOP:17##Residual:1.24827
##RANK:0##LOOP:17##Residual:1.24827
##RANK:1##LOOP:18##Residual:7.19161
##RANK:0##LOOP:18##Residual:7.19161
##RANK:1##LOOP:19##Residual:20.2739
##RANK:0##LOOP:19##Residual:20.2739
##RANK:1##LOOP:20##Residual:0.247403
##RANK:0##LOOP:20##Residual:0.247403
##RANK:1##LOOP:21##Residual:0.0957543
##RANK:0##LOOP:21##Residual:0.0957543
##RANK:1##LOOP:22##Residual:0.0262302
##RANK:0##LOOP:22##Residual:0.0262302
##RANK:1##LOOP:23##Residual:0.0160813
##RANK:0##LOOP:23##Residual:0.0160813
##RANK:1##LOOP:24##Residual:0.0124861
##RANK:0##LOOP:24##Residual:0.0124861
##RANK:1##LOOP:25##Residual:0.0108439
##RANK:0##LOOP:25##Residual:0.0108439
##RANK:1##LOOP:26##Residual:0.00867592
##RANK:0##LOOP:26##Residual:0.00867592
##RANK:1##LOOP:27##Residual:0.00809997
##RANK:0##LOOP:27##Residual:0.00809997
##RANK:1##LOOP:28##Residual:0.00989782
##RANK:0##LOOP:28##Residual:0.00989782
##RANK:1##LOOP:29##Residual:0.00427806
##RANK:0##LOOP:29##Residual:0.00427806
##RANK:1##LOOP:30##Residual:0.00329498
##RANK:0##LOOP:30##Residual:0.00329498
##RANK:1##LOOP:31##Residual:0.00275855
##RANK:0##LOOP:31##Residual:0.00275855
##RANK:1##LOOP:32##Residual:0.00240199
##RANK:0##LOOP:32##Residual:0.00240199
##RANK:1##LOOP:33##Residual:0.00210661
##RANK:0##LOOP:33##Residual:0.00210661
##RANK:1##LOOP:34##Residual:0.00954212
##RANK:0##LOOP:34##Residual:0.00954212
##RANK:1##LOOP:35##Residual:0.00197213
##RANK:0##LOOP:35##Residual:0.00197213
##RANK:1##LOOP:36##Residual:0.00259008
##RANK:0##LOOP:36##Residual:0.00259008
##RANK:1##LOOP:37##Residual:0.00184452
##RANK:0##LOOP:37##Residual:0.00184452
##RANK:1##LOOP:38##Residual:0.00244576
##RANK:0##LOOP:38##Residual:0.00244576
##RANK:1##LOOP:39##Residual:0.00123193
##RANK:0##LOOP:39##Residual:0.00123193
##RANK:1##LOOP:40##Residual:0.00104646
##RANK:0##LOOP:40##Residual:0.00104646
##RANK:1##LOOP:41##Residual:0.000968751
##RANK:0##LOOP:41##Residual:0.000968751
##RANK:1##LOOP:42##Residual:0.000836163
##RANK:0##LOOP:42##Residual:0.000836163
##RANK:1##LOOP:43##Residual:0.000842265
##RANK:0##LOOP:43##Residual:0.000842265
##RANK:1##LOOP:44##Residual:0.000669099
##RANK:0##LOOP:44##Residual:0.000669099
##RANK:1##LOOP:45##Residual:9.55699e-05
##RANK:0##LOOP:45##Residual:9.55699e-05
##RANK:1##LOOP:46##Residual:8.52666e-05
##RANK:0##LOOP:46##Residual:8.52666e-05
##RANK:1##LOOP:47##Residual:5.6289e-05
##RANK:0##LOOP:47##Residual:5.6289e-05
##RANK:1##LOOP:48##Residual:3.86423e-05
##RANK:0##LOOP:48##Residual:3.86423e-05
##RANK:1##LOOP:49##Residual:3.72799e-05
##RANK:0##LOOP:49##Residual:3.72799e-05
##RANK:1##LOOP:50##Residual:3.52888e-05
##RANK:0##LOOP:50##Residual:3.52888e-05
##RANK:1##LOOP:51##Residual:3.30243e-05
##RANK:0##LOOP:51##Residual:3.30243e-05
##RANK:1##LOOP:52##Residual:2.49218e-05
##RANK:0##LOOP:52##Residual:2.49218e-05
##RANK:1##LOOP:53##Residual:2.04535e-05
##RANK:0##LOOP:53##Residual:2.04535e-05
##RANK:1##LOOP:54##Residual:1.92412e-05
##RANK:0##LOOP:54##Residual:1.92412e-05
##RANK:1##LOOP:55##Residual:1.58756e-05
##RANK:0##LOOP:55##Residual:1.58756e-05
##RANK:1##LOOP:56##Residual:1.70427e-05
##RANK:0##LOOP:56##Residual:1.70427e-05
##RANK:1##LOOP:57##Residual:7.26063e-06
##RANK:0##LOOP:57##Residual:7.26063e-06
##RANK:1##LOOP:58##Residual:1.076e-05
##RANK:0##LOOP:58##Residual:1.076e-05
##RANK:1##LOOP:59##Residual:9.92769e-06
##RANK:0##LOOP:59##Residual:9.92769e-06
##RANK:1##LOOP:60##Residual:4.39433e-06
##RANK:0##LOOP:60##Residual:4.39433e-06
##RANK:1##LOOP:61##Residual:2.62005e-06
##RANK:0##LOOP:61##Residual:2.62005e-06
##RANK:1##LOOP:62##Residual:2.92632e-06
##RANK:0##LOOP:62##Residual:2.92632e-06
##RANK:1##LOOP:63##Residual:3.14406e-06
##RANK:0##LOOP:63##Residual:3.14406e-06
##RANK:1##LOOP:64##Residual:5.78504e-06
##RANK:0##LOOP:64##Residual:5.78504e-06
##RANK:1##LOOP:65##Residual:2.44945e-06
##RANK:0##LOOP:65##Residual:2.44945e-06
##RANK:1##LOOP:66##Residual:1.12525e-06
##RANK:0##LOOP:66##Residual:1.12525e-06
##RANK:1##LOOP:67##Residual:6.8871e-07
##RANK:0##LOOP:67##Residual:6.8871e-07
nccl wilson bistabcg total time: (without malloc free memcpy) :10.331393189 sec
nccl wilson bistabcg total time: (without malloc free memcpy) :10.331463702 sec
## difference: 0.0000000000740362
## difference: 0.0000000000740362
QCU bistabcg: 11.762558162001369 sec
QCU bistabcg: 11.76105322699732 sec
WARNING: Environment variable QUDA_PROFILE_OUTPUT_BASE not set; writing to profile.tsv and profile_async.tsv
Saving 16 sets of cached parameters to .cache/profile_0.tsv
Saving 0 sets of cached profiles to .cache/profile_async_0.tsv
initQuda Total time = 1.412 secs
init = 1.412 secs ( 99.999%), with 2 calls at 7.058e+05 us per call
total accounted = 1.412 secs ( 99.999%)
total missing = 0.000 secs ( 0.001%)
loadGaugeQuda Total time = 0.109 secs
download = 0.073 secs ( 66.451%), with 2 calls at 3.633e+04 us per call
upload = 0.013 secs ( 12.017%), with 1 calls at 1.314e+04 us per call
init = 0.005 secs ( 4.917%), with 2 calls at 2.688e+03 us per call
compute = 0.018 secs ( 16.512%), with 2 calls at 9.028e+03 us per call
free = 0.000 secs ( 0.032%), with 2 calls at 1.750e+01 us per call
total accounted = 0.109 secs ( 99.929%)
total missing = 0.000 secs ( 0.071%)
endQuda Total time = 0.033 secs
initQuda-endQuda Total time = 14.679 secs
QUDA Total time = 1.610 secs
download = 0.073 secs ( 4.514%), with 2 calls at 3.633e+04 us per call
upload = 0.013 secs ( 0.816%), with 1 calls at 1.314e+04 us per call
init = 1.417 secs ( 88.030%), with 4 calls at 3.543e+05 us per call
compute = 0.074 secs ( 4.566%), with 3 calls at 2.450e+04 us per call
free = 0.000 secs ( 0.002%), with 2 calls at 1.800e+01 us per call
total accounted = 1.576 secs ( 97.929%)
total missing = 0.033 secs ( 2.071%)
Device memory used = 1406.2 MiB
Pinned device memory used = 0.0 MiB
Managed memory used = 0.0 MiB
Shmem memory used = 0.0 MiB
Page-locked host memory used = 54.0 MiB
Total host memory used >= 74.3 MiB
==989724== Generated result file: /home/kfutfd/qcu/test/log_x99-Yang_989724.nvvp
==989723== Generated result file: /home/kfutfd/qcu/test/log_x99-Yang_989723.nvvp
~/qcu