*-*-*-*-*-*-*-*-*
| | | | | | | | |
*-*-*-neve-*-*-*-
| | | | | | | | |
*-*-*-*-*-*-*-*-*
*****
About
*****
`neve` is a microbenchmark that distributes a graph (real-world or synthetic) across
processes and then performs variable-sized message exchanges (sized by the number
of ghost vertices shared between processes) between process neighbors, reporting
bandwidth and latency. Further, neve provides options to shrink a real-world graph (by a
percentage of the ghost vertices shared across processes), study the performance of
message exchanges for a specific process neighborhood, analyze the impact of an
edge-balanced real-world graph distribution, and generate a graph in memory. neve
follows the best practices of existing MPI microbenchmarks.
We require the total number of processes to be a power of 2 and the total number of
vertices to be perfectly divisible by the number of processes when the parallel RGG
generation options are used. This constraint does not apply to real-world graphs passed
to neve.
neve has an in-memory random geometric graph generator (just like our other mini-application,
miniVite[?]) that can be used for weak-scaling analysis. An n-D random geometric graph (RGG)
is generated by randomly placing N vertices in an n-D space and connecting pairs of vertices
whose Euclidean distance is less than or equal to d. We only consider 2D RGGs contained within
a unit square, [0,1]^2. We distribute the domain such that each process receives N/p vertices
(where p is the total number of processes). Each process owns a (1 x 1/p) portion of the
unit square, and d is computed as (please refer to Section 4 of the miniVite paper[?] for
details):
d = (dc + dt)/2
where dc = sqrt(ln(N) / (pi*N)) and dt = sqrt(2.0736 / (pi*N))
Therefore, the number of vertices (N) passed during execution on p processes must satisfy
the condition 1/p > d. Unlike miniVite, the edge weights of the generated graph are always
1.0, and not the Euclidean distance between vertices (because we only perform
communication in this microbenchmark, and no computation, so edge weights are irrelevant).
[?] Ghosh, Sayan, Mahantesh Halappanavar, Antonio Tumeo, Ananth Kalyanaraman, and
Assefaw H. Gebremedhin. "miniVite: A Graph Analytics Benchmarking Tool for Massively
Parallel Systems." In 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of
High Performance Computer Systems (PMBS), pp. 51-56. IEEE, 2018.
Please note, the default distribution of a graph generated by the in-built random
geometric graph generator causes a process to only communicate with its two
immediate neighbors. If you want to increase the communication intensity for
generated graphs, please use the "-p" option to specify an extra percentage of edges
to be generated, linking random vertices. As a side effect, this option
significantly increases the time required to generate the graph, so low values
are preferred. The maximum number of edges that can be added randomly must be less than
or equal to INT_MAX; at present we do not handle cases in which "-p <percent>" resolves
to more than INT_MAX extra edges.
We also allow users to pass any real-world graph as input. However, we expect an input
graph to be in a certain binary format, which we have observed to be more efficient than
reading ASCII format files. The code for binary conversion (from a variety of common
graph formats) is packaged separately with another software called Vite, which is our
distributed-memory implementation of graph community detection. Please follow the
instructions in the Vite README for binary file conversion. Vite can be downloaded from
(please don't use the past PNNL/PNL link to download Vite, the following GitHub link is
the correct one):
https://github.com/Exa-Graph/vite
Recently, we have added a non-MPI shared-memory version of this microbenchmark, which is
inspired by the STREAM benchmark, except arrays are replaced by graphs. The graph data
structure is CSR, and there are two tests: `Neighborhood Scan' and `Neighborhood Sum',
both of which scan graph neighborhoods (an innate pattern for many graph workloads) and
report average bandwidth and latency. Like the MPI-based neve, these tests can handle
both real-world (same binary format) and synthetic graphs (using the RGG generator as
discussed above).
Please contact the following for any queries or support:
Sayan Ghosh, PNNL (sg0 at pnnl dot gov)
Related paper (only covers the communication aspects):
Ghosh S, Tallent N, Halappanavar M. Characterizing Performance of Graph
Neighborhood Communication Patterns. IEEE Transactions on Parallel and
Distributed Systems. 2021 Aug 2.
https://ieeexplore.ieee.org/abstract/document/9503355
*******
Compile
*******
neve is a C++11 header-only library and requires an MPI implementation. It uses MPI
Send/Recv and collectives (for synthetic graph generation). Please update the Makefile
with compiler flags and use a C++11-compliant compiler of your choice. Invoke
`make clean; make` after setting the paths to MPI to generate the binary. Use `mpirun`,
`mpiexec`, or `srun` to execute the code with the specific runtime arguments mentioned
in the next section.
It is also possible to run neve on a network simulator such as SST-Macro: please enable
the ENABLE_SSTMACRO shell variable in the Makefile (and pass the path to the SST-Macro
build), and consult the SST-Macro documentation for running MPI codes with parameter
files.
Pass -DPRINT_DIST_STATS while building for printing distributed graph characteristics.
[Specific instructions for neve_threads]
For the non-MPI build on real-world graphs, please pass a suitable value (equal to the #sockets
or NUMA nodes on the system) to the GRAPH_FT_LOAD macro at compile-time, like -DGRAPH_FT_LOAD=4.
This is very important for `first touch' purposes, but has no effect when the RGG
codepath is selected. We have yet to enable first touch when the synthetic graph option
is selected (i.e., the RGG codepath), since the graph generation part is serial. This
will be enabled in a future commit.
We have also enabled default parameters such that executing without any arguments will
work. STREAM by default initializes arrays of 10,000,000 elements, so we have set the
default graph (the one generated if you just invoke ./neve_threads) to have dimensions
that yield about 10M edges. The graph generation part is serial, so large graphs can take
significantly longer to generate (but we believe reasonably sized graphs can be generated
fairly quickly).
Also, our edge data structure consists of a vertex end-point and a weight; optionally, we
support an edge data structure consisting of a head, a tail and a weight (vertex pair).
In order to enable the edge data structure that stores a `vertex pair',
-DEDGE_AS_VERTEX_PAIR must be passed (this option also supports the binary file format
of Grappolo*).
It is better to keep the other options in the Makefile intact, changing only the OpenMP
and optimization flags depending on the compiler. Please check the code to review other
macros of interest.
[*] https://github.com/Exa-Graph/grappolo
*****************
Execution options
*****************
E.g.:
mpiexec -n 2 bin/./neve_mpi -f karate.bin -w -t 0 -z 5
mpiexec -n 4 bin/./neve_mpi -f karate.bin -w -t 0 -s 2 -g 5
mpiexec -n 2 bin/./neve_mpi -l -n 100 -w
mpiexec -n 2 bin/./neve_mpi -n 100 -t 0
mpiexec -n 2 bin/./neve_mpi -p 2 -n 100 -w
bin/./neve_threads -n 64
bin/./neve_threads -f karate.bin
bin/./neve_threads -p 2 -n 100
Possible options for MPI version, i.e., neve_mpi (can be combined):
1. -f <bin-file> : Specify input binary file after this argument.
2. -b : Only valid for real-world inputs. Attempts to distribute an approximately
equal number of edges among processes; the number of vertices owned by a
particular process may become irregular. Increases the distributed graph creation
time due to serial overheads, but may improve overall execution time.
3. -n <vertices> : Only valid for synthetically generated inputs. Pass total number of
vertices of the generated graph.
4. -l : Use distributed LCG for randomly choosing edges. If this option
is not used, we will use the C++ random number generator (using
std::default_random_engine).
5. -p <percent> : Only valid for synthetically generated inputs. Specify percent of overall
edges to be randomly generated between processes.
6. -r <nranks> : This is used to control the number of aggregators in MPI I/O and is
meaningful when an input binary graph file is passed with option "-f".
naggr := (nranks > 1) ? (nprocs/nranks) : nranks;
7. -w : Report Bandwidth in MB/s[*].
8. -t <0|1|2> : Report Latency in microseconds[*]. Option '0' uses nonblocking Send/Recv,
Option '1' uses MPI_Neighbor_alltoall and Option '2' uses MPI_Neighbor_allgather.
9. -x <bytes> : Maximum data exchange size (in bytes).
10. -m <bytes> : Minimum data exchange size (in bytes).
11. -d <0|1> : Perform work in addition to message exchanges (for latency test, i.e., option `-t <>`)
between the process neighbors. `-d 0` invokes kernel that computes max weight per vertex
degree and `-d 1` invokes kernel that sums weight per vertex degree.
Both the kernels can use OpenMP.
12. -s <PE> : Analyze a single process neighborhood. Performs bidirectional message
exchanges between the process neighbors of a particular PE passed by the
user. This transforms the neighbor subgraph of a particular PE into a fully
connected graph.
13. -g <count> : Specify the maximum number of ghosts shared between process neighbors of a
particular PE. This option is only valid when -s <PE> is passed (see #12).
14. -z <percent> : Select a percentage of the ghost vertices of a real-world graph distributed
across processes as actual ghosts. For example, assume that a graph is
distributed across 4 processes, and PE#0 has two neighbors, PE#1 and PE#3,
with whom it shares 100 ghost vertices each. In such a configuration,
the b/w test would perform variable-sized message exchanges 100 times
between PE#0 and {PE#1, PE#3}. The -z <percent> option can limit the
number of message exchanges by selecting a percentage of the actual number
of ghosts while maintaining the overall structure of the original graph.
Hence, this option can help 'shrink' a graph. Only valid for the b/w test, because
the latency test will just perform message transfers between process neighborhoods.
15. -o <1-??> : Generates a comma-separated (default) MPI rank order based on the graph
distribution across processes: regular (1), weighted (2), regular & ascending order
of degrees (3), regular & descending order of degrees (4), weighted & ascending order
of degrees (5), weighted & descending order of degrees (6), weighted & normal
distribution of degrees (7), common neighbors & descending order of degrees (8),
common neighbors & ascending order of degrees (9), common neighbors & normal
distribution of degrees (10), and graph matching (>=11).
The default is weighted & ascending order of degrees (5).
16. -i <0-??> : Dump out the process graph (default: adjacency); if a value greater than 0
(e.g., -i 1) is passed, then the graph in directed SANDIA Chaco format is returned
(-i 3 is for weighted).
17. -a : Perform top-down BFS (similar to the Graph500 reference implementation). Build with
-DUSE_ALLREDUCE_FOR_EXIT, otherwise it sometimes hangs.
18. -j : Perform delta-stepping SSSP (similar to Graph500 reference implementation).
[*]Note: Unless -w or -t <...> is passed, the code will just load the graph and create the
graph data structure. This is deliberate, in case we want to measure the overhead of
loading a graph and time only the file I/O part (e.g., measuring the impact of different
numbers of aggregators through the -r <nranks> option).
Possible options for non-MPI version, i.e., neve_threads (can be combined):
1. -f <bin-file> : Specify input binary file after this argument.
2. -n <vertices> : Only valid for synthetically generated inputs. Pass total number of
vertices of the generated graph.
3. -l : Use distributed LCG for randomly choosing edges. If this option
is not used, we will use the C++ random number generator (using
std::default_random_engine).
4. -p <percent> : Only valid for synthetically generated inputs. Specify percent of overall
edges to be randomly generated between vertices.
5. -h : Print run options.