*-*-*-*-*-*-*-*-*
| | | | | | | | |
*-*-*-neve-*-*-*-
| | | | | | | | |
*-*-*-*-*-*-*-*-*
*****
About
*****
`neve` is a microbenchmark that distributes a graph (real-world or synthetic) across
processes and then performs variable-sized message exchanges (sized by the number
of ghost vertices shared between processes) between process neighbors, reporting
bandwidth and latency. Further, neve provides options to shrink a real-world graph (by a
percentage of the ghost vertices shared across processes), study the performance of
message exchanges for a specific process neighborhood, analyze the impact of an
edge-balanced real-world graph distribution, and generate a graph in memory. neve
follows the best practices of existing MPI microbenchmarks.
We require the total number of processes to be a power of 2 and the total number of
vertices to be perfectly divisible by the number of processes when the parallel RGG
generation options are used. This constraint does not apply to real-world graphs passed
to neve.
neve has an in-memory random geometric graph generator (just like our other mini-application,
miniVite[?]) that can be used for weak-scaling analysis. An n-D random geometric graph (RGG)
is generated by randomly placing N vertices in an n-D space and connecting pairs of vertices
whose Euclidean distance is less than or equal to d. We only consider 2D RGGs contained within
a unit square, [0,1]^2. We distribute the domain such that each process receives N/p vertices
(where p is the total number of processes). Each process owns a (1 x 1/p) portion of the
unit square, and d is computed as (please refer to Section 4 of the miniVite paper[?] for
details):
d = (dc + dt)/2
where dc = sqrt(ln(N) / (pi*N)) and dt = sqrt(2.0736 / (pi*N))
Therefore, the number of vertices (N) passed during execution on p processes must satisfy
the condition 1/p > d. Unlike miniVite, the edge weights of the generated graph are always
1.0, and not the Euclidean distance between vertices (because we only perform
communication in this microbenchmark, and no computation, so edge weights are irrelevant).
[?] Ghosh, Sayan, Mahantesh Halappanavar, Antonio Tumeo, Ananth Kalyanaraman, and
Assefaw H. Gebremedhin. "miniVite: A Graph Analytics Benchmarking Tool for Massively
Parallel Systems." In 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of
High Performance Computer Systems (PMBS), pp. 51-56. IEEE, 2018.
Please note, the default distribution of a graph generated by the in-built random
geometric graph generator causes a process to only communicate with its two
immediate neighbors. If you want to increase the communication intensity for
generated graphs, please use the "-p" option to specify an extra percentage of edges
to be generated, linking random vertices. As a side effect, this option
significantly increases the time required to generate the graph, so low values
are preferred. The maximum number of edges that can be added randomly must be less than
or equal to INT_MAX; at present we do not handle cases in which "-p <percent>" resolves
to more than INT_MAX extra edges.
We also allow users to pass any real-world graph as input. However, we expect an input
graph to be in a certain binary format, which we have observed to be more efficient than
reading ASCII format files. The code for binary conversion (from a variety of common
graph formats) is packaged separately with another software called Vite, which is our
distributed-memory implementation of graph community detection. Please follow the
instructions in the Vite README for binary file conversion. Vite can be downloaded from
(please don't use the past PNNL/PNL link to download Vite, the following GitHub link is
the correct one):
https://github.com/Exa-Graph/vite
Recently, we have added a non-MPI shared-memory version of this microbenchmark, which is
inspired by the STREAM benchmark, except arrays are replaced by graphs. The graph data
structure is CSR, and there are two tests: `Neighborhood Scan' and `Neighborhood Sum',
both of which scan graph neighborhoods (an innate pattern for many graph workloads) and
report average bandwidth and latency. Like the MPI-based neve, these tests can handle
both real-world (same binary format) and synthetic graphs (using the RGG generator as
discussed above).
Please contact the following for any queries or support:
Sayan Ghosh, PNNL (sg0 at pnnl dot gov)
Related paper (only covers the communication aspects):
Ghosh S, Tallent N, Halappanavar M. Characterizing Performance of Graph
Neighborhood Communication Patterns. IEEE Transactions on Parallel and
Distributed Systems. 2021 Aug 2.
https://ieeexplore.ieee.org/abstract/document/9503355
*******
Compile
*******
neve is a C++11 header-only library and requires an MPI implementation. It uses MPI
Send/Recv and collectives (for synthetic graph generation). Please update the Makefile
with compiler flags and use a C++11-compliant compiler of your choice. Invoke
`make clean; make` after setting the paths to MPI to generate the binary. Use `mpirun`,
`mpiexec`, or `srun` to execute the code with the specific runtime arguments mentioned
in the next section.
It is also possible to run neve on a network simulator such as SST-Macro: please enable
the ENABLE_SSTMACRO shell variable in the Makefile (and pass the path to the SST-Macro
build), and consult the SST-Macro documentation for running MPI codes with parameter
files.
Pass -DPRINT_DIST_STATS while building for printing distributed graph characteristics.
[Specific instructions for neve_threads]
For the non-MPI build on real-world graphs, please pass a suitable value (equal to the #sockets
or NUMA nodes on the system) to the GRAPH_FT_LOAD macro at compile-time, like -DGRAPH_FT_LOAD=4.
This is very important for `first touch' purposes, but has no effect when the RGG
codepath is selected. We have yet to enable first touch when the synthetic graph option
is selected (i.e., the RGG codepath), since the graph generation part is serial. This
will be enabled in a future commit.
We have also enabled default parameters such that executing without any arguments will
work. STREAM by default initializes arrays of 10,000,000 elements, so we have set the
default graph (the one generated if you just invoke ./neve_threads) to have dimensions
that yield about 10M edges. The graph generation part is serial, so large graphs can take
significantly longer to generate (but we believe reasonably sized graphs can be generated
fairly quickly).
Also, our edge data structure consists of a vertex end-point and a weight; optionally, we
support an edge data structure consisting of a head, a tail and a weight (vertex pair).
In order to enable the edge data structure that stores a `vertex pair',
-DEDGE_AS_VERTEX_PAIR must be passed (this option also supports the binary file format
of Grappolo*).
It is better to keep the other options in the Makefile intact, changing only the OpenMP
and optimization flags depending on the compiler. Please check the code to review other
macros of interest.
[*] https://github.com/Exa-Graph/grappolo
*****************
Execution options
*****************
E.g.:
mpiexec -n 2 bin/./neve_mpi -f karate.bin -w -t 0 -z 5
mpiexec -n 4 bin/./neve_mpi -f karate.bin -w -t 0 -s 2 -g 5
mpiexec -n 2 bin/./neve_mpi -l -n 100 -w
mpiexec -n 2 bin/./neve_mpi -n 100 -t 0
mpiexec -n 2 bin/./neve_mpi -p 2 -n 100 -w
bin/./neve_threads -n 64
bin/./neve_threads -f karate.bin
bin/./neve_threads -p 2 -n 100
Possible options for MPI version, i.e., neve_mpi (can be combined):
1. -f <bin-file> : Specify input binary file after this argument.
2. -b : Only valid for real-world inputs. Attempts to distribute an approximately
equal number of edges among processes; the number of vertices owned by a
particular process may become irregular. Increases the distributed graph creation
time due to serial overheads, but may improve overall execution time.
3. -n <vertices> : Only valid for synthetically generated inputs. Pass total number of
vertices of the generated graph.
4. -l : Use distributed LCG for randomly choosing edges. If this option
is not used, we will use the C++ random number generator (using
std::default_random_engine).
5. -p <percent> : Only valid for synthetically generated inputs. Specify percent of overall
edges to be randomly generated between processes.
6. -r <nranks> : This is used to control the number of aggregators in MPI I/O and is
meaningful when an input binary graph file is passed with option "-f".
naggr := (nranks > 1) ? (nprocs/nranks) : nranks;
7. -w : Report Bandwidth in MB/s[*].
8. -t <0|1|2> : Report Latency in microseconds[*]. Option '0' uses nonblocking Send/Recv,
Option '1' uses MPI_Neighbor_alltoall and Option '2' uses MPI_Neighbor_allgather.
9. -x <bytes> : Maximum data exchange size (in bytes).
10. -m <bytes> : Minimum data exchange size (in bytes).
11. -d <0|1> : Perform work in addition to message exchanges (for latency test, i.e., option `-t <>`)
between the process neighbors. `-d 0` invokes kernel that computes max weight per vertex
degree and `-d 1` invokes kernel that sums weight per vertex degree.
Both the kernels can use OpenMP.
12. -s <PE> : Analyze a single process neighborhood. Performs bidirectional message
exchanges between the process neighbors of a particular PE passed by the
user. This transforms the neighbor subgraph of a particular PE into a fully
connected graph.
13. -g <count> : Specify the maximum number of ghosts shared between process neighbors of a
particular PE. This option is only valid when -s <PE> is passed (see #12).
14. -z <percent> : Select a percentage of the ghost vertices of a real-world graph distributed
across processes as actual ghosts. For example, assume that a graph is
distributed across 4 processes, and PE#0 has two neighbors, PE#1 and PE#3,
with whom it shares 100 ghost vertices each. In such a configuration,
the b/w test would perform variable-sized message exchanges 100 times
between PE#0 and {PE#1, PE#3}. The -z <percent> option can limit the
number of message exchanges by selecting a percentage of the actual number
of ghosts while maintaining the overall structure of the original graph.
Hence, this option can help 'shrink' a graph. Only valid for the b/w test, because
the latency test will just perform message transfers between process neighborhoods.
15. -o <1-??> : Generates a comma-separated (default) MPI rank order based on the graph
distribution across processes: regular (1), weighted (2), regular & ascending order
of degrees (3), regular & descending order of degrees (4), weighted & ascending order
of degrees (5), weighted & descending order of degrees (6), weighted & normal
distribution of degrees (7), common neighbors & descending order of degrees (8),
common neighbors & ascending order of degrees (9), common neighbors & normal
distribution of degrees (10), and graph matching (>=11).
The default is weighted & ascending order of degrees (5).
16. -i <0-??> : Dump out the process graph (default: adjacency); if a value greater than 0
(e.g., -i 1) is passed, then the graph in directed SANDIA Chaco format is returned
(-i 3 is for weighted).
17. -a : Perform top-down BFS (similar to the Graph500 reference implementation). Build with
-DUSE_ALLREDUCE_FOR_EXIT, otherwise it sometimes hangs.
18. -j : Perform delta-stepping SSSP (similar to Graph500 reference implementation).
[*]Note: Unless -w or -t <...> is passed, the code will just load the graph and create the
graph data structure. This is deliberate, in case we want to measure the overhead of
loading a graph and time only the file I/O part (e.g., measuring the impact of different
numbers of aggregators through the -r <nranks> option).
Possible options for non-MPI version, i.e., neve_threads (can be combined):
1. -f <bin-file> : Specify input binary file after this argument.
2. -n <vertices> : Only valid for synthetically generated inputs. Pass total number of
vertices of the generated graph.
3. -l : Use distributed LCG for randomly choosing edges. If this option
is not used, we will use the C++ random number generator (using
std::default_random_engine).
4. -p <percent> : Only valid for synthetically generated inputs. Specify percent of overall
edges to be randomly generated between vertices.
5. -h : Print run options.