A case study of the MPI Broadcast and MPI Reduce collective operations.
Explore the docs »
Read the official report »
View Demo
·
Report Bug
·
Request Feature
Table of Contents
This repository contains the collected data and the analysis of the latencies of the MPI_Bcast and MPI_Reduce collective operations. The study has been conducted as part of the final exam for the High Performance Computing (HPC) course held at the University of Trieste (UniTS) during the academic year 2023-2024.
The data have been collected mainly on EPYC nodes of the ORFEO cluster at AREA Science Park, Basovizza (TS), in January/February 2024, using the well-known OSU benchmark, and are available in the datasets/ folder. These nodes are equipped with AMD EPYC 7H12 (Rome) processors.
More in-depth information and further details are available in the report attached to this repository.
This repository also contains a Python package named epyc, which implements a small simulative model of EPYC nodes that has been used to carry out the computations needed for the analysis.
The module is essentially a collection of classes and methods that allow the user to simulate MPI core allocation on a real EPYC machine. In fact, the module allows the user to create Node objects and to initialize a certain number of processes on them according to different mapping policies, mimicking the --map-by option of the mpirun command of the MPI library.
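As a rough illustration, process allocation on simulated nodes could look like the sketch below. Apart from the Node class mentioned above, the constructor arguments and function names used here (such as allocate and map_by) are hypothetical placeholders and may differ from the actual epyc API; refer to the apps/ scripts and the built-in documentation for the real interface.

```python
import epyc

# Two empty simulated EPYC nodes (2 sockets x 64 cores each).
# NOTE: constructor arguments and helper names below are illustrative guesses.
node0 = epyc.Node()
node1 = epyc.Node()

# Hypothetical helper: place 132 MPI processes on the two nodes following a
# "socket" mapping policy, analogous to `mpirun -np 132 --map-by socket`.
epyc.allocate([node0, node1], n_processes=132, map_by="socket")

# Printing a node presumably renders its core-occupancy grid, as in the
# example output further down in this README.
print(node0)
print(node1)
```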
The module is also able to simulate the latency of the MPI_Bcast and MPI_Reduce collective operations on the EPYC nodes, using the data collected on the ORFEO cluster. The latency is predicted based on a point-to-point communication model. Once more, further details on the model are available in the report in this repository. The data have been collected through the submission of several jobs, which can be found in the jobs/ folder.
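To give an intuition of how a point-to-point model translates into collective latencies, the sketch below uses the textbook step counts of the linear, chain, and binary-tree algorithms with a single, uniform point-to-point latency. This is only an illustrative simplification, not the exact model of the report, which is built on the latencies actually measured on ORFEO.

```python
import math

def broadcast_latency(p2p_latency_us: float, n_procs: int, algorithm: str) -> float:
    """Illustrative collective-latency estimate from one point-to-point latency.

    Every message is assumed to cost the same latency, ignoring the fact that
    the point-to-point latency depends on where the communicating cores sit
    (same socket, same node, or different nodes).
    """
    if algorithm == "linear":    # root sends to the other P-1 processes in turn
        steps = n_procs - 1
    elif algorithm == "chain":   # message hops along a chain of P-1 links
        steps = n_procs - 1
    elif algorithm == "binary":  # binary-tree fan-out: ceil(log2(P)) stages
        steps = math.ceil(math.log2(n_procs))
    else:
        raise ValueError(f"unknown algorithm: {algorithm}")
    return steps * p2p_latency_us

# Example with 256 processes and a made-up 0.2 us point-to-point latency.
for algo in ("linear", "chain", "binary"):
    print(f"{algo:>6}: {broadcast_latency(0.2, 256, algo):.1f} us")
```

Under such a model the binary-tree variant needs only a logarithmic number of stages, which is why it is expected to scale much better than the linear and chain variants as the number of processes grows.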
The module also offers a few utility functions to plot and to perform statistical analysis on the collected latency data.
Most of the implemented functions and classes are documented, hence further information about inputs and usage can be obtained with the help function in Python.
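For instance, from a Python shell (assuming the Node class is exposed at the package level, as the wildcard import shown below suggests):

```python
import epyc

help(epyc)       # package-level overview
help(epyc.Node)  # documentation of the Node class and its methods
```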
Some usage examples can be found in the apps/ folder or in the Jupyter notebooks in the notebooks/ folder.
If you want to use the epyc module, you can follow these steps. In any case, it is recommended to take a look at the scripts in the apps/ folder for possible usage examples.
Prerequisites needed to repeat the measurements and the data analysis:
- Python 3.10 or higher
- OSU MPI Benchmarks (version 7.3) installed on the target machine
- OpenMPI library installed on the target machine
- Access to the ORFEO cluster or any equivalent HPC platform with the SLURM scheduler
The module comes with a setup.py
file in the root directory, hence it can be installed with the following command:
pip install -e .
from the root directory of the project. After that, the module can be imported in any Python script or notebook with the following command:
import epyc
Alternatively, to also use the utility functions, one can import:
from epyc import *
from utils import *
Alternatively, the modules can be used by manually updating the PYTHONPATH environment variable before running the scripts or notebooks.
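For instance, a script or notebook can achieve the same effect at runtime by prepending the repository location to sys.path (the path below is a placeholder for wherever the repository has been cloned; adjust it if the packages live in a subfolder):

```python
import sys

# Placeholder path: point it to the directory containing the epyc package
# (and the utils module) before importing them.
sys.path.insert(0, "/path/to/this/repository")

import epyc
```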
The example.py
script in the apps/
folder contains some usage examples of the implemented classes and methods. The script can be run with the following command:
python apps/examples.py
from the root directory. Running the script will produce the following output:
Now these nodes are empty:
Node 0:
[ASCII core grid: 2 sockets × 64 cores each, all cores empty]
Node 1:
[ASCII core grid: 2 sockets × 64 cores each, all cores empty]
Now we initialize the nodes with 2 processes each and map-by node:
Node 0:
[ASCII core grid: two occupied cores, remaining cores empty]
Node 1:
[ASCII core grid: two occupied cores, remaining cores empty]
Now we re-allocate the processes with socket mapping, going up to 132 processes.
Let's see where each process is:
Node 0:
┌─────────────────────────────────┐ ┌─────────────────────────────────┐
│   0   2   4   6   8  10  12  14 │ │   1   3   5   7   9  11  13  15 │
│  16  18  20  22  24  26  28  30 │ │  17  19  21  23  25  27  29  31 │
│  32  34  36  38  40  42  44  46 │ │  33  35  37  39  41  43  45  47 │
│  48  50  52  54  56  58  60  62 │ │  49  51  53  55  57  59  61  63 │
│  64  66  68  70  72  74  76  78 │ │  65  67  69  71  73  75  77  79 │
│  80  82  84  86  88  90  92  94 │ │  81  83  85  87  89  91  93  95 │
│  96  98 100 102 104 106 108 110 │ │  97  99 101 103 105 107 109 111 │
│ 112 114 116 118 120 122 124 126 │ │ 113 115 117 119 121 123 125 127 │
└─────────────────────────────────┘ └─────────────────────────────────┘
Node 1:
┌─────────────────────────────────┐ ┌─────────────────────────────────┐
│ 128 130                         │ │ 129 131                         │
│                                 │ │                                 │
│                                 │ │                                 │
│                                 │ │                                 │
│                                 │ │                                 │
│                                 │ │                                 │
│                                 │ │                                 │
│                                 │ │                                 │
└─────────────────────────────────┘ └─────────────────────────────────┘
We now re-allocate the nodes, filling them completely with 256 processes and map-by core.
In fact, here is their status:
Node 0:
[ASCII core grid: all 128 cores on both sockets occupied]
Active cores: 128 / 128
Empty cores: 0 / 128
Active sockets: 2 / 2
Empty sockets: 0 / 2
Node 1:
[ASCII core grid: all 128 cores on both sockets occupied]
Active cores: 128 / 128
Empty cores: 0 / 128
Active sockets: 2 / 2
Empty sockets: 0 / 2
We can simulate different collective operations and see their latency; for instance, here for a message size of 1 B:
- linear broadcast latency: 50.573211704623475 us
- chain broadcast latency: 55.030291447907615 us
- binary broadcast latency: 11.795976127944595 us
- linear reduce latency: 0.29013244483611006 us
- chain reduce latency: 34.846116497184674 us
- binary reduce latency: 1.785089908861254 us
Distributed under the MIT License. See LICENSE.txt
for more information.
| Contact Me | |
|---|---|
| marcotallone85@gmail.com | |
| LinkedIn Page | |
| GitHub | marcotallone |