
Modelling of MPI collective operations latencies: Broadcast and Reduce operations. UniTS, SDIC, 2023-2024

marcotallone/collective-operations-latency


Modeling of MPI Collective Operations

A case study of the MPI Broadcast and MPI Reduce collective operations.
Explore the docs »
Read the official report »

View Demo · Report Bug · Request Feature

📑 Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. License
  5. Contact
  6. References
  7. Acknowledgments

About The Project

This repository contains the collected data and the analysis of the latencies of the MPI_Bcast and MPI_Reduce collective operations. The study has been conducted as part of the final exam for the High Performance Computing (HPC) course held at the University of Trieste (UniTS) during the academic year 2023-2024.
The data have mainly been collected on the EPYC nodes of the ORFEO cluster at AREA Science Park, Basovizza (TS), in January/February 2024 using the well-known OSU Micro-Benchmarks, and are available in the datasets/ folder. These nodes are equipped with AMD EPYC 7H12 (Rome) processors.
More in-depth information and further details are available in the report attached to this repository.
This repository also contains a Python package named epyc, a small simulation model of EPYC nodes used to carry out the computations needed for the analysis. The module is essentially a collection of classes and methods that allow the user to simulate MPI core allocation on a real EPYC machine: it allows creating Node objects and initializing a certain number of processes on them according to different mapping policies, as done by the --map-by option of the mpirun command of the MPI library.
The module can also simulate the latency of the MPI_Bcast and MPI_Reduce collective operations on the EPYC nodes, using the data collected on the ORFEO cluster. The latency is predicted with a point-to-point communication model; once more, further details on the model are available in the report in this repository. The data have been collected through the submission of several jobs, which can be found in the jobs/ folder.
The module also offers a few utility functions to plot the collected latency data and to perform statistical analysis on it.
Most of the implemented functions and classes are documented, so further information about inputs and usage can be obtained with the help function in Python.
Some usage examples can be found in the apps/ folder or in the Jupyter notebooks in the notebooks/ folder.

Built With

C Python NeoVim Bash

(back to top)

Getting Started

If you want to use the implemented epyc module you can follow these steps. It is in any case recommended to take a look at the scripts in the apps/ folder for usage examples.

(back to top)

Prerequisites

Prerequisites needed to repeat the measurements and the data analysis:

  • Python 3.10 or higher
  • OSU Micro-Benchmarks (version 7.3) installed on the target machine
  • OpenMPI library installed on the target machine
  • Access to ORFEO cluster or any equivalent HPC platform with SLURM scheduler

Installation

The module comes with a setup.py file in the root directory, hence it can be installed with the following command:

pip install -e .

from the root directory of the project. After that, the module can be imported in any Python script or notebook with the following command:

import epyc

Alternatively, to also use the utility functions, one can import:

from epyc import *
from utils import *

Alternatively, the modules can be used by manually updating the PYTHONPATH environment variable before running the scripts or notebooks.
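For example, from a POSIX shell (the path below is a placeholder for wherever you cloned the repository):

```shell
# Make the epyc and utils modules importable without installing the package.
export PYTHONPATH="/path/to/collective-operations-latency:$PYTHONPATH"
```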

(back to top)

Usage

The examples.py script in the apps/ folder contains some usage examples of the implemented classes and methods. The script can be run with the following command:

python apps/examples.py

from the root directory. Running the script will produce the following output:

Now these nodes are empty:

Node 0:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Node 1:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Now we initialize the nodes with 2 processes each and mapby node:

Node 0:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ βœ…β¬›β¬›β¬›β¬›β¬›β¬›β¬› β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Node 1:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ βœ…β¬›β¬›β¬›β¬›β¬›β¬›β¬› β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚  β”‚ ⬛⬛⬛⬛⬛⬛⬛⬛ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Now we re-allocate the processes with socket mapping and going up to 132 processes
Let's see where each process is:

Node 0:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   0   2   4   6   8  10  12  14 β”‚  β”‚   1   3   5   7   9  11  13  15 β”‚
β”‚  16  18  20  22  24  26  28  30 β”‚  β”‚  17  19  21  23  25  27  29  31 β”‚
β”‚  32  34  36  38  40  42  44  46 β”‚  β”‚  33  35  37  39  41  43  45  47 β”‚
β”‚  48  50  52  54  56  58  60  62 β”‚  β”‚  49  51  53  55  57  59  61  63 β”‚
β”‚  64  66  68  70  72  74  76  78 β”‚  β”‚  65  67  69  71  73  75  77  79 β”‚
β”‚  80  82  84  86  88  90  92  94 β”‚  β”‚  81  83  85  87  89  91  93  95 β”‚
β”‚  96  98 100 102 104 106 108 110 β”‚  β”‚  97  99 101 103 105 107 109 111 β”‚
β”‚ 112 114 116 118 120 122 124 126 β”‚  β”‚ 113 115 117 119 121 123 125 127 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Node 1:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 128 130                         β”‚  β”‚ 129 131                         β”‚
β”‚                                 β”‚  β”‚                                 β”‚
β”‚                                 β”‚  β”‚                                 β”‚
β”‚                                 β”‚  β”‚                                 β”‚
β”‚                                 β”‚  β”‚                                 β”‚
β”‚                                 β”‚  β”‚                                 β”‚
β”‚                                 β”‚  β”‚                                 β”‚
β”‚                                 β”‚  β”‚                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
We now re-allocated the node filling them completely with 256 processes and mapby core
In fact here is their status:

Node 0:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Active cores:	128 / 128
Empty cores:	0 /128
Active sockets:	2 / 2
Empty sockets:	0 / 2

Node 1:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚  β”‚ βœ…βœ…βœ…βœ…βœ…βœ…βœ…βœ… β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Active cores:	128 / 128
Empty cores:	0 /128
Active sockets:	2 / 2
Empty sockets:	0 / 2
We can simulate different collective operations seeing their latency, for instance here for a message size of 1B:
	 - linear broadcast latency: 50.573211704623475 us
	 - chain broadcast latency: 55.030291447907615 us
	 - binary broadcast latency: 11.795976127944595 us
	 - linear reduce latency: 0.29013244483611006 us
	 - chain reduce latency: 34.846116497184674 us
	 - binary reduce latency: 1.785089908861254 us

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Contact Me
Mail marcotallone85@gmail.com
LinkedIn LinkedIn Page
GitHub marcotallone

(back to top)

References

(back to top)

Acknowledgments

(back to top)
