Load Balancing is slower #321

Open
jose-pinzon0407 opened this issue Jun 4, 2024 · 4 comments

Using load balancing via GeneralDomainDecomposition is slower than DomainDecomposition, contrary to what is reported.

Version used: ls1-MarDyn 1.2.0_master_98d77b7d3, commit 98d77b7, compiled with gcc 12.

Main Problem

The problem was tested with the droplet coalescence example; it was found with the last script run in examples/DropletCoalescene/vle/config_5_droplet.xml, using 4 nodes with 72 tasks per node on HSuper.
The run should be faster when the parallelisation node is set to GeneralDomainDecomposition and the loadBalancer node is set to all than when parallelisation is set to DomainDecomposition; instead it is slower.
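
For reference, the two runs differ only in the parallelisation node of the XML config; a minimal sketch (matching the full config posted in the comments below):

<!-- expected to be faster, slower in practice -->
<parallelisation type="GeneralDomainDecomposition">
  <loadBalancer type="ALL"/>
</parallelisation>

<!-- faster in practice -->
<parallelisation type="DomainDecomposition">
</parallelisation>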

The configurations used appear in the simulation output (not on consecutive lines) as:

DomainDecomposition

Parallelisation type: DomainDecomposition
DomainDecompMPIBase: Using IndirectCommunicationScheme

GeneralDomainDecomposition

Parallelisation type: GeneralDomainDecomposition
DomainDecompMPIBase: Using DirectCommunicationScheme with push-pull neighbors
Chosen Load Balancer: all

Parallel Issues

Both configurations can be compared directly via the output line Decomposition took. For the case described above, the two measured times are:

  • GeneralDomainDecomposition: Decomposition took: 32.665 sec
  • DomainDecomposition: Decomposition took: 9.26896 sec

These times were obtained with the settings given above.

NeighbourCommunicationScheme

Starting from the DomainDecompMPIBase class, the issue can be traced to the routine exchangeMoleculesMPI, which calls the class member _neighbourCommunicationsScheme and its member function exchangeMoleculesMPI. Instrumenting that function (found in parallel/NeighbourCommunicationScheme.cpp) with the Timer class for the derived class DirectNeighbourCommunicationScheme shows that most of the time is spent inside an if statement that checks msgType == LEAVING_AND_HALO_COPIES. Further investigation is required.
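
As a sketch of how such a measurement can look (assuming a timer registered under the hypothetical name "COMM_SCHEME_TEST", following the steps under "Adding timers" below):

// In DirectNeighbourCommunicationScheme::exchangeMoleculesMPI()
// (parallel/NeighbourCommunicationScheme.cpp); "COMM_SCHEME_TEST" is a
// hypothetical timer name registered as described under "Adding timers".
Timer* timer = global_simulation->timers()->getTimer("COMM_SCHEME_TEST");
if (msgType == LEAVING_AND_HALO_COPIES) {
	timer->start();
	// ... existing leaving-and-halo exchange logic ...
	timer->stop();
}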

Adding timers

If someone wants to add timers to investigate this issue, proceed as follows (a code sketch follows the list):

  1. In io/TimerProfiler.cpp, modify the function readInitialTimersFromFile: add your own category (optional) to timerClasses and your own timer (mandatory) to timerAttrs.
  2. In Simulation.cpp, modify preSimLoopSteps by adding the output strings for your respective timers. The names have to match in both files.
  3. In the file you want to measure, create a pointer to your timer using the name given in step 1 via global_simulation->timers()->getTimer("NAME"), then use start and stop.
  4. The measured time should be visible in the output; if you appended your own category at the end of the list in TimerProfiler, your timers appear at the end.
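
A minimal sketch of steps 1-3, with "MY_TIMER" (and its category) as hypothetical placeholders:

// Step 1 (io/TimerProfiler.cpp, readInitialTimersFromFile):
//   add a category to timerClasses (optional) and an entry for
//   "MY_TIMER" to timerAttrs, following the existing entries there.
// Step 2 (Simulation.cpp, preSimLoopSteps):
//   add an output string for "MY_TIMER"; the name must match step 1.
// Step 3 (in the code to be measured):
Timer* myTimer = global_simulation->timers()->getTimer("MY_TIMER");
myTimer->start();
sectionToMeasure();  // placeholder for the code being timed
myTimer->stop();
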
@FG-TUM added the bug, load balancing, and performance labels on Jun 4, 2024
@FG-TUM self-assigned this on Aug 12, 2024
FG-TUM commented Sep 3, 2024

An alternative, simpler setup that reproduces the issue at a smaller scale:

Run both alternatives, option A and option B, from the following config (with only one of the two parallelisation nodes active per run).

config.xml
<?xml version='1.0' encoding='UTF-8'?>
<mardyn version="20100525" >

  <refunits type="SI" >
    <length unit="nm">0.1</length>
    <mass unit="u">1</mass>
    <energy unit="K">1</energy>
  </refunits>

  <simulation type="MD" >
    <integrator type="Leapfrog" >
      <timestep unit="reduced" >0.00182367</timestep>
    </integrator>

    <run>
      <currenttime>0.0</currenttime>
      <production>
        <steps>100</steps>
      </production>
    </run>

    <ensemble type="NVT">
      <temperature unit="reduced" >110</temperature>
      <domain type="box">
        <lx>100</lx>
        <ly>100</ly>
        <lz>100</lz>
      </domain>

      <components>
        <moleculetype id="1" name="LJ">
          <site type="LJ126" id="1" name="Ar">
            <coords> <x>0.0</x> <y>0.0</y> <z>0.0</z> </coords>
            <mass>1</mass>
            <sigma>1</sigma>
            <epsilon>1</epsilon>
            <shifted>true</shifted>
          </site>
        </moleculetype>
      </components>

      <phasespacepoint>
        <generator name="MultiObjectGenerator">
          <objectgenerator>
            <filler type="GridFiller">
              <lattice system="cubic" centering="face">
                <vec id="a"> <x>1</x> <y>0</y> <z>0</z> </vec>
                <vec id="b"> <x>0</x> <y>1</y> <z>0</z> </vec>
                <vec id="c"> <x>0</x> <y>0</y> <z>1</z> </vec>
              </lattice>
              <basis>
                <site>
                  <componentid>0</componentid>
                  <coordinate> <x>0.5</x> <y>0.5</y> <z>0.5</z> </coordinate>
                </site>
              </basis>
              <latticeOccupancy>1</latticeOccupancy>
              <density>0.000501</density>
            </filler>
            <object type="Cuboid">
              <lower> <x>0</x> <y>0</y> <z>0</z> </lower>
              <upper> <x>100</x> <y>100</y> <z>100</z> </upper>
            </object>
            <velocityAssigner type="EqualVelocityDistribution"></velocityAssigner>
          </objectgenerator>
        </generator>
      </phasespacepoint>
    </ensemble>

    <algorithm>
<!-- ======================== RELEVANT PART START ======================== -->
      <!-- OPTION A -->
      <parallelisation type="DomainDecomposition">
        <timerForLoad>SIMULATION_FORCE_CALCULATION</timerForLoad>
      </parallelisation>

      <!-- OPTION B -->
      <parallelisation type="GeneralDomainDecomposition">
        <timerForLoad>SIMULATION_FORCE_CALCULATION</timerForLoad>
        <updateFrequency>5</updateFrequency>
        <loadBalancer type="ALL"/>
      </parallelisation>
<!-- ======================== RELEVANT PART END ========================== -->

      <datastructure type="LinkedCells">
        <cellsInCutoffRadius>1</cellsInCutoffRadius>
      </datastructure>
      <cutoffs type="CenterOfMass" >
      <defaultCutoff unit="reduced" >2.5</defaultCutoff>
        <radiusLJ unit="reduced" >2.5</radiusLJ>
      </cutoffs>
      <electrostatic type="ReactionField" >
        <epsilon>1.0e+10</epsilon>
      </electrostatic>
    </algorithm>

    <output>
    </output>
  </simulation>
</mardyn>

Run with e.g.

OMP_NUM_THREADS=1 mpirun -n 16 --oversubscribe src/MarDyn config.xml | tee output_OptX.out

Then compare the two outputs via:

grep -Erni -A 5 'Decomposition took' output*.out

The output from GeneralDomainDecomposition should show a longer Decomposition time and, in particular, a significantly higher Communication time. The latter is measured by the timer SIMULATION_MPI_OMP_COMMUNICATION, which only covers the call to _domainDecomposition->balanceAndExchangeInitNonBlocking() in Simulation.cpp.
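
Conceptually, the measured region looks like this (sketch only; the exact start/stop placement lives in Simulation.cpp):

// Sketch of what SIMULATION_MPI_OMP_COMMUNICATION covers:
global_simulation->timers()->getTimer("SIMULATION_MPI_OMP_COMMUNICATION")->start();
_domainDecomposition->balanceAndExchangeInitNonBlocking(/* ... */);
global_simulation->timers()->getTimer("SIMULATION_MPI_OMP_COMMUNICATION")->stop();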

FG-TUM commented Sep 3, 2024

The difference from the perspective of the decompositions is that while both call DomainDecompMPIBase::exchangeMoleculesMPI(), they use different modes:

  • DomainDecomposition : LEAVING_AND_HALO_COPIES
  • GeneralDomainDecomposition : LEAVING_ONLY and HALO_COPIES

This, however, shouldn't really make a difference, because the two calls from GDD should do exactly the same as the single call from DD. The latter also invokes moleculeContainer->deleteOuterParticles(), but GDD does this as well, between its two calls.
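
In pseudocode, the two call sequences that should be equivalent (cf. the non-rebalance branch visible in the diff below):

// DomainDecomposition: one combined call
exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_AND_HALO_COPIES);

// GeneralDomainDecomposition: two calls plus cleanup in between
exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
moleculeContainer->deleteOuterParticles();
exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES);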

FG-TUM commented Sep 4, 2024

The non-rebalance case in GeneralDomainDecomposition is exactly the same as in DomainDecomposition and thus equally fast (confirmed by tests!).

Throwing counters around every single method call in the rebalance part of GeneralDomainDecomposition::balanceAndExchange indicates that GeneralDomainDecomposition::initCommPartners takes by far the most time.

Rough sketch of how this was done:

  1. Add a class member and initialize it:
    std::map<std::string, long> counters{
        {"Rebalancing - numIterations", 0},
        {"Rebalancing - exchangeMoleculesMPI 1", 0},
        {"Rebalancing - deleteOuterParticles", 0},
        {"Rebalancing - rebalance", 0},
        {"Rebalancing - latchToGridSize", 0},
        {"Rebalancing - migrateParticles", 0},
        {"Rebalancing - initCommPartners", 0},
        {"Rebalancing - exchangeMoleculesMPI 2", 0},
        {"SendLeavingWithCopies - numIterations", 0},
        {"SendLeavingWithCopies - exchangeMoleculesMPI", 0},
    };
  2. Add measurements around every method call e.g.:
    auto start = std::chrono::high_resolution_clock::now();
    DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
    counters["Rebalancing - exchangeMoleculesMPI 1"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
  3. Dump everything in the destructor into one file per rank:
    GeneralDomainDecomposition::~GeneralDomainDecomposition() {
    
    	int myRank{};
    	MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    
    	std::ofstream logFile{"counters_r" + std::to_string(myRank)};
    	for (const auto &[name, value] : counters) {
    		logFile << name << " = " << value << "\n";
    	}
    }
  4. Pretty-print and investigate the top 3 offenders per Rank:
    for f in counters_r* 
    do
      cat $f \
        | column -t -s '-=' -o '|' \
        | sort -t '|' -k3,3gr -o $f
    done
    head -n 3 counters_r*

FG-TUM commented Sep 5, 2024

Extensive timers were added to:

  • GeneralDomainDecomposition::balanceAndExchange()
  • DirectNeighbourCommunicationScheme::initCommunicationPartners()
  • NeighborAcquirer::acquireNeighbors()
The diff below adds these counters:

diff --git a/src/parallel/GeneralDomainDecomposition.cpp b/src/parallel/GeneralDomainDecomposition.cpp
index efd6aa750..e323e0c07 100644
--- a/src/parallel/GeneralDomainDecomposition.cpp
+++ b/src/parallel/GeneralDomainDecomposition.cpp
@@ -69,7 +69,16 @@ void GeneralDomainDecomposition::initializeALL() {
 
 double GeneralDomainDecomposition::getBoundingBoxMin(int dimension, Domain* /*domain*/) { return _boxMin[dimension]; }
 
-GeneralDomainDecomposition::~GeneralDomainDecomposition() = default;
+GeneralDomainDecomposition::~GeneralDomainDecomposition() {
+
+	int myRank{};
+	MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
+
+	std::ofstream logFile{"counters_r" + std::to_string(myRank)};
+	for (const auto &[name, value] : counters) {
+		logFile << name << " = " << value << std::endl;
+	}
+}
 
 double GeneralDomainDecomposition::getBoundingBoxMax(int dimension, Domain* /*domain*/) { return _boxMax[dimension]; }
 
@@ -80,9 +89,31 @@ bool GeneralDomainDecomposition::queryRebalancing(size_t step, size_t updateFreq
 
 void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bool forceRebalancing,
 													ParticleContainer* moleculeContainer, Domain* domain) {
+
+	if (counters.empty()) {
+		counters["Rebalancing - numIterations"] = 0;
+		counters["Rebalancing - exchangeMoleculesMPI 1"] = 0;
+		counters["Rebalancing - deleteOuterParticles"] = 0;
+		counters["Rebalancing - rebalance"] = 0;
+		counters["Rebalancing - latchToGridSize"] = 0;
+		counters["Rebalancing - migrateParticles"] = 0;
+		counters["Rebalancing - initCommPartners"] = 0;
+		counters["Rebalancing - exchangeMoleculesMPI 2"] = 0;
+		counters["SendLeavingWithCopies - numIterations"] = 0;
+		counters["SendLeavingWithCopies - exchangeMoleculesMPI"] = 0;
+	}
+
 	const bool rebalance =
 		queryRebalancing(_steps, _rebuildFrequency, _initPhase, _initFrequency, lastTraversalTime) or forceRebalancing;
+	const auto moep = [&](const auto &msg) {
+		int myRank{0};
+		MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
+		if (myRank == 0 ) {
+			std::cout << msg << "\n";
+		}
+	};
 	if (_steps == 0) {
+		moep("_steps == 0");
 		// ensure that there are no outer particles
 		moleculeContainer->deleteOuterParticles();
 		// init communication partners
@@ -90,11 +121,16 @@ void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bo
 		DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES);
 	} else {
 		if (rebalance) {
+			++counters["Rebalancing - numIterations"];
 			// first transfer leaving particles
+			auto start = std::chrono::high_resolution_clock::now();
 			DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
+			counters["Rebalancing - exchangeMoleculesMPI 1"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
 			// ensure that there are no outer particles
+			start = std::chrono::high_resolution_clock::now();
 			moleculeContainer->deleteOuterParticles();
+			counters["Rebalancing - deleteOuterParticles"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
 			// rebalance
 			Log::global_log->debug() << "rebalancing..." << std::endl;
@@ -102,13 +138,21 @@ void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bo
 			Log::global_log->set_mpi_output_all();
 			Log::global_log->debug() << "work:" << lastTraversalTime << std::endl;
 			Log::global_log->set_mpi_output_root(0);
+			start = std::chrono::high_resolution_clock::now();
 			auto [newBoxMin, newBoxMax] = _loadBalancer->rebalance(lastTraversalTime);
+			counters["Rebalancing - rebalance"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
 			if (_latchGridSize.has_value()) {
+				start = std::chrono::high_resolution_clock::now();
 				std::tie(newBoxMin, newBoxMax) = latchToGridSize(newBoxMin, newBoxMax);
+				counters["Rebalancing - latchToGridSize"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
 			}
 			// migrate the particles, this will rebuild the moleculeContainer!
 			Log::global_log->debug() << "migrating particles" << std::endl;
+			start = std::chrono::high_resolution_clock::now();
 			migrateParticles(domain, moleculeContainer, newBoxMin, newBoxMax);
+			counters["Rebalancing - migrateParticles"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
 #ifndef MARDYN_AUTOPAS
 			// The linked cells container needs this (I think just to set the cells to valid...)
@@ -121,13 +165,24 @@ void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bo
 
 			// init communication partners
 			Log::global_log->debug() << "updating communication partners" << std::endl;
+			start = std::chrono::high_resolution_clock::now();
 			initCommPartners(moleculeContainer, domain);
+			counters["Rebalancing - initCommPartners"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
 			Log::global_log->debug() << "rebalancing finished" << std::endl;
+			start = std::chrono::high_resolution_clock::now();
 			DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES);
+			counters["Rebalancing - exchangeMoleculesMPI 2"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
 		} else {
 			if (sendLeavingWithCopies()) {
+				++counters["SendLeavingWithCopies - numIterations"];
+				const auto start = std::chrono::high_resolution_clock::now();
 				DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_AND_HALO_COPIES);
+				counters["SendLeavingWithCopies - exchangeMoleculesMPI"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
 			} else {
+				moep("not sendLeavingWithCopies");
 				DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
 				moleculeContainer->deleteOuterParticles();
 				DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES);
diff --git a/src/parallel/GeneralDomainDecomposition.h b/src/parallel/GeneralDomainDecomposition.h
index 12dd37f02..253610c38 100644
--- a/src/parallel/GeneralDomainDecomposition.h
+++ b/src/parallel/GeneralDomainDecomposition.h
@@ -13,6 +13,7 @@
 #include "particleContainer/ParticleContainer.h"
 #include "DomainDecompMPIBase.h"
 #include "LoadBalancer.h"
+#include "map"
 
 /**
  * This decomposition is meant to be able to call arbitrary load balancers.
@@ -94,6 +95,8 @@ public:
 		throw std::runtime_error("GeneralDomainDecomposition::getNeighboursFromHaloRegion() not yet implemented");
 	}
 
+	std::map<std::string, long> counters{};
+
 private:
 	/**
 	 * Method that initializes the ALLLoadBalancer
diff --git a/src/parallel/NeighborAcquirer.cpp b/src/parallel/NeighborAcquirer.cpp
index 7f1d00784..6b74ba891 100644
--- a/src/parallel/NeighborAcquirer.cpp
+++ b/src/parallel/NeighborAcquirer.cpp
@@ -9,6 +9,30 @@
 #include "Domain.h"
 #include "HaloRegion.h"
 
+std::map<std::string, long> NeighborAcquirer::counters{
+	{"acquireNeighbors - init", 0},
+	{"acquireNeighbors - fill send buffer", 0},
+	{"acquireNeighbors - count global receive bytes (Allgather)", 0},
+	{"acquireNeighbors - build stride", 0},
+	{"acquireNeighbors - gather regions (Allgatherv)", 0},
+	{"acquireNeighbors - parse received regions", 0},
+	{"acquireNeighbors - merge char arrays", 0},
+	{"acquireNeighbors - count regions to send (Allreduce)", 0},
+	{"acquireNeighbors - send regions to send (Isend)", 0},
+	{"acquireNeighbors - receive regions and deserialize (Get_count, Recv)", 0},
+	{"acquireNeighbors - synchronize (Wait)", 0},
+	{"acquireNeighbors - barrier (Barrier)", 0},
+};
+void NeighborAcquirer::dumpCounters() {
+	int myRank{};
+	MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
+
+	std::ofstream logFile{"countersNeighborAcqu_r" + std::to_string(myRank)};
+	for (const auto &[name, value] : counters) {
+		logFile << name << " = " << value << '\n';
+	}
+}
+
 /*
  * 1. Initial Exchange of all desired regions.
  * 2. Each process checks whether he owns parts of the desired regions and will save those regions in partners02.
@@ -21,13 +45,16 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
 	const std::array<double, 3> &globalDomainLength, HaloRegion *ownRegion, const std::vector<HaloRegion> &desiredRegions,
 	const MPI_Comm &comm, bool excludeOwnRank) {
 
+	auto start = std::chrono::high_resolution_clock::now();
 	int my_rank{};  // my rank
 	MPI_Comm_rank(comm, &my_rank);
 	int num_processes{};  // the number of processes in comm
 	MPI_Comm_size(comm, &num_processes);
 
 	const auto num_regions = desiredRegions.size();  // the number of regions I would like to acquire from other processes
+	counters["acquireNeighbors - init"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
+	start = std::chrono::high_resolution_clock::now();
 	// tell the other processes how much you are going to send
 	// how many bytes am I going to send to all the other processes
 	const int num_bytes_send =
@@ -54,24 +81,31 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
 		memcpy(outgoingDesiredRegionsVector.data() + bufferPosition, &region.width, sizeof(double));
 		bufferPosition += sizeof(double);
 	}
+	counters["acquireNeighbors - fill send buffer"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
+	start = std::chrono::high_resolution_clock::now();
 	// set up structure information data for the Allgatherv operation
 	// vector of number of bytes I am going to receive
 	std::vector<int> num_bytes_receive_vec(num_processes, 0);
 	MPI_Allgather(&num_bytes_send, 1, MPI_INT, num_bytes_receive_vec.data(), 1, MPI_INT, comm);
+	counters["acquireNeighbors - count global receive bytes (Allgather)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 	// vector of offsets (=displacement in MPI) in the receive buffer
+	start = std::chrono::high_resolution_clock::now();
 	std::vector<int> num_bytes_displacements(num_processes, 0);
 	int num_bytes_receive = 0;
 	for (int j = 0; j < num_processes; j++) {
 		num_bytes_displacements[j] = num_bytes_receive;
 		num_bytes_receive += num_bytes_receive_vec[j];
 	}
+	counters["acquireNeighbors - build stride"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
+	start = std::chrono::high_resolution_clock::now();
 	std::vector<unsigned char> incomingDesiredRegionsVector(num_bytes_receive);  // the incoming byte buffer
 
 	// send your regions
 	MPI_Allgatherv(outgoingDesiredRegionsVector.data(), num_bytes_send, MPI_BYTE, incomingDesiredRegionsVector.data(),
 				   num_bytes_receive_vec.data(), num_bytes_displacements.data(), MPI_BYTE, comm);
+	counters["acquireNeighbors - gather regions (Allgatherv)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
 	std::vector<int> numberOfRegionsToSendToRank(num_processes, 0);       // outgoing row
 
@@ -80,8 +114,8 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
 		sizeof(double) * 3 + sizeof(double) * 3 + sizeof(int) * 3 + sizeof(double) + sizeof(double) * 3;
 	// the regions I own and want to send: ranks<regions<regionData>>
 	std::vector<std::vector<std::vector<unsigned char>>> sendingList(num_processes);
-
 	std::vector<CommunicationPartner> comm_partners02{};
+	start = std::chrono::high_resolution_clock::now();
 	bufferPosition = 0;
 	while (bufferPosition < num_bytes_receive /*== buffer length*/) {
 
@@ -164,7 +198,9 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
 			}
 		}
 	}
+	counters["acquireNeighbors - parse received regions"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
+	start = std::chrono::high_resolution_clock::now();
 	std::vector<std::vector<unsigned char>> merged(num_processes);  // Merge each list of char arrays into one char array
 	for (int j = 0; j < num_processes; j++) {
 		if (numberOfRegionsToSendToRank[j] > 0) {
@@ -177,6 +213,7 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
 			merged[j] = std::move(mergedRegions);
 		}
 	}
+	counters["acquireNeighbors - merge char arrays"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
 	// We cannot know how many regions we are going to receive from each process a-priori.
 	// So we need to figure this out.
@@ -202,11 +239,13 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
 	 * In this case, process 0 will send 3 regions to process 2 and process 1 will send 2 regions to process 3.
 	 * After the Allreduce step every process has the information how many regions it will receive.
 	 */
+	start = std::chrono::high_resolution_clock::now();
 	std::vector<int> numberOfRegionsToReceive(num_processes, 0);  // how many bytes does each process expect?
 	MPI_Allreduce(numberOfRegionsToSendToRank.data(), numberOfRegionsToReceive.data(), num_processes, MPI_INT, MPI_SUM, comm);
+	counters["acquireNeighbors - count regions to send (Allreduce)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
 	// all the information for the final information exchange has been collected -> final exchange
-
+	start = std::chrono::high_resolution_clock::now();
 	std::vector<MPI_Request> requests(num_processes, MPI_REQUEST_NULL);
 	MPI_Status probe_status;
 	MPI_Status rec_status;
@@ -218,7 +257,9 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
 					  &requests[j]);  // tag is one
 		}
 	}
+	counters["acquireNeighbors - send regions to send (Isend)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
+	start = std::chrono::high_resolution_clock::now();
 	std::vector<CommunicationPartner> comm_partners01;  // the communication partners
 
 	// receive data (blocking)
@@ -265,14 +306,18 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
 										 region.offset, enlarged);
 		}
 	}
+	counters["acquireNeighbors - receive regions and deserialize (Get_count, Recv)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
+	start = std::chrono::high_resolution_clock::now();
 	// ensure that all sends have been finished.
 	for (int j = 0; j < num_processes; j++) {
 		if (numberOfRegionsToSendToRank[j] > 0) MPI_Wait(&requests[j], MPI_STATUS_IGNORE);
 	}
+	counters["acquireNeighbors - synchronize (Wait)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
-	// barrier for safety.
+	start = std::chrono::high_resolution_clock::now();
 	MPI_Barrier(comm);
+	counters["acquireNeighbors - barrier (Barrier)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
 	return std::make_tuple(squeezePartners(comm_partners01), squeezePartners(comm_partners02));
 }
diff --git a/src/parallel/NeighborAcquirer.h b/src/parallel/NeighborAcquirer.h
index 0f92e35ce..2f4f6056a 100644
--- a/src/parallel/NeighborAcquirer.h
+++ b/src/parallel/NeighborAcquirer.h
@@ -34,6 +34,8 @@ public:
 
 	static std::vector<CommunicationPartner> squeezePartners(const std::vector<CommunicationPartner>& partners);
 
+	static void dumpCounters();
+
 private:
 	static bool isIncluded(HaloRegion* myRegion, HaloRegion* inQuestion);
 
@@ -51,5 +53,7 @@ private:
 	static std::pair<std::vector<HaloRegion>, std::vector<std::array<double, 3>>> getPotentiallyShiftedRegions(
 							 const std::array<double, 3>& domainLength, const HaloRegion& region);
 
+	static std::map<std::string, long> counters;
+
     friend class NeighborAcquirerTest;
 };
diff --git a/src/parallel/NeighbourCommunicationScheme.cpp b/src/parallel/NeighbourCommunicationScheme.cpp
index ab4190d75..6b7b49b53 100644
--- a/src/parallel/NeighbourCommunicationScheme.cpp
+++ b/src/parallel/NeighbourCommunicationScheme.cpp
@@ -52,6 +52,14 @@ NeighbourCommunicationScheme::NeighbourCommunicationScheme(
 }
 
 NeighbourCommunicationScheme::~NeighbourCommunicationScheme() {
+	int myRank{};
+	MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
+
+	std::ofstream logFile{"countersNeighbourCom_r" + std::to_string(myRank)};
+	for (const auto &[name, value] : counters) {
+		logFile << name << " = " << value << std::endl;
+	}
+	NeighborAcquirer::dumpCounters();
 	if (_pushPull) {
 		delete _haloExportForceImportNeighbours;
 		delete _haloImportForceExportNeighbours;
@@ -433,8 +441,20 @@ void NeighbourCommunicationScheme::selectNeighbours(MessageType msgType, bool im
 
 void DirectNeighbourCommunicationScheme::initCommunicationPartners(double cutoffRadius, Domain * domain,
 		DomainDecompMPIBase* domainDecomp, ParticleContainer* moleculeContainer) {
+	if (counters.empty()) {
+		counters["initArrays"] = 0;
+		counters["clearVectors"] = 0;
+		counters["ownHalo"] = 0;
+		counters["getHaloSize"] = 0;
+		counters["getHaloImportForceExportRegions"] = 0;
+		counters["getLeavingExportRegions"] = 0;
+		counters["globalDomainLength"] = 0;
+		counters["acquireNeighborsHalo"] = 0;
+		counters["acquireNeighborsLeaving"] = 0;
+	}
 	// corners of the process-specific domain
 	static_assert(DIMgeom == 3); // The initialization here assumes 3 dimensions!
+	auto start = std::chrono::high_resolution_clock::now();
 	const std::array<double, DIMgeom> localLowerCorner{
 		domainDecomp->getBoundingBoxMin(0, domain),
 		domainDecomp->getBoundingBoxMin(1, domain),
@@ -447,37 +467,59 @@ void DirectNeighbourCommunicationScheme::initCommunicationPartners(double cutoff
 	};
 
 	if (_pushPull) {
+		start = std::chrono::high_resolution_clock::now();
 		for (unsigned int d = 0; d < _commDimms; d++) { // why free?
 			(*_haloExportForceImportNeighbours)[d].clear();
 			(*_haloImportForceExportNeighbours)[d].clear();
 			(*_leavingExportNeighbours)[d].clear();
 			(*_leavingImportNeighbours)[d].clear();
 		}
+		counters["initArrays"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
 	} else {
 		for (unsigned int d = 0; d < _commDimms; d++) { // why free?
 			(*_neighbours)[d].clear();
 		}
 	}
 
+	start = std::chrono::high_resolution_clock::now();
 	HaloRegion ownRegion = {localLowerCorner[0], localLowerCorner[1], localLowerCorner[2], localUpperCorner[0], localUpperCorner[1], localUpperCorner[2], 0, 0, 0, cutoffRadius};
+	counters["ownRegion"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
 	if (_pushPull) {
+		start = std::chrono::high_resolution_clock::now();
 		double* const cellLength = moleculeContainer->getHaloSize();
+		counters["getHaloSize"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
 		// halo/force regions
+		start = std::chrono::high_resolution_clock::now();
 		std::vector<HaloRegion> haloOrForceRegions =
 			_zonalMethod->getHaloImportForceExportRegions(ownRegion, cutoffRadius, _coversWholeDomain, cellLength);
+		counters["getHaloImportForceExportRegions"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
+		start = std::chrono::high_resolution_clock::now();
 		std::vector<HaloRegion> leavingRegions =
 			_zonalMethod->getLeavingExportRegions(ownRegion, cutoffRadius, _coversWholeDomain);
+		counters["getLeavingExportRegions"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
 
+		start = std::chrono::high_resolution_clock::now();
 		const std::array<double, 3> globalDomainLength{domain->getGlobalLength(0), domain->getGlobalLength(1),
 												 domain->getGlobalLength(2)};
+		counters["globalDomainLength"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
 		// assuming p1 sends regions to p2
+		start = std::chrono::high_resolution_clock::now();
 		std::tie((*_haloImportForceExportNeighbours)[0], (*_haloExportForceImportNeighbours)[0]) =
 			NeighborAcquirer::acquireNeighbors(globalDomainLength, &ownRegion, haloOrForceRegions,
 											   domainDecomp->getCommunicator(), _useSequentialFallback);
+		counters["acquireNeighborsHalo"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
 		// p1 notes reply, p2 notes owned as haloExportForceImport
+		start = std::chrono::high_resolution_clock::now();
 		std::tie((*_leavingExportNeighbours)[0], (*_leavingImportNeighbours)[0]) = NeighborAcquirer::acquireNeighbors(
 			globalDomainLength, &ownRegion, leavingRegions, domainDecomp->getCommunicator(), _useSequentialFallback);
+		counters["acquireNeighborsLeaving"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
 		// p1 notes reply, p2 notes owned as leaving import
 
 	} else {
diff --git a/src/parallel/NeighbourCommunicationScheme.h b/src/parallel/NeighbourCommunicationScheme.h
index a9c3d5ebc..b32171a3f 100644
--- a/src/parallel/NeighbourCommunicationScheme.h
+++ b/src/parallel/NeighbourCommunicationScheme.h
@@ -123,6 +123,10 @@ protected:
 	bool _pushPull;
 
 	bool _useSequentialFallback{true};
+
+public:
+	std::map<std::string, long> counters{};
+
 };
 
 class DirectNeighbourCommunicationScheme: public NeighbourCommunicationScheme {

The consistently highest count is GeneralDomainDecomposition::balanceAndExchange(), and within it the call to DomainDecompMPIBase::exchangeMoleculesMPI() in the non-rebalance path with sendLeavingWithCopies() == true. This, however, is expected, since that path is taken in every non-rebalance iteration.
The second highest is quite consistently the call to initCommPartners in the rebalance branch. Divided by the number of invocations, initCommPartners is several times more expensive per call.

The suspected offender is the repeated all-to-all communication in acquireNeighbors() (esp. MPI_Allgather(&num_bytes_send, 1, MPI_INT, num_bytes_receive_vec.data(), 1, MPI_INT, comm);), which is not executed when using DomainDecomposition.
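
To illustrate why this is costly at scale, here is a standalone microbenchmark of that communication pattern (hypothetical code, not from the repository; it mimics the Allgather/Allgatherv sequence in acquireNeighbors(), whose cost grows with the number of ranks and is paid on every rebalance):

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, numProcs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    const int numBytesSend = 1024;  // stand-in for the serialized region data
    std::vector<unsigned char> sendBuf(numBytesSend, 0);

    const double start = MPI_Wtime();
    // 1) every rank announces how many bytes it will send
    std::vector<int> recvCounts(numProcs, 0);
    MPI_Allgather(&numBytesSend, 1, MPI_INT, recvCounts.data(), 1, MPI_INT, MPI_COMM_WORLD);
    // 2) build displacements and gather all regions from all ranks
    std::vector<int> displs(numProcs, 0);
    int totalBytes = 0;
    for (int i = 0; i < numProcs; ++i) {
        displs[i] = totalBytes;
        totalBytes += recvCounts[i];
    }
    std::vector<unsigned char> recvBuf(totalBytes);
    MPI_Allgatherv(sendBuf.data(), numBytesSend, MPI_BYTE, recvBuf.data(),
                   recvCounts.data(), displs.data(), MPI_BYTE, MPI_COMM_WORLD);
    const double elapsed = MPI_Wtime() - start;

    if (rank == 0) {
        std::printf("all-to-all pattern took %f s on %d ranks\n", elapsed, numProcs);
    }
    MPI_Finalize();
    return 0;
}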
