
Allocating multiple blocks to one MPI rank in LB #5026

Draft: wants to merge 23 commits into base: python
Conversation

@hidekb commented Jan 10, 2025

Description of changes:

  • LB CPU now supports allocating multiple blocks to one MPI rank
    • The default number of blocks per MPI rank is 1
    • The number of blocks per MPI rank is controlled by the blocks_per_mpi_rank argument of LBFluidWalberla

@hidekb hidekb marked this pull request as draft January 10, 2025 18:01
maintainer/benchmarks/lb.py (outdated, resolved)
if (blocks_per_mpi_rank != Utils::Vector3i{{1, 1, 1}}) {
throw std::runtime_error(
"GPU architecture PROHIBITED allocating many blocks to 1 CPU.");
}
Contributor:
"Using more than one block per MPI rank is not supported for GPU LB" (but why, actually?)

Member:
how about "GPU LB only uses 1 block per MPI rank"?

@@ -96,7 +96,7 @@ class BoundaryPackInfo : public PackInfo<GhostLayerField_T> {
WALBERLA_ASSERT_EQUAL(bSize, buf_size);
#endif

auto const offset = std::get<0>(m_lattice->get_local_grid_range());
auto const offset = to_vector3i(receiver->getAABB().min());
Contributor:
Wouldn't it be better to have functions for this in the Lattice class, so they can be used by EK as well?
After all, LB and EK probably need to agree on both the MPI and the block decomposition?
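A shared helper along those lines could look like the following minimal sketch. Here Vector3i stands in for Utils::Vector3i, and local_grid_range is a hypothetical name; the real LatticeWalberla API may differ.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Minimal stand-in for Utils::Vector3i from the ESPResSo core.
using Vector3i = std::array<int, 3>;

// Hypothetical shared helper: given the global grid and this rank's position
// in the MPI cartesian grid, return the local [lower, upper) cell range, so
// that LB and EK can derive block offsets from one common function instead of
// duplicating the arithmetic.
inline std::pair<Vector3i, Vector3i>
local_grid_range(Vector3i const &grid_dimensions, Vector3i const &node_grid,
                 Vector3i const &node_pos) {
  Vector3i lower{};
  Vector3i upper{};
  for (std::size_t i = 0u; i < 3u; ++i) {
    // assumes the grid divides evenly among ranks, as the LB grid does
    auto const cells = grid_dimensions[i] / node_grid[i];
    lower[i] = node_pos[i] * cells;
    upper[i] = lower[i] + cells;
  }
  return {lower, upper};
}
```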

}

auto constexpr lattice_constant = real_t{1};
auto const cells_block = Utils::hadamard_division(grid_dimensions, node_grid);
Contributor:
cells_per_block?

Author:
I changed cells_block to cells_per_block.
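The renamed quantity is just an elementwise division of the global grid by the block grid; a minimal stand-in for Utils::hadamard_division (names and types simplified) could look like:

```cpp
#include <array>

using Vector3i = std::array<int, 3>;

// Elementwise ("Hadamard") division, mimicking Utils::hadamard_division:
// the number of cells per block in each direction is the global grid size
// divided by the number of blocks in that direction. Assumes the division
// is exact, as it must be for a valid block decomposition.
inline Vector3i hadamard_division(Vector3i const &grid_dimensions,
                                  Vector3i const &block_grid) {
  return {grid_dimensions[0] / block_grid[0],
          grid_dimensions[1] / block_grid[1],
          grid_dimensions[2] / block_grid[2]};
}
```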

// number of cells per block in each direction
uint_c(cells_block[0]), uint_c(cells_block[1]), uint_c(cells_block[2]),
lattice_constant,
// number of cpus per direction
uint_c(node_grid[0]), uint_c(node_grid[1]), uint_c(node_grid[2]),
// periodicity
true, true, true);
true, true, true,
// keep global block information
Contributor:
What does this do/mean?

Author:
If "keep global block information" is true, each process keeps information about remote blocks that reside on other processes.

for (auto y = lower_cell.y(); y <= upper_cell.y(); ++y) {
for (auto z = lower_cell.z(); z <= upper_cell.z(); ++z) {
auto const node = local_offset + Utils::Vector3i{{x, y, z}};
auto const index = stride_x * (node[0] - lower_corner[0]) +
Contributor:
Can these index calculations be moved out to separate functions? It looks related to Utils::get_linear_index().

Author:
I rewrote the code, obtaining a linear index using Utils::get_linear_index() and refactoring the code related to set_slice and get_slice.
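The refactoring described above boils down to a row-major linear index over the local slab. A self-contained sketch (the argument order and memory layout of the real Utils::get_linear_index may differ) is:

```cpp
#include <array>
#include <cassert>

using Vector3i = std::array<int, 3>;

// Row-major linear index of `node` in a 3D slab of dimensions `dim`,
// analogous to what Utils::get_linear_index provides in the ESPResSo core:
// z varies fastest, x slowest.
inline int get_linear_index(Vector3i const &node, Vector3i const &dim) {
  assert(node[0] >= 0 && node[0] < dim[0]);
  assert(node[1] >= 0 && node[1] < dim[1]);
  assert(node[2] >= 0 && node[2] < dim[2]);
  return (node[0] * dim[1] + node[1]) * dim[2] + node[2];
}
```

In the slice setters, `node` would be the cell position relative to the lower corner of the slice, e.g. `node = local_offset + {x, y, z} - lower_corner` in the loop quoted above.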

lbm::accessor::Velocity::set(pdf_field, vel_field, force_field, values,
*ci);
int64_t const stride_y = (ci->max().z() - ci->min().z() + 1u);
int64_t const stride_x = (ci->max().y() - ci->min().y() + 1u) * stride_y;
Contributor:
Please document what these strides are for.

Author:
After the refactoring, these strides are no longer used.

lbm::accessor::Velocity::set(pdf_field, vel_field, force_field, values,
*ci);
int64_t const stride_y = (ci->max().z() - ci->min().z() + 1u);
int64_t const stride_x = (ci->max().y() - ci->min().y() + 1u) * stride_y;
Contributor:
Also, these stride calculations appear several times in the cell interval code. Can they be outsourced to a function which is then re-used?

Author:
I rewrote the code following your suggestions. Thus, the code related to set_slice and get_slice became more compact.
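The factored-out stride computation could be sketched as follows. Strides and interval_strides are hypothetical names; waLBerla's CellInterval uses inclusive bounds, which the +1 accounts for.

```cpp
#include <cstdint>

// Strides for flattening a cell interval [min, max] with inclusive bounds:
// stride_y is the number of cells skipped when y advances by one (one z-run),
// stride_x the number skipped when x advances by one (one y-z plane).
struct Strides {
  std::int64_t x;
  std::int64_t y;
};

inline Strides interval_strides(int min_y, int max_y, int min_z, int max_z) {
  auto const stride_y = static_cast<std::int64_t>(max_z - min_z + 1);
  auto const stride_x = static_cast<std::int64_t>(max_y - min_y + 1) * stride_y;
  return {stride_x, stride_y};
}
```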

return v * u


LB_PARAMS = {'agrid': 1.,
Contributor:
Please avoid values of 1 in unit tests, as wrong exponents don't get caught.

Author:
I moved this function.

@RudolfWeeber (Contributor):
Thank you. Looks good in general.
One thing worth looking into is the cell interval business. It looks like there are a lot of very similar loops in the getters and setters for slices. Would it be possible to pull that out into a function, which is then called with different lambdas for the individual cases?
@jngrad could you maybe take a look?

@hidekb commented Jan 15, 2025

> Thank you. Looks good in general. One thing worth looking into is the cell interval business. It looks like there are a lot of very similar loops in the getters and setters for slices. Would it be possible to pull that out into a function, which is then called with different lambdas for the individual cases?

I rewrote the code related to the getters and setters for slices. Similar loops are pulled into a function which calls different lambdas for the individual cases.
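The shared traversal could be sketched like this. for_each_cell and count_cells are hypothetical names; the real getters and setters would pass lambdas that read or write field data instead of counting.

```cpp
#include <array>

using Vector3i = std::array<int, 3>;

// One generic loop over a cell interval [lower, upper] (inclusive bounds),
// with the per-cell work injected as a lambda, so slice getters and setters
// can share the traversal code instead of duplicating three nested loops.
template <typename Kernel>
void for_each_cell(Vector3i const &lower, Vector3i const &upper,
                   Kernel &&kernel) {
  for (int x = lower[0]; x <= upper[0]; ++x)
    for (int y = lower[1]; y <= upper[1]; ++y)
      for (int z = lower[2]; z <= upper[2]; ++z)
        kernel(Vector3i{x, y, z});
}

// Trivial usage example: count the cells in an interval.
inline int count_cells(Vector3i const &lower, Vector3i const &upper) {
  int count = 0;
  for_each_cell(lower, upper, [&count](Vector3i const &) { ++count; });
  return count;
}
```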

maintainer/benchmarks/lb.py (outdated, resolved)

"""
Benchmark Lattice-Boltzmann fluid + Lennard-Jones particles.
"""
Member:
Wouldn't it be more sustainable if we only had one LB benchmark file to maintain? argparse is quite flexible, surely we can come up with a way to select strong vs. weak scaling with command line options?

Author:
I unified lb.py and lb_weakscaling.py by adding a --weak_scaling option to the argparse interface of lb.py.

src/python/espressomd/lb.py (outdated, resolved)
Comment on lines 142 to 143
auto const blocks_per_mpi_rank = get_value_or<Utils::Vector3i>(
params, "blocks_per_mpi_rank", Utils::Vector3i{{1, 1, 1}});
Member:
Here and elsewhere, do we really need a default value, considering we already provide a default value in the python class?

Author:
I added a default value to the python class LatticeWalberla. blocks_per_mpi_rank in LBFluid is now set by get_value<Utils::Vector3i>(m_lattice->get_parameter("blocks_per_mpi_rank")).

if (blocks_per_mpi_rank != Utils::Vector3i{{1, 1, 1}}) {
throw std::runtime_error(
"GPU architecture PROHIBITED allocating many blocks to 1 CPU.");
}
Member:
how about "GPU LB only uses 1 block per MPI rank"?

Comment on lines +72 to +73
return Utils::Vector3i{
{static_cast<int>(v[0]), static_cast<int>(v[1]), static_cast<int>(v[2])}};
Member:
Are these static casts safe? For example, what is the output when the input vector is v = {1.f, 2.f, 6.99999f}? What happens if a vector of doubles is passed? Should we warn the user through assertions if the input floats are not "close enough" to round numbers?

Author:
I added an overload for a vector of doubles, with assertions that fire if the difference between an input value and the nearest integer is larger than 10^(-5), a tolerance which is also debatable.
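A sketch of such a conversion (to_vector3i here is a simplified stand-in, and the 1e-5 tolerance is the debatable one from the thread). Note that a plain static_cast<int>(6.99999) would truncate to 6, which is the pitfall raised in the review; the sketch therefore rounds first and asserts closeness.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vector3i = std::array<int, 3>;
using Vector3d = std::array<double, 3>;

// Convert a vector of doubles to integers, asserting that each component is
// "close enough" to an integer. Rounding (instead of truncating) makes
// inputs like 6.9999999 map to 7 rather than 6.
inline Vector3i to_vector3i(Vector3d const &v) {
  Vector3i out{};
  for (std::size_t i = 0u; i < 3u; ++i) {
    auto const rounded = std::round(v[i]);
    assert(std::abs(v[i] - rounded) < 1e-5 &&
           "input is not close to an integer");
    out[i] = static_cast<int>(rounded);
  }
  return out;
}
```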

testsuite/python/lb.py (outdated, resolved)
testsuite/python/lb_couette_xy.py (outdated, resolved)
@@ -41,7 +41,7 @@ class LBMassCommon:

"""Check the lattice-Boltzmann mass conservation."""

system = espressomd.System(box_l=[3.0, 3.0, 3.0])
system = espressomd.System(box_l=[6.0, 6.0, 6.0])
Member:
Several mass conservation tests are known to take a lot of time, especially in code coverage builds. The runtime can significantly increase when multiple CI jobs run on the same runner, which is then starved of resources. This change may potentially increase the test runtime by a factor of 8. Can you please confirm the runtime did not significantly change in the clang and coverage CI jobs, compared to the python branch?

Author:
I changed box_l from 6 to 4 to reduce the runtime. For testing with blocks_per_mpi_rank (i.e. [1, 1, 2]), box_l = 4 is needed.

testsuite/python/lb_shear.py (outdated, resolved)
Comment on lines 413 to 417
if (upper_corner[0] < block_lower_corner[0] or
upper_corner[1] < block_lower_corner[1] or
upper_corner[2] < block_lower_corner[2]) {
return std::nullopt;
}
Member:
In the past, ESPResSo was affected by subtle bugs where indices in the compared vectors wouldn't be the same on the left-hand side and the right-hand side of the comparison operator. To avoid this type of mistake, one can write here and elsewhere if (not (block_lower_corner > upper_corner)). Comparison operators on Utils::Vector objects internally call Utils::detail::all_of(lhs, rhs, cmp), which is why the lhs and rhs sometimes have to be exchanged.

Author:
I have rewritten it as you suggested.
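The all-of comparison semantics described above can be illustrated with a minimal stand-in (Vec3 and all_of mirror, but are not, the ESPResSo types). With these semantics, the comparison holds only if the predicate holds for every component pair, so whole-vector comparisons replace the error-prone per-index checks.

```cpp
#include <cstddef>
#include <functional>

// Minimal stand-in for Utils::Vector3i.
struct Vec3 {
  int v[3];
};

// All-of semantics, mirroring Utils::detail::all_of in the ESPResSo core:
// the predicate must hold for *every* component pair, which rules out
// silently mixing up indices between the two sides of a comparison.
template <typename Cmp>
bool all_of(Vec3 const &lhs, Vec3 const &rhs, Cmp cmp) {
  for (std::size_t i = 0u; i < 3u; ++i)
    if (!cmp(lhs.v[i], rhs.v[i]))
      return false;
  return true;
}

// lhs > rhs is true only if every component of lhs exceeds that of rhs.
inline bool operator>(Vec3 const &lhs, Vec3 const &rhs) {
  return all_of(lhs, rhs, std::greater<int>{});
}
```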

Comment on lines 431 to 439
Cell const block_lower_cell =
Cell(static_cast<int>(block_lower_corner[0] - block_offset[0]),
static_cast<int>(block_lower_corner[1] - block_offset[1]),
static_cast<int>(block_lower_corner[2] - block_offset[2]));
Cell const block_upper_cell =
Cell(static_cast<int>(block_upper_corner[0] - block_offset[0]),
static_cast<int>(block_upper_corner[1] - block_offset[1]),
static_cast<int>(block_upper_corner[2] - block_offset[2]));
return {CellInterval(block_lower_cell, block_upper_cell)};
Member:
  • Cell const -> auto const, since the rhs already indicates the type.
  • Are the static casts necessary? The results of the operations seem to already be int.
  • Would return {{block_lower_cell, block_upper_cell}}; work here? I never remember the exact rule, but I think the outermost braces should invoke the optional constructor, and the innermost braces should invoke the cell interval constructor. For initializer lists and pointer-valued elements, different rules would apply.
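The brace question can be checked with minimal stand-ins for Cell and CellInterval. In this sketch, the outer braces do construct the std::optional and the inner braces invoke the CellInterval(Cell, Cell) constructor, matching the reviewer's recollection of the rule.

```cpp
#include <optional>

// Minimal stand-ins for waLBerla's Cell and CellInterval.
struct Cell {
  int x, y, z;
};

struct CellInterval {
  Cell min, max;
  CellInterval(Cell lo, Cell hi) : min(lo), max(hi) {}
};

// return {{lo, hi}}: the outer braces list-initialize the std::optional via
// its converting constructor (whose default template argument makes the
// braced-init-list initialize a CellInterval); the inner braces invoke the
// CellInterval(Cell, Cell) constructor.
inline std::optional<CellInterval> make_interval(Cell lo, Cell hi) {
  return {{lo, hi}};
}
```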

3 participants