
Allocating multiple blocks to one MPI rank in LB #5026

Draft: wants to merge 23 commits into base: python
Conversation

@hidekb commented Jan 10, 2025

Description of changes:

  • LB CPU now supports allocating multiple blocks to one MPI rank
    • The default number of blocks per MPI rank is 1
    • The number of blocks per MPI rank is controlled by the blocks_per_mpi_rank argument of LBFluidWalberla

@hidekb hidekb marked this pull request as draft January 10, 2025 18:01
maintainer/benchmarks/lb.py (outdated, resolved)
if (blocks_per_mpi_rank != Utils::Vector3i{{1, 1, 1}}) {
throw std::runtime_error(
"GPU architecture PROHIBITED allocating many blocks to 1 CPU.");
}
Contributor:
"Using more than one block per MPI rank is not supported for GPU LB" (but why, actually?)

Member:
how about "GPU LB only uses 1 block per MPI rank"?

@@ -96,7 +96,7 @@ class BoundaryPackInfo : public PackInfo<GhostLayerField_T> {
WALBERLA_ASSERT_EQUAL(bSize, buf_size);
#endif

auto const offset = std::get<0>(m_lattice->get_local_grid_range());
auto const offset = to_vector3i(receiver->getAABB().min());
Contributor:
Wouldn't it be better to have functions for this in the Lattice class, so they can be used by EK as well?
After all, LB and EK probably need to agree on both the MPI and the block decomposition?
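A shared helper along those lines could look like the following minimal sketch. Here Vector3i stands in for Utils::Vector3i, and local_grid_range is a hypothetical name; the real LatticeWalberla API may differ.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Minimal stand-in for Utils::Vector3i from the ESPResSo core.
using Vector3i = std::array<int, 3>;

// Hypothetical shared helper: given the global grid and this rank's position
// in the MPI cartesian grid, return the local [lower, upper) cell range, so
// that LB and EK can derive block offsets from one common function instead of
// duplicating the arithmetic.
inline std::pair<Vector3i, Vector3i>
local_grid_range(Vector3i const &grid_dimensions, Vector3i const &node_grid,
                 Vector3i const &node_pos) {
  Vector3i lower{};
  Vector3i upper{};
  for (std::size_t i = 0u; i < 3u; ++i) {
    // assumes the grid divides evenly among ranks, as the LB grid does
    auto const cells = grid_dimensions[i] / node_grid[i];
    lower[i] = node_pos[i] * cells;
    upper[i] = lower[i] + cells;
  }
  return {lower, upper};
}
```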

}

auto constexpr lattice_constant = real_t{1};
auto const cells_block = Utils::hadamard_division(grid_dimensions, node_grid);
Contributor:
cells_per_block?

Author:
I changed cells_block to cells_per_block.
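The renamed quantity is just an elementwise division of the global grid by the block grid; a minimal stand-in for Utils::hadamard_division (names and types simplified) could look like:

```cpp
#include <array>

using Vector3i = std::array<int, 3>;

// Elementwise ("Hadamard") division, mimicking Utils::hadamard_division:
// the number of cells per block in each direction is the global grid size
// divided by the number of blocks in that direction. Assumes the division
// is exact, as it must be for a valid block decomposition.
inline Vector3i hadamard_division(Vector3i const &grid_dimensions,
                                  Vector3i const &block_grid) {
  return {grid_dimensions[0] / block_grid[0],
          grid_dimensions[1] / block_grid[1],
          grid_dimensions[2] / block_grid[2]};
}
```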

// number of cells per block in each direction
uint_c(cells_block[0]), uint_c(cells_block[1]), uint_c(cells_block[2]),
lattice_constant,
// number of cpus per direction
uint_c(node_grid[0]), uint_c(node_grid[1]), uint_c(node_grid[2]),
// periodicity
true, true, true);
true, true, true,
// keep global block information
Contributor:
What does this do/mean?

Author:
If "keep global block information" is true, each process keeps information about remote blocks that reside on other processes.

for (auto y = lower_cell.y(); y <= upper_cell.y(); ++y) {
for (auto z = lower_cell.z(); z <= upper_cell.z(); ++z) {
auto const node = local_offset + Utils::Vector3i{{x, y, z}};
auto const index = stride_x * (node[0] - lower_corner[0]) +
Contributor:
Can these index calculations be moved out to separate functions? It looks related to Utils::get_linear_index().

Author:
I rewrote the code, obtaining a linear index using Utils::get_linear_index() and refactoring the code related to set_slice and get_slice.
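The refactoring described above boils down to a row-major linear index over the local slab. A self-contained sketch (the argument order and memory layout of the real Utils::get_linear_index may differ) is:

```cpp
#include <array>
#include <cassert>

using Vector3i = std::array<int, 3>;

// Row-major linear index of `node` in a 3D slab of dimensions `dim`,
// analogous to what Utils::get_linear_index provides in the ESPResSo core:
// z varies fastest, x slowest.
inline int get_linear_index(Vector3i const &node, Vector3i const &dim) {
  assert(node[0] >= 0 && node[0] < dim[0]);
  assert(node[1] >= 0 && node[1] < dim[1]);
  assert(node[2] >= 0 && node[2] < dim[2]);
  return (node[0] * dim[1] + node[1]) * dim[2] + node[2];
}
```

In the slice setters, `node` would be the cell position relative to the lower corner of the slice, e.g. `node = local_offset + {x, y, z} - lower_corner` in the loop quoted above.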

lbm::accessor::Velocity::set(pdf_field, vel_field, force_field, values,
*ci);
int64_t const stride_y = (ci->max().z() - ci->min().z() + 1u);
int64_t const stride_x = (ci->max().y() - ci->min().y() + 1u) * stride_y;
Contributor:
Please document what these strides are for.

Author:
After the refactoring, these strides are no longer used.

lbm::accessor::Velocity::set(pdf_field, vel_field, force_field, values,
*ci);
int64_t const stride_y = (ci->max().z() - ci->min().z() + 1u);
int64_t const stride_x = (ci->max().y() - ci->min().y() + 1u) * stride_y;
Contributor:
Also, these stride calculations appear several times in the cell interval code. Can they be outsourced to a function which is then re-used?

Author:
I rewrote the code following your suggestions. Thus, the code related to set_slice and get_slice became more compact.
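The factored-out stride computation could be sketched as follows. Strides and interval_strides are hypothetical names; waLBerla's CellInterval uses inclusive bounds, which the +1 accounts for.

```cpp
#include <cstdint>

// Strides for flattening a cell interval [min, max] with inclusive bounds:
// stride_y is the number of cells skipped when y advances by one (one z-run),
// stride_x the number skipped when x advances by one (one y-z plane).
struct Strides {
  std::int64_t x;
  std::int64_t y;
};

inline Strides interval_strides(int min_y, int max_y, int min_z, int max_z) {
  auto const stride_y = static_cast<std::int64_t>(max_z - min_z + 1);
  auto const stride_x = static_cast<std::int64_t>(max_y - min_y + 1) * stride_y;
  return {stride_x, stride_y};
}
```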

return v * u


LB_PARAMS = {'agrid': 1.,
Contributor:
Please avoid values of 1 in unit tests, as wrong exponents don't get caught.

Author:
I moved this function.

@RudolfWeeber (Contributor):
Thank you. Looks good in general.
One thing worth looking into is the cell interval business. It looks like there are a lot of very similar loops in the getters and setters for slices. Would it be possible to pull that out into a function, which is then called with different lambdas for the individual cases?
@jngrad could you maybe take a look?

@hidekb commented Jan 15, 2025

> Thank you. Looks good in general. One thing worth looking into is the cell interval business. It looks like there are a lot of very similar loops in the getters and setters for slices. Would it be possible to pull that out into a function, which is then called with different lambdas for the individual cases?

I rewrote the code related to the getters and setters for slices. Similar loops are pulled into a function which calls different lambdas for the individual cases.
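The shared traversal could be sketched like this. for_each_cell and count_cells are hypothetical names; the real getters and setters would pass lambdas that read or write field data instead of counting.

```cpp
#include <array>

using Vector3i = std::array<int, 3>;

// One generic loop over a cell interval [lower, upper] (inclusive bounds),
// with the per-cell work injected as a lambda, so slice getters and setters
// can share the traversal code instead of duplicating three nested loops.
template <typename Kernel>
void for_each_cell(Vector3i const &lower, Vector3i const &upper,
                   Kernel &&kernel) {
  for (int x = lower[0]; x <= upper[0]; ++x)
    for (int y = lower[1]; y <= upper[1]; ++y)
      for (int z = lower[2]; z <= upper[2]; ++z)
        kernel(Vector3i{x, y, z});
}

// Trivial usage example: count the cells in an interval.
inline int count_cells(Vector3i const &lower, Vector3i const &upper) {
  int count = 0;
  for_each_cell(lower, upper, [&count](Vector3i const &) { ++count; });
  return count;
}
```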

maintainer/benchmarks/lb.py (outdated, resolved)

"""
Benchmark Lattice-Boltzmann fluid + Lennard-Jones particles.
"""
Member:
Wouldn't it be more sustainable if we only had one LB benchmark file to maintain? argparse is quite flexible, surely we can come up with a way to select strong vs. weak scaling with command line options?

Author:
I unified lb.py and lb_weakscaling.py by adding a --weak_scaling option to the argparse interface of lb.py.

src/python/espressomd/lb.py (outdated, resolved)
Comment on lines 142 to 143
auto const blocks_per_mpi_rank = get_value_or<Utils::Vector3i>(
params, "blocks_per_mpi_rank", Utils::Vector3i{{1, 1, 1}});
Member:
Here and elsewhere, do we really need a default value, considering we already provide a default value in the python class?

Author:
I added a default value to the python class LatticeWalberla. blocks_per_mpi_rank in LBFluid is now set by get_value<Utils::Vector3i>(m_lattice->get_parameter("blocks_per_mpi_rank")).

if (blocks_per_mpi_rank != Utils::Vector3i{{1, 1, 1}}) {
throw std::runtime_error(
"GPU architecture PROHIBITED allocating many blocks to 1 CPU.");
}
Member:
how about "GPU LB only uses 1 block per MPI rank"?

Comment on lines +72 to +73
return Utils::Vector3i{
{static_cast<int>(v[0]), static_cast<int>(v[1]), static_cast<int>(v[2])}};
Member:
Are these static casts safe? For example, what is the output when the input vector is v = {1.f, 2.f, 6.99999f}? What happens if a vector of doubles is passed? Should we warn the user through assertions if the input floats are not "close enough" to round numbers?

Author:
I added an overload for a vector of doubles, with assertions that fire if the difference between an input value and the nearest integer is larger than 10^(-5), a tolerance which is also debatable.
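A sketch of such a conversion (to_vector3i here is a simplified stand-in, and the 1e-5 tolerance is the debatable one from the thread). Note that a plain static_cast<int>(6.99999) would truncate to 6, which is the pitfall raised in the review; the sketch therefore rounds first and asserts closeness.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vector3i = std::array<int, 3>;
using Vector3d = std::array<double, 3>;

// Convert a vector of doubles to integers, asserting that each component is
// "close enough" to an integer. Rounding (instead of truncating) makes
// inputs like 6.9999999 map to 7 rather than 6.
inline Vector3i to_vector3i(Vector3d const &v) {
  Vector3i out{};
  for (std::size_t i = 0u; i < 3u; ++i) {
    auto const rounded = std::round(v[i]);
    assert(std::abs(v[i] - rounded) < 1e-5 &&
           "input is not close to an integer");
    out[i] = static_cast<int>(rounded);
  }
  return out;
}
```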

testsuite/python/lb.py (outdated, resolved)
testsuite/python/lb_couette_xy.py (outdated, resolved)
@@ -41,7 +41,7 @@ class LBMassCommon:

"""Check the lattice-Boltzmann mass conservation."""

system = espressomd.System(box_l=[3.0, 3.0, 3.0])
system = espressomd.System(box_l=[6.0, 6.0, 6.0])
Member:
Several mass conservation tests are known to take a lot of time, especially in code coverage builds. The runtime can significantly increase when multiple CI jobs run on the same runner, which is then starved of resources. This change may potentially increase the test runtime by a factor of 8. Can you please confirm the runtime did not significantly change in the clang and coverage CI jobs, compared to the python branch?

Author:
I changed box_l from 6 to 4 to reduce the runtime. For testing with blocks_per_mpi_rank (i.e. [1, 1, 2]), box_l = 4 is needed.

testsuite/python/lb_shear.py (outdated, resolved)
Comment on lines 413 to 417
if (upper_corner[0] < block_lower_corner[0] or
upper_corner[1] < block_lower_corner[1] or
upper_corner[2] < block_lower_corner[2]) {
return std::nullopt;
}
Member:
In the past, ESPResSo was affected by subtle bugs where indices in the compared vectors wouldn't be the same on the left-hand side and the right-hand side of the comparison operator. To avoid this type of mistake, one can write here and elsewhere if (not (block_lower_corner > upper_corner)). Comparison operators on Utils::Vector objects internally call Utils::detail::all_of(lhs, rhs, cmp), which is why the lhs and rhs sometimes have to be exchanged.

Author:
I have rewritten it as you suggested.
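The all-of comparison semantics described above can be illustrated with a minimal stand-in (Vec3 and all_of mirror, but are not, the ESPResSo types). With these semantics, the comparison holds only if the predicate holds for every component pair, so whole-vector comparisons replace the error-prone per-index checks.

```cpp
#include <cstddef>
#include <functional>

// Minimal stand-in for Utils::Vector3i.
struct Vec3 {
  int v[3];
};

// All-of semantics, mirroring Utils::detail::all_of in the ESPResSo core:
// the predicate must hold for *every* component pair, which rules out
// silently mixing up indices between the two sides of a comparison.
template <typename Cmp>
bool all_of(Vec3 const &lhs, Vec3 const &rhs, Cmp cmp) {
  for (std::size_t i = 0u; i < 3u; ++i)
    if (!cmp(lhs.v[i], rhs.v[i]))
      return false;
  return true;
}

// lhs > rhs is true only if every component of lhs exceeds that of rhs.
inline bool operator>(Vec3 const &lhs, Vec3 const &rhs) {
  return all_of(lhs, rhs, std::greater<int>{});
}
```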

Comment on lines 431 to 439
Cell const block_lower_cell =
Cell(static_cast<int>(block_lower_corner[0] - block_offset[0]),
static_cast<int>(block_lower_corner[1] - block_offset[1]),
static_cast<int>(block_lower_corner[2] - block_offset[2]));
Cell const block_upper_cell =
Cell(static_cast<int>(block_upper_corner[0] - block_offset[0]),
static_cast<int>(block_upper_corner[1] - block_offset[1]),
static_cast<int>(block_upper_corner[2] - block_offset[2]));
return {CellInterval(block_lower_cell, block_upper_cell)};
Member:
  • Cell const -> auto const, since the rhs already indicates the type.
  • Are the static casts necessary? The results of the operations seem to already be int.
  • Would return {{block_lower_cell, block_upper_cell}}; work here? I never remember the exact rule, but I think the outermost braces should invoke the optional constructor, and the innermost braces should invoke the cell interval constructor. For initializer lists and pointer-valued elements, different rules would apply.
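The brace question can be checked with minimal stand-ins for Cell and CellInterval. In this sketch, the outer braces do construct the std::optional and the inner braces invoke the CellInterval(Cell, Cell) constructor, matching the reviewer's recollection of the rule.

```cpp
#include <optional>

// Minimal stand-ins for waLBerla's Cell and CellInterval.
struct Cell {
  int x, y, z;
};

struct CellInterval {
  Cell min, max;
  CellInterval(Cell lo, Cell hi) : min(lo), max(hi) {}
};

// return {{lo, hi}}: the outer braces list-initialize the std::optional via
// its converting constructor (whose default template argument makes the
// braced-init-list initialize a CellInterval); the inner braces invoke the
// CellInterval(Cell, Cell) constructor.
inline std::optional<CellInterval> make_interval(Cell lo, Cell hi) {
  return {{lo, hi}};
}
```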

3 participants