Allocating multiple blocks to one MPI rank in LB #5026
base: python
Conversation
… blocks per MPI rank
    if (blocks_per_mpi_rank != Utils::Vector3i{{1, 1, 1}}) {
      throw std::runtime_error(
          "GPU architecture PROHIBITED allocating many blocks to 1 CPU.");
    }
"Using more than one block per MPI rank is not supported for GPU LB" (but why, actually?)
how about "GPU LB only uses 1 block per MPI rank"?
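A minimal sketch of the guard with that suggested wording; the condition and Utils::Vector3i are taken from the diff above, while the wrapper function name and the include path are assumptions:

    #include <stdexcept>

    #include <utils/Vector.hpp> // assumed ESPResSo include path

    // throw if the user requests a block decomposition the GPU LB cannot handle
    inline void
    check_gpu_block_decomposition(Utils::Vector3i const &blocks_per_mpi_rank) {
      if (blocks_per_mpi_rank != Utils::Vector3i{{1, 1, 1}}) {
        throw std::runtime_error("GPU LB only uses 1 block per MPI rank");
      }
    }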
@@ -96,7 +96,7 @@ class BoundaryPackInfo : public PackInfo<GhostLayerField_T> {
     WALBERLA_ASSERT_EQUAL(bSize, buf_size);
 #endif

-    auto const offset = std::get<0>(m_lattice->get_local_grid_range());
+    auto const offset = to_vector3i(receiver->getAABB().min());
Wouldn't it be better to have functions for this in the Lattice class, so they can be used by EK as well? After all, LB and EK probably need to agree about both the MPI and the block decomposition?
    }

    auto constexpr lattice_constant = real_t{1};
    auto const cells_block = Utils::hadamard_division(grid_dimensions, node_grid);
cells_per_block?
I changed cells_block to cells_per_block.
     // number of cells per block in each direction
     uint_c(cells_block[0]), uint_c(cells_block[1]), uint_c(cells_block[2]),
     lattice_constant,
     // number of cpus per direction
     uint_c(node_grid[0]), uint_c(node_grid[1]), uint_c(node_grid[2]),
     // periodicity
-    true, true, true);
+    true, true, true,
+    // keep global block information
What does this do/mean?
If "keep global block information" is true, each process keeps information about remote blocks that reside on other processes.
    for (auto y = lower_cell.y(); y <= upper_cell.y(); ++y) {
      for (auto z = lower_cell.z(); z <= upper_cell.z(); ++z) {
        auto const node = local_offset + Utils::Vector3i{{x, y, z}};
        auto const index = stride_x * (node[0] - lower_corner[0]) +
Can these index calculations be moved out to separate functions? It looks related to Utils::get_linear_index()?
I rewrote the code, obtaining a linear index using Utils::get_linear_index() and refactoring the code related to set_slice and get_slice.
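To make the connection explicit, here is a self-contained sketch of the row-major indexing the original loops computed by hand; the helper below only mirrors what Utils::get_linear_index provides and is not the ESPResSo implementation:

    #include <array>
    #include <cassert>

    // row-major linear index (x slowest, z fastest): (x * Ny + y) * Nz + z
    inline int linear_index(std::array<int, 3> const &node,
                            std::array<int, 3> const &grid_size) {
      return (node[0] * grid_size[1] + node[1]) * grid_size[2] + node[2];
    }

    int main() {
      std::array<int, 3> const grid{{4, 3, 2}};
      // equivalent to stride_x * x + stride_y * y + z with
      // stride_y = Nz and stride_x = Ny * Nz, as in the loop above
      assert(linear_index({{1, 2, 1}}, grid) == (3 * 2) * 1 + 2 * 2 + 1);
    }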
    lbm::accessor::Velocity::set(pdf_field, vel_field, force_field, values, *ci);
    int64_t const stride_y = (ci->max().z() - ci->min().z() + 1u);
    int64_t const stride_x = (ci->max().y() - ci->min().y() + 1u) * stride_y;
Please document what these strides are for.
After the refactoring, the strides are no longer used.
Also, these stride calculations seem to appear several times in the cell interval code. Can that be outsourced to a function which is then re-used?
I rewrote the code following your suggestions. Thus, the code related to set_slice and get_slice became more compact.
    return v * u

    LB_PARAMS = {'agrid': 1.,
Please avoid 1s in unit tests, as wrong exponents don't get caught.
I moved this function.
Thank you. Looks good in general.
I rewrote the code related to the getters and setters for slices. Similar loops are pulled into a function which calls different lambdas for the individual cases.
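A hedged sketch of that pattern: one traversal helper visits every cell of an interval and dispatches to a caller-supplied lambda, so slice getters and setters differ only in the kernel they pass in. The interval struct below is a stand-in, not the actual waLBerla CellInterval:

    struct Interval3D {
      int x_min, y_min, z_min;
      int x_max, y_max, z_max;
    };

    // shared loop for slice getters and setters; per-case work lives in the lambda
    template <typename Kernel>
    void for_each_cell(Interval3D const &ci, Kernel &&kernel) {
      for (int x = ci.x_min; x <= ci.x_max; ++x)
        for (int y = ci.y_min; y <= ci.y_max; ++y)
          for (int z = ci.z_min; z <= ci.z_max; ++z)
            kernel(x, y, z);
    }

    // usage:
    // for_each_cell(ci, [&](int x, int y, int z) { field.set(x, y, z, v); });
    // for_each_cell(ci, [&](int x, int y, int z) { out.push_back(field.get(x, y, z)); });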
""" | ||
Benchmark Lattice-Boltzmann fluid + Lennard-Jones particles. | ||
""" |
Wouldn't it be more sustainable if we only had one LB benchmark file to maintain? argparse is quite flexible, surely we can come up with a way to select strong vs. weak scaling with command line options?
I unified lb.py and lb_weakscaling.py by adding the option --weak_scaling to the argparse interface of lb.py.
    auto const blocks_per_mpi_rank = get_value_or<Utils::Vector3i>(
        params, "blocks_per_mpi_rank", Utils::Vector3i{{1, 1, 1}});
Here and elsewhere, do we really need a default value, considering we already provide a default value in the python class?
I added a default value for the Python class LatticeWalberla. And blocks_per_mpi_rank in LBFluid is set by get_value<Utils::Vector3i>(m_lattice->get_parameter("blocks_per_mpi_rank")).
    return Utils::Vector3i{
        {static_cast<int>(v[0]), static_cast<int>(v[1]), static_cast<int>(v[2])}};
Are these static casts safe? For example, what is the output when the input vector is v = {1.f, 2.f, 6.99999f}? What happens if a vector of doubles is passed? Should we warn the user through assertions if the input floats are not "close enough" to round numbers?
I added an overload for a vector of doubles, and assertions that trigger if the difference between the input values and round numbers is larger than 10^(-5), which is also debatable.
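A minimal sketch of such a checked conversion, assuming the 10^(-5) tolerance from the comment; the function name, the assertion style, and the include path are illustrative, not the PR's actual code:

    #include <cassert>
    #include <cmath>

    #include <utils/Vector.hpp> // assumed ESPResSo include path

    inline Utils::Vector3i to_vector3i_checked(Utils::Vector3d const &v) {
      Utils::Vector3i out{};
      for (auto i = 0u; i < 3u; ++i) {
        auto const rounded = std::round(v[i]);
        // assert the input is "close enough" to a round number
        assert(std::abs(v[i] - rounded) < 1e-5);
        out[i] = static_cast<int>(rounded);
      }
      return out;
    }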
@@ -41,7 +41,7 @@ class LBMassCommon:

     """Check the lattice-Boltzmann mass conservation."""

-    system = espressomd.System(box_l=[3.0, 3.0, 3.0])
+    system = espressomd.System(box_l=[6.0, 6.0, 6.0])
Several mass conservation tests are known to take a lot of time, especially in code coverage builds. The runtime can significantly increase when multiple CI jobs run on the same runner, which is then starved of resources. This change may potentially increase the test runtime by a factor of 8. Can you please confirm the runtime did not significantly change in the clang and coverage CI jobs, compared to the python branch?
I changed the box_l from 6 to 4 to reduce the runtime. For testing with blocks_per_mpi_rank (e.g. [1, 1, 2]), box_l = 4 is needed.
    if (upper_corner[0] < block_lower_corner[0] or
        upper_corner[1] < block_lower_corner[1] or
        upper_corner[2] < block_lower_corner[2]) {
      return std::nullopt;
    }
In the past, ESPResSo was affected by subtle bugs where indices in the compared vectors wouldn't be the same on the left-hand side and right-hand side of the comparison operator. To avoid this type of mistake, one can write here and elsewhere if (not (block_lower_corner > upper_corner)). Comparison operators on Utils::Vector objects internally call Utils::detail::all_of(lhs, rhs, cmp), which is why the lhs and rhs sometimes have to be exchanged.
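For illustration, a self-contained repro of that element-wise comparison idiom; the all_of helper below mirrors (but is not) Utils::detail::all_of, and the negated <= test is one way to collapse the three hand-written comparisons above into a single vector comparison:

    #include <array>
    #include <cassert>
    #include <cstddef>
    #include <functional>

    template <typename T, std::size_t N, typename Cmp>
    bool all_of(std::array<T, N> const &lhs, std::array<T, N> const &rhs, Cmp cmp) {
      for (std::size_t i = 0u; i < N; ++i)
        if (!cmp(lhs[i], rhs[i]))
          return false;
      return true;
    }

    int main() {
      std::array<int, 3> const block_lower_corner{{0, 0, 2}};
      std::array<int, 3> const upper_corner{{4, 4, 1}};
      // any(upper[i] < lower[i]) is the negation of all(lower[i] <= upper[i])
      bool const disjoint = !all_of(block_lower_corner, upper_corner,
                                    std::less_equal<int>{});
      assert(disjoint); // the z component of upper_corner lies below the block
    }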
I have rewritten it as you suggested.
    Cell const block_lower_cell =
        Cell(static_cast<int>(block_lower_corner[0] - block_offset[0]),
             static_cast<int>(block_lower_corner[1] - block_offset[1]),
             static_cast<int>(block_lower_corner[2] - block_offset[2]));
    Cell const block_upper_cell =
        Cell(static_cast<int>(block_upper_corner[0] - block_offset[0]),
             static_cast<int>(block_upper_corner[1] - block_offset[1]),
             static_cast<int>(block_upper_corner[2] - block_offset[2]));
    return {CellInterval(block_lower_cell, block_upper_cell)};
- Cell const -> auto const, since the rhs already indicates the type.
- Are the static casts necessary? The result of the operations seems to already be int.
- Would return {{block_lower_cell, block_upper_cell}}; work here? I never remember the exact rule, but I think the outermost braces should invoke the optional constructor, and the innermost braces should invoke the CellInterval constructor. For initializer lists and pointer-valued elements, different rules would apply.
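On the last bullet, a stand-alone repro one could compile to check the brace rule; Interval and Cell3 are stand-ins for the real waLBerla types, and my reading (worth verifying) is that std::optional's converting constructor, whose template parameter defaults to the value type, lets the inner braces construct the value:

    #include <optional>

    struct Cell3 { int x, y, z; };

    struct Interval {
      Cell3 lo, hi;
      Interval(Cell3 lo, Cell3 hi) : lo{lo}, hi{hi} {}
    };

    std::optional<Interval> make_interval(Cell3 lo, Cell3 hi) {
      // outer braces -> optional, inner braces -> Interval(Cell3, Cell3)
      return {{lo, hi}};
    }

    int main() {
      auto const opt = make_interval({0, 0, 0}, {4, 4, 4});
      static_cast<void>(opt);
    }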
Description of changes:
- added blocks_per_mpi_rank for LBFluidWalberla