Skip to content

Conversation

@halungge
Copy link
Contributor

@halungge halungge commented Sep 6, 2024

Decompose (global) grid file:

  • uses pymetis to decompose the global grid (cells) into n patches
  • after decomposition halos for all dimensions (cell, edge, vertex) are constructed. Halo construction is done in a ICON like fashion: They consist halos of 2 cell levels (one upward and one downward pointing line) and the corresponding vertices and edges on these lines.

Omissions:

  • LAM grids need to be investigated further:

    • tests comparing decomposed vs. single_node computation are only run on the global grid.
    • for the LAM grids ICON reorders arrays to arrange halo points on the first boundary layers together with the boundary layers, it should be investigated whether that is essential in the model.
    • This PR does only take this into account on the computation of the start_index and end_index not in the halo construction.
  • the number of halo lines (in terms of cells) is hardcoded to 2, that could be made a parameter.

  • Not sure it all runs on GPU correctly... most probably there are some numpy cupy issues to fix.

Magdalena Luz added 30 commits September 3, 2024 10:37
- module level log
- add builder for decomposition, remove from constructor
- start/end index add workaround for global INTERIOR
remove dtype arg from int fields as it is always the same
# Conflicts:
#	model/common/pyproject.toml
#	model/common/src/icon4py/model/common/decomposition/mpi_decomposition.py
#	model/common/src/icon4py/model/common/dimension.py
#	model/common/src/icon4py/model/common/grid/grid_manager.py
#	model/common/tests/grid_tests/test_grid_manager.py
#	model/testing/src/icon4py/model/testing/datatest_utils.py
#	model/testing/src/icon4py/model/testing/serialbox.py
#	tools/src/icon4py/tools/common/metadata.py
#	tools/src/icon4py/tools/icon4pygen/bindings/locations.py
# Conflicts:
#	model/common/pyproject.toml
@msimberg
Copy link
Contributor

cscs-ci run default

@msimberg
Copy link
Contributor

cscs-ci run distributed

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces domain decomposition of ICON grid files using METIS (pymetis) and constructs ICON-like halo regions for distributed runs. It also refactors grid loading so GridManager can build distributed/local grids (including refinement-domain bounds) and expands MPI-focused test coverage.

Changes:

  • Add METIS-based decomposition and ICON-like halo construction for cell/edge/vertex dimensions.
  • Refactor grid file reading/index transformations and update grid/config/decomposition APIs (horizontal_size, DecompositionInfo.set_dimension, etc.).
  • Add/adjust MPI tests and update existing tests/utilities for the new decomposition/halo behavior.

Reviewed changes

Copilot reviewed 47 out of 49 changed files in this pull request and generated 39 comments.

Show a summary per file
File Description
uv.lock Adds mpi4py/pymetis to lockfile entries.
tach.toml Excludes pymetis from tach scanning.
pyproject.toml Adds mpi4py to typing extras.
model/common/pyproject.toml Adds pymetis to io extra.
tools/src/icon4py/tools/py2fgen/wrappers/common.py Updates grid/decomposition wrapper construction for new APIs.
model/testing/src/icon4py/model/testing/serialbox.py Updates savepoint decomposition/grid construction to new APIs.
model/testing/src/icon4py/model/testing/parallel_helpers.py Simplifies logging helpers; updates decomposition info size reporting.
model/testing/src/icon4py/model/testing/grid_utils.py Switches to GridManager-provided decomposition_info; updates GridManager init.
model/testing/src/icon4py/model/testing/fixtures/benchmark.py Uses grid_manager.decomposition_info instead of constructing it.
model/testing/src/icon4py/model/testing/definitions.py Adds a new torus grid descriptor.
model/standalone_driver/src/icon4py/model/standalone_driver/driver_utils.py Updates GridManager creation signature (but needs follow-up fixes).
model/common/tests/common/metrics/mpi_tests/test_parallel_metrics.py Skips MPI tests when optional MPI deps are missing.
model/common/tests/common/interpolation/mpi_tests/test_parallel_interpolation.py Skips MPI tests when optional MPI deps are missing.
model/common/tests/common/io/unit_tests/test_writers.py Updates horizontal_confighorizontal_size.
model/common/tests/common/io/unit_tests/test_io.py Updates horizontal_confighorizontal_size.
model/common/tests/common/grid/unit_tests/test_icon.py Improves skip-value assertion messages.
model/common/tests/common/grid/unit_tests/test_horizontal.py Adds test for max_boundary_level.
model/common/tests/common/grid/unit_tests/test_gridfile.py Updates GridFile API usage; adds selective read tests.
model/common/tests/common/grid/unit_tests/test_grid_refinement.py Updates domain-bounds tests to use decomposition info; adds more LAM cases.
model/common/tests/common/grid/unit_tests/test_grid_manager.py Adds tests for distributed connectivities and decomposition info; updates GridManager usage.
model/common/tests/common/grid/fixtures.py Introduces shared fixtures module for grid tests.
model/common/tests/common/grid/mpi_tests/utils.py Adds helpers to run grid manager in single/multi-rank modes.
model/common/tests/common/grid/mpi_tests/test_parallel_icon.py Adds distributed-grid skip-value test; refactors existing MPI grid tests.
model/common/tests/common/grid/mpi_tests/test_parallel_grid_refinement.py Adds MPI test for compute_domain_bounds (contains an xfail bug).
model/common/tests/common/grid/mpi_tests/test_parallel_grid_manager.py Adds extensive single-vs-multi rank comparisons and validation tests (contains an assertion bug).
model/common/tests/common/grid/mpi_tests/test_parallel_geometry.py Skips when MPI deps missing; minor refactor.
model/common/tests/common/decomposition/fixtures.py Adds fixture for simple-grid neighbor tables.
model/common/tests/common/decomposition/utils.py Adds simple-grid decomposition/halo reference data and dummy props.
model/common/tests/common/decomposition/unit_tests/test_halo.py Adds unit tests for halo construction and mapping.
model/common/tests/common/decomposition/unit_tests/test_definitions.py Adds tests for new DecompositionInfo behavior (halo masks, distributed detection).
model/common/tests/common/decomposition/mpi_tests/test_parallel_halo.py Adds MPI test ensuring unique ownership across ranks.
model/common/tests/common/decomposition/mpi_tests/test_mpi_decomposition.py Expands MPI decomposition tests; logging tweaks; adds halo-level mask tests.
model/common/src/icon4py/model/common/utils/data_allocation.py Adds array_ns_from_array helper (numpy vs cupy selection).
model/common/src/icon4py/model/common/interpolation/stencils/compute_cell_2_vertex_interpolation.py Updates GT4Py typing annotations for dims.
model/common/src/icon4py/model/common/interpolation/interpolation_factory.py Adjusts provider registration call for nudgecoeffs.
model/common/src/icon4py/model/common/grid/simple.py Updates GridConfig argument name to horizontal_size.
model/common/src/icon4py/model/common/grid/icon.py Adjusts skip-value logic to treat distributed like limited-area.
model/common/src/icon4py/model/common/grid/horizontal.py Adds ordered/halo-domain helpers; adds max_boundary_level; doc updates.
model/common/src/icon4py/model/common/grid/gridfile.py Adds transformation protocol, dynamic/fixed dimensions, selective reads, context manager.
model/common/src/icon4py/model/common/grid/grid_refinement.py Computes domain bounds using decomposition info and halo placement rules.
model/common/src/icon4py/model/common/grid/grid_manager.py Major refactor: decomposed grid construction, local reading, halo integration.
model/common/src/icon4py/model/common/grid/base.py Renames GridConfig field to horizontal_size; uses shared MissingConnectivityError.
model/common/src/icon4py/model/common/exceptions.py Adds ValidationError and MissingConnectivityError; typing improvements.
model/common/src/icon4py/model/common/decomposition/mpi_decomposition.py Updates mpi4py import handling; field slicing based on decomposition info indices.
model/common/src/icon4py/model/common/decomposition/definitions.py Refactors DecompositionInfo to require halo levels and adds halo helpers/flags.
model/common/src/icon4py/model/common/decomposition/decomposer.py Adds METIS and single-node decomposers.
model/common/src/icon4py/model/common/decomposition/halo.py Adds ICON-like halo constructor and global→local connectivity mapping.
model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py Switches owner_mask source to geometry field provider.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +239 to +243
definitions.DecompositionInfo()
# TODO (halungge): last argument is called `decomp_domain` in icon, it is not needed in the granules should we pass it nevertheless?
.set_dimension(dims.CellDim, c_glb_index, c_owner_mask, None)
.set_dimension(dims.EdgeDim, e_glb_index, e_owner_mask, None)
.set_dimension(dims.VertexDim, v_glb_index, v_owner_mask, None)
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DecompositionInfo.set_dimension(...) now requires a halo_levels array; passing None will break later calls like is_distributed(), halo_levels(), and halo_level_mask() (and any exchange logic relying on them). Please pass a properly-sized int array (e.g., OWNED for all entries if halo info is unavailable) instead of None.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would mean changing the fortran interface to pass decomp_domain, so I'd leave this as None for now.

We should check that icon blueline still works correctly and doesn't try to access the Nones before merging though, so I'll leave this comment unresolved until then.

@msimberg
Copy link
Contributor

cscs-ci run default

Comment on lines 21 to 43
# TODO(msimberg): Single rank, not node.
def run_grid_manager_for_singlenode(file: pathlib.Path) -> gm.GridManager:
manager = _grid_manager(file, NUM_LEVELS)
manager(
keep_skip_values=True,
run_properties=decomp_defs.SingleNodeProcessProperties(),
decomposer=decomp.SingleNodeDecomposer(),
allocator=None,
)
return manager


# TODO(msimberg): Fix typos. Consistent naming with above function.
def run_gridmananger_for_multinode(
file: pathlib.Path,
run_properties: decomp_defs.ProcessProperties,
decomposer: decomp.Decomposer,
) -> gm.GridManager:
manager = _grid_manager(file, NUM_LEVELS)
manager(
keep_skip_values=True, allocator=None, run_properties=run_properties, decomposer=decomposer
)
return manager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed these multi_rank and single_rank, but for bikeshedding purposes we could consider just calling them distributed and local. I'm not super convinced by the latter, but distributed is very established and we already use the term in some places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have a number of other utilities that still use single/multinode that should be renamed but I'm not renaming them in this PR. This renaming is already on https://hackmd.io/WljE8xX3STmJL4VkqYpqQw?both.

@github-actions
Copy link

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

Comment on lines +194 to +204
# TODO(msimberg): What should we do about this. (The global) num_cells is
# not guaranteed to be set here when used through fortran. Should we:
# 1. Ignore distributed?
# 2. Compute num_cells with a reduction?
# 3. Use a ProcessProperties to detect it?
distributed = (
config.num_cells < global_properties.num_cells
if global_properties.num_cells is not None
else False
)
limited_area_or_distributed = config.limited_area or distributed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be good to resolve before this PR is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants