@msimberg
Contributor

@msimberg msimberg commented Jan 28, 2026

Adds gtfn_gpu and dace_gpu backends to the distributed CI pipeline.

The base image is upgraded because it is now possible; the upgrade is not strictly necessary. The CPU-only version of the pipeline needed 25.04 (24.04 and 25.10 did not work for various reasons). However, since OpenMPI and libfabric are now built manually in the container, the base image version is less of a constraint. 24.04 doesn't have matching GCC/CUDA versions and 26.04 doesn't exist yet, but the pipeline should eventually use 26.04.

OpenMPI and libfabric are built manually for Slingshot support because getting the Ubuntu repository packages to work with GPU support did not seem possible, or at least not easy. The installation is based on https://github.com/eth-cscs/cray-network-stack.

GHEX needs an upgrade, because there's a bug in how strides are calculated for GPU buffers. @philip-paul-mueller has already fixed this in ghex-org/GHEX#190 but we should wait for that to be merged (and probably test in icon-exclaim first).

This also fixes a few CuPy/NumPy incompatibilities. revert_repeated_index_to_invalid was updated to only deal with NumPy for now, as the connectivities are always NumPy arrays. test_halo_exchange_for_sparse_field is marked embedded_only. The non-MPI test was already marked embedded_only.
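
For readers unfamiliar with the NumPy-only restriction mentioned above, here is a minimal sketch of what such a helper could look like. It is not the code from this PR: the function name `mark_repeated_indices_invalid`, the sentinel `INVALID_INDEX = -1`, and the assumed behaviour (repeated entries within a connectivity row are replaced by the sentinel) are all illustrative assumptions.

```python
import numpy as np

# Assumed sentinel for "no neighbour"; the project's real constant may differ.
INVALID_INDEX = -1


def mark_repeated_indices_invalid(connectivity: np.ndarray) -> np.ndarray:
    """Return a copy in which repeated entries of each row are set to INVALID_INDEX.

    Hypothetical stand-in for illustration only, not the project's implementation.
    """
    # Reject non-NumPy input (for example a CuPy array) explicitly instead of
    # relying on implicit host/device conversion.
    if not isinstance(connectivity, np.ndarray):
        raise TypeError(
            f"expected a numpy.ndarray, got {type(connectivity).__name__}"
        )

    result = connectivity.copy()
    for row in result:  # each row is a view, so writes land in `result`
        # Positions of the first occurrence of every distinct value in the row.
        _, first_positions = np.unique(row, return_index=True)
        repeated = np.ones(row.shape, dtype=bool)
        repeated[first_positions] = False
        row[repeated] = INVALID_INDEX
    return result
```

Restricting the helper to NumPy keeps it free of array-namespace dispatch; a caller holding device data would have to convert explicitly before calling it.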

This does not try to unify the default and distributed CI pipeline definitions. That should, however, be done sooner or later as well.

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg msimberg force-pushed the distributed-tests-dace-gpu branch from f310db9 to fb3927e Compare January 28, 2026 15:04
@msimberg msimberg force-pushed the distributed-tests-dace-gpu branch from fb3927e to 1fd6389 Compare January 28, 2026 15:16
@msimberg
Contributor Author

cscs-ci run distributed

@msimberg msimberg force-pushed the distributed-tests-dace-gpu branch from ab4ac8f to 8f04d36 Compare January 29, 2026 13:38
@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

msimberg commented Feb 3, 2026

cscs-ci run distributed

@msimberg
Contributor Author

msimberg commented Feb 4, 2026

cscs-ci run distributed

  [tool.uv.sources]
  dace = {index = "gridtools"}
- ghex = {git = "https://github.com/msimberg/GHEX.git", branch = "async-mpi"}
+ ghex = {git = "https://github.com/philip-paul-mueller/GHEX.git", branch = "phimuell__async-mpi-2"}
Contributor Author

This is updated because ghex-org/GHEX#190 contains a bugfix to how strides are computed for GPU buffers. Tests fail with master and async-mpi. We should get ghex-org/GHEX#190 merged ASAP to be able to use GHEX master here.

@msimberg
Contributor Author

msimberg commented Feb 4, 2026

cscs-ci run distributed

@msimberg
Contributor Author

msimberg commented Feb 5, 2026

This is ready for reviews, but not ready for merging due to the GHEX update.

@msimberg msimberg marked this pull request as ready for review February 5, 2026 08:43
@jcanton
Contributor

jcanton commented Feb 5, 2026

sorry ;-)

@github-actions

github-actions bot commented Feb 5, 2026

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

@jcanton
Contributor

jcanton commented Feb 5, 2026

cscs-ci run distributed

@jcanton
Contributor

jcanton commented Feb 5, 2026

cscs-ci run default

@msimberg msimberg marked this pull request as draft February 9, 2026 08:57
@msimberg
Contributor Author

msimberg commented Feb 9, 2026

I've marked this as a draft until #980 is merged. That PR should also update the ghex commit to one new enough for this PR.
