Conversation
```cmake
# ---------------------------------------------------------------------
# nccl setup
# ---------------------------------------------------------------------
if(GHEX_USE_NCCL)
    link_libraries("-lnccl")
    # include_directories("")
endif()
```
This is a hack. Add a proper FindNCCL.cmake module.
This can be tested with the icon uenv by manually setting `export LIBRARY_PATH=/user-environment/env/default/lib64:/user-environment/env/default/lib`.
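A minimal sketch of what such a module could look like, assuming the conventional `NCCL_ROOT` hint variable and an `NCCL::nccl` imported target (neither exists in GHEX yet):

```cmake
# FindNCCL.cmake (sketch): locate the NCCL header and library and expose
# them as an imported target. Hint variable and target name are assumptions.
find_path(NCCL_INCLUDE_DIR nccl.h
    HINTS ${NCCL_ROOT} ENV NCCL_ROOT
    PATH_SUFFIXES include)
find_library(NCCL_LIBRARY nccl
    HINTS ${NCCL_ROOT} ENV NCCL_ROOT
    PATH_SUFFIXES lib lib64)

include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(NCCL DEFAULT_MSG NCCL_LIBRARY NCCL_INCLUDE_DIR)

if(NCCL_FOUND AND NOT TARGET NCCL::nccl)
    add_library(NCCL::nccl UNKNOWN IMPORTED)
    set_target_properties(NCCL::nccl PROPERTIES
        IMPORTED_LOCATION "${NCCL_LIBRARY}"
        INTERFACE_INCLUDE_DIRECTORIES "${NCCL_INCLUDE_DIR}")
endif()
```

The global `link_libraries("-lnccl")` hack above could then be replaced with a scoped `target_link_libraries(... NCCL::nccl)`.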
```cpp
struct cuda_event {
    cudaEvent_t m_event;
    ghex::util::moved_bit m_moved;

    cuda_event() {
        GHEX_CHECK_CUDA_RESULT(cudaEventCreateWithFlags(&m_event, cudaEventDisableTiming))
    }
    cuda_event(const cuda_event&) = delete;
    cuda_event& operator=(const cuda_event&) = delete;
    cuda_event(cuda_event&& other) = default;
    cuda_event& operator=(cuda_event&&) = default;

    ~cuda_event()
    {
        if (!m_moved)
        {
            GHEX_CHECK_CUDA_RESULT_NO_THROW(cudaEventDestroy(m_event))
        }
    }

    operator bool() const noexcept { return m_moved; }
    operator cudaEvent_t() const noexcept { return m_event; }
    cudaEvent_t& get() noexcept { return m_event; }
    const cudaEvent_t& get() const noexcept { return m_event; }
};
```
```cpp
ghex::util::moved_bit m_moved;
bool m_valid;
communicator_type m_comm;
#ifdef GHEX_USE_NCCL
ncclComm_t m_nccl_comm;
#endif
```
Move this to oomph.
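For reference, the usual way such an NCCL communicator is bootstrapped from an existing MPI communicator (a standard NCCL pattern; `make_nccl_comm` is an assumed helper name, not code from this PR):

```cpp
#include <mpi.h>
#include <nccl.h>

// Rank 0 creates a unique id, broadcasts it over MPI, then every rank
// joins the NCCL communicator with its MPI rank.
ncclComm_t make_nccl_comm(MPI_Comm mpi_comm)
{
    int rank, size;
    MPI_Comm_rank(mpi_comm, &rank);
    MPI_Comm_size(mpi_comm, &size);

    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, mpi_comm);

    ncclComm_t nccl_comm;
    ncclCommInitRank(&nccl_comm, size, id, rank);
    return nccl_comm;
}
```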
```cpp
private:
    template<typename... Archs, typename... Fields>
    void nccl_exchange_impl(buffer_info_type<Archs, Fields>... buffer_infos) {
        pack_nccl();

        ncclGroupStart();
        post_sends_nccl();
        post_recvs_nccl();
        ncclGroupEnd();

        unpack_nccl();
    }
```
Add a customization point or similar to allow doing this with NCCL?
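For context, a hedged sketch of how the grouped posting could look at the NCCL level; `halo`, `post_exchange`, and the parameter names are assumptions for illustration, not this PR's actual code:

```cpp
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

// Illustrative halo descriptor: a device buffer, its byte count, and the peer rank.
struct halo { void* buffer; std::size_t size; int rank; };

// Posting every ncclSend/ncclRecv inside a single group avoids deadlock
// between ranks and lets NCCL fuse all transfers into one launch on the stream.
void post_exchange(const std::vector<halo>& sends, const std::vector<halo>& recvs,
                   ncclComm_t comm, cudaStream_t stream)
{
    ncclGroupStart();
    for (const auto& h : sends)
        ncclSend(h.buffer, h.size, ncclChar, h.rank, comm, stream);
    for (const auto& h : recvs)
        ncclRecv(h.buffer, h.size, ncclChar, h.rank, comm, stream);
    ncclGroupEnd();
}
```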
```cpp
// application data
auto& d = local_domains[0];
ghex::test::util::memory<int> field(d.size()*levels, 0);
#ifndef GHEX_USE_NCCL
```
To do: the tests etc. don't work with NCCL when using host memory. Should we copy to the device when using NCCL, or disallow host memory completely?
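One possible staging approach, as a sketch only (a plain `std::vector` stands in for `ghex::test::util::memory`, and the parameters are assumed):

```cpp
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Mirror the host field in device memory so NCCL only ever sees device
// pointers; copy back afterwards for host-side verification.
void exchange_via_device_copy(std::size_t domain_size, std::size_t levels)
{
    std::vector<int> field(domain_size * levels, 0);
    int* d_field = nullptr;
    cudaMalloc(&d_field, field.size() * sizeof(int));
    cudaMemcpy(d_field, field.data(), field.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    // ... run the exchange on d_field ...
    cudaMemcpy(field.data(), d_field, field.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_field);
}
```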
```cpp
auto h_gpu = co.exchange(patterns(data_gpu));
#ifdef GHEX_USE_NCCL
cudaDeviceSynchronize();
#endif
h_gpu.wait();
```
Introduce another API that allows skipping the blocking wait.
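A minimal sketch of what such a non-blocking completion path could look like, assuming the exchange records its work on a dedicated stream (all names here are illustrative, not part of this PR):

```cpp
#include <cuda_runtime.h>

// Instead of blocking the host in h_gpu.wait(), record an event on the
// stream that carries the exchange's unpack and make the application's
// stream wait on it; the host never blocks.
void chain_streams(cudaStream_t exchange_stream, cudaStream_t app_stream)
{
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    cudaEventRecord(done, exchange_stream);   // fires once prior work completes
    cudaStreamWaitEvent(app_stream, done, 0); // device-side wait only
    cudaEventDestroy(done);                   // resources released after completion
}
```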
I'm keeping this branch around for use with icon4py, but closing this PR by replacing it with a cleaner version.
Opening this just for reference. This first version was just a proof of concept to see what improvement we might see. It hacks NCCL support into `communication_object`. Performance is very good intra-node in icon-exclaim. I haven't tested inter-node performance (which would use libfabric; the icon uenv is still using old versions of NCCL and libfabric, which may not perform so well).

I plan to work on updating this to a cleaner version with an oomph NCCL backend. However, this may need some new customization points etc., as the NCCL communication has to/can be set up a bit differently compared to the MPI/libfabric implementations (pack + send/recv + unpack can all be scheduled in one go).
Based on #184.