Feature: Device messages#1832

Open
bechols97 wants to merge 46 commits into develop from feature/bechols97/device_messages

Conversation

@bechols97

@bechols97 bechols97 commented Apr 28, 2025

Summary

This PR adds a feature for device messages (i.e., storing function arguments on the device to be handled by a host callback at a later time). The original idea for the feature is to enable better error handling on the device by handling any output with a host callback. This moves the code from Camp to RAJA due to its dependency on atomic operations.

Design review

For the design, there are some open questions (regarding these, please see the Design notes below):

  1. Currently, this design supports an MPSC model, with the expectation that the device side will produce messages while a single host thread consumes them.
  2. Additionally, before consuming messages, the current implementation forces the stream to synchronize.
  3. If there are more messages than the buffer size, those messages are lost. This avoids waits on the device side. However, would it be beneficial to have a configurable option to use a circular buffer?

Design notes

Based on discussion from a meeting:

  • As suggested in the meeting, there should be policies for the message queue to support different use cases and to allow the message queue to be extended for any future use cases.
  • Based on current use cases, the default policy of the message queue should support an MPSC model, wait_all should synchronize the stream, and additional messages should not be stored when the buffer is full.
  • The message_handler class that stores the callback should be move-only to prevent accidental copies into a lambda, while supporting a view-like queue that can be copied to device kernels (as mentioned below by @MrBurmark).

Additional design notes

Based on further discussions from a meeting:

  • Ability to get range of messages (so that user can sort)
  • Ability to unsubscribe a callback
  • Ability to bind args for a callback, such as source location or strings.
    • If moving to C++20, std::bind_front can be used. However, if needed, this can be added in a future PR
  • Store number of missed messages
    • Posting messages returns a bool. Users can use this to get number of missed messages.
  • Remove duplicate messages
    • This would likely be a separate queue policy, which can be added in a future PR
  • Move interface to something similar to the Reducer interface, where kernel/forall loop gives correct resource.

  - This moves message container to RAJA to avoid dependency
    issues with required atomic operation. Currently, testing and
    waiting for messages will block until the stream is synchronized.
@MrBurmark
Member

I think the design would benefit from a view-like object that can be used in device kernels.

@artv3
Member

artv3 commented May 11, 2025

@bechols97 , can you add an example in the examples folder? I have a use case where this may be handy. In my application, if a thread gets a negative value, I want to take note and output it at the end of the kernel from the root rank. Currently I am using printf, and every thread that encounters the negative value spews information onto the screen. To double-check, would this be a use case?

@bechols97
Author

Hi @artv3, yes that could be a use case for this. I will add an example of something similar to the examples folder.

bechols97 added 4 commits May 12, 2025 09:04
  - This allows the message queue to be passed to RAJA kernels
  - This also allows the message queue to be allocated with pinned
    memory when needed
  - Currently, example requires XNACK with HIP. (message queue should be using
    pinned memory so need to look into this)
@rcarson3
Member

@bechols97 so one of the libraries I maintain has a failure macro that, on the host side, throws an error with some useful error messages associated with it, which can provide useful context for users / developers about why something failed. Do you think we could emulate something like that with this framework, if we could provide the absolute max size of the char array / string literal that we'd like and then, at the error site, have that passed into your message class?

@MrBurmark
Member

I think it would be possible to add a fixed capacity string-like object that could be passed through this interface.

@bechols97
Author

@rcarson3 Yes, the original intention behind this idea is for it to be used as a device-side error handler (though it was left more generic in case there are other use cases). As @MrBurmark mentioned, it would be possible to use a fixed-string object with the message handler. For string literals, you would want some type of fixed-string object / char array to store the string that is later passed to a host-side callback.

In addition to the fixed-string object, any type that is trivially destructible should work as well.

@bechols97 bechols97 requested a review from a team June 3, 2025 20:10
@MrBurmark
Member

MrBurmark commented Jun 5, 2025

The current state looks good if you're trying to handle a single kind of error.

RAJA::resources::Host res{};

auto logger = RAJA::message_handler<void(int*, int, int)>(num_messages, res, 
  [](int* ptr, int idx, int value) {
    std::cout << "\n pointer " << ptr << " a[" << idx << "] = " << value << "\n";
  }
);

auto cpu_msg_queue = logger.get_queue<RAJA::mpsc_queue>();
RAJA::forall<RAJA::seq_exec>(res, RAJA::RangeSegment(0, N), [=] (int i) {
  if (a[i] < 0) { 
    cpu_msg_queue.try_post_message(a, i, a[i]); 
  }
});

logger.wait_all();

@MrBurmark
Member

There are use cases where we might want to handle multiple kinds of errors each with different data in the same loop. Does anyone else have such a use case? What do you think of a slightly more general interface that looks more like this?

By moving the signature to the queues we don't know the message sizes upfront. So I'm not sure if it makes sense to do the sizing upfront or later when we know what kinds of messages are possible.

RAJA::resources::Host res{};

auto logger = RAJA::message_handler(res, num_bytes);

auto queue1 = logger.get_queue<RAJA::mpsc_queue, void(int, int*, int)>(num_messages, res,
  [](int idx, int* a, int val_a) {
    std::cout << "\n Oh no! a{" << a << "}[" << idx << "] = " << val_a << "\n";
  }
);

auto queue2 = logger.get_queue<RAJA::mpsc_queue, void(int, int*, int, double*, double)>(num_messages, res,
  [](int idx, int* a, int val_a, double* b, double val_b) {
    std::cout << "\n Inconceivable! a{" << a << "}[" << idx << "] = " << val_a <<
                             " and  b{" << b << "}[" << idx << "] = " << val_b << "\n";
  }
);

RAJA::forall<RAJA::seq_exec>(res, RAJA::RangeSegment(0, N), [=] (int i) {
  if (a[i] < 0) { 
    queue1.try_post_message(i, a, a[i]); 
  }
  if (a[i] == 0 && b[i] < 0) { 
    queue2.try_post_message(i, a, a[i], b, b[i]); 
  }
});

logger.wait_all();

@MrBurmark
Member

Another use case that I'm considering is having a long-lived logger with an allocation. Then I can enqueue multiple types of messages into it while keeping the GPU running, and check for messages occasionally to avoid extra synchronizations.
I am not sure this is feasible, however, as most of our logging use cases involve catching an error and stopping. If we did continue running, we would likely encounter a hard error like a seg fault later.

@bechols97
Author

bechols97 commented Jun 5, 2025

Being able to support multiple error/logging messages within the same loop is definitely a use case that we would want to support. This is something that the library I help maintain uses.

There are a couple of concerns with moving the callback to be a parameter of the get_queue function:

  • The callback is currently stored as a std::function in message_handler. If the callback moves to get_queue, it could still technically be stored in message_handler, but this would likely require additional virtual functions, which could increase the overhead on the host side. This seems reasonable for the most part, since the main use cases are debugging/error handling, where the callbacks aren't expected to be called often.
  • If the callback is stored in msg_queue, then queue1 and queue2 would have ownership of the std::function. This would then be copied to the RAJA kernel and likely run into compilation issues in the device execution spaces.

Just to show another option with the current interface (please note this example is not entirely the same and requires some additional types to be created; however, one could create a type similar to std::inplace_vector<.., 5> with std::variant to store up to 5 arguments with better type safety):

union msg_arg {
  int i;
  double d;
};

enum msg_type
{
  MSG_INT,
  MSG_DBL
};

RAJA::resources::Host res{};

auto logger = RAJA::message_handler<void(msg_type, int, void*, msg_arg)>(res, num_msg, 
  [] (msg_type type, int idx, void* ptr, msg_arg val) {
    if (type == msg_type::MSG_INT) {
      std::cout << "\n Oh no! a{" << ptr << "}[" << idx << "] = " << val.i << "\n";
    } else if (type == msg_type::MSG_DBL) {
      std::cout << "\n Inconceivable!  b{" << ptr << "}[" << idx << "] = " << val.d << "\n";
    }
});
auto queue = logger.get_queue<RAJA::mpsc_queue>();

RAJA::forall<RAJA::seq_exec>(host, RAJA::RangeSegment(0, N), [=] (int i) { 
  if (a[i] < 0) { 
    queue.try_post_message(msg_type::MSG_INT, i, a, a[i]); 
  }
  if (a[i] == 0 && b[i] < 0) { 
    queue.try_post_message(msg_type::MSG_DBL, i, b, b[i]); 
  }
});

logger.wait_all();

@rhornung67 rhornung67 marked this pull request as ready for review September 30, 2025 16:56
@artv3
Member

artv3 commented Nov 5, 2025

@bechols97 , can we bring this up to date?

  - Also, fixes examples to not use stream 2 memory in stream 1
  - This allows messages to be sorted, filtered, etc.
…LLNL/RAJA into feature/bechols97/device_messages

std::cout << "\n Running RAJA omp_parallel_for_static_exec (default chunksize) vector addition...\n";

RAJA::forall<RAJA::omp_parallel_for_static_exec< >>(host, RAJA::RangeSegment(0, N),
Member


I think you may have extra < > in the policy?

Member


oh I see, it has a default chunk size that you can modify:

template<int ChunkSize = default_chunk_size>
using omp_parallel_for_static_exec =
    omp_parallel_exec<omp_for_schedule_exec<omp::Static<ChunkSize>>>;

Author


The chunk size doesn't really matter in this example. So, if you prefer having an explicit value in the < > to be more clear, then I don't mind updating this.

Comment on lines +220 to +231
#if defined(RAJA_ENABLE_CUDA)
RAJA::resources::Cuda res_gpu1;
RAJA::resources::Cuda res_gpu2;
using EXEC_POLICY = RAJA::cuda_exec_async<GPU_BLOCK_SIZE>;
#elif defined(RAJA_ENABLE_HIP)
RAJA::resources::Hip res_gpu1;
RAJA::resources::Hip res_gpu2;
using EXEC_POLICY = RAJA::hip_exec_async<GPU_BLOCK_SIZE>;
#elif defined(RAJA_ENABLE_SYCL)
RAJA::resources::Sycl res_gpu1;
RAJA::resources::Sycl res_gpu2;
using EXEC_POLICY = RAJA::sycl_exec<GPU_BLOCK_SIZE>;
Member


I think you can simplify this by templating the resource on policy type. See:

auto res = RAJA::resources::get_default_resource<policy>();

Author


For the purpose of this example, the hope was to show that this feature works with non-default resources as well as with multiple GPU resources. However, there are two examples showing multiple non-default resources. If preferred, I can update the first example to only use the default stream (since most use cases will likely use RAJA's default stream)?

@bechols97 bechols97 requested a review from a team March 31, 2026 15:59
};

template<typename Fn>
struct get_signature;
Contributor


We should move these into a generic function_signature_helper.hpp header. See https://github.com/llnl/RAJA/pull/1949/changes#diff-72533564b1cbd49c320bfd7981489ca5e5a08143353955c70c3ef535e38fc4ccR56. Arturo recently added similar methods for deducing the index of template parameters. These types of metaprogramming utilities are broadly useful and should live in a more visible header

Contributor


It would be nice to move both to a generic header, maybe in internal. I think it's good to consolidate stuff like this so we don't end up re-implementing the methods when we need it. For example we have type_trait helper headers like https://github.com/llnl/RAJA/blob/develop/include/RAJA/pattern/kernel/TypeTraits.hpp

Author


Is there a preferred location for these types of metaprogramming utilities? Would https://github.com/llnl/RAJA/tree/develop/include/RAJA/util be a good place for the more generic header?

Contributor


I think so yes, see for example https://github.com/llnl/RAJA/blob/develop/include/RAJA/util/EnableIf.hpp is already there. I would try to move Arturo's helpers from github.com/llnl/RAJA/pull/1949/changes#diff-72533564b1cbd49c320bfd7981489ca5e5a08143353955c70c3ef535e38fc4ccR56 there as well, and name it FunctionSignatureUtil.hpp or something

~message_bus() { reset(); }

// Copy ctor/operator
message_bus(const message_bus&) = delete;
Contributor


I think we might also need message_bus(message_bus&&) = delete;

/// Currently, this forces a synchronize prior to calling
/// the callback function or testing if there are any messages.
///
class message_manager
Contributor


Do we want this casing for the new classes, @llnl/raja-core? RAJA::kernel uses the MessageManager-style naming convention for classes.

template<typename Callable>
void subscribe(msg_id id, Callable&& c)
{
auto callback = RAJA::msg_callback {std::forward<Callable>(c)};
Contributor


it might be better to use an explicit constructor here

{
auto& fn_list = m_callback_map.at(id);
auto it = std::find_if(fn_list.begin(), fn_list.end(), [](const auto& fn) {
return typeid(Callable).hash_code() == fn->hash();
Contributor


I don't know if this is how we want to be hashing values. For one, I think it's possible to have hash collisions here, so find_if could just match the first of several possible matches. I think it might be better to hash together the msg_id with std::type_index, and just make m_callback_map a std::unordered_map<HashCode, std::vector<msg_callback_t>>

size_type new_sz = old_sz + msg_sz;
local_sz = old_sz; // offset to start of message
// Checks if fits in queue
if (new_sz <= m_container->m_capacity)
Contributor


do we ever resize the m_container? this could be a race condition if so

Author


The mpsc_queue and spsc_queue are view-like containers over the owning message_bus. All resizing operations on message_bus force the resource in message_bus to synchronize prior to resizing, and any currently stored messages are discarded.

#include "RAJA/util/resource.hpp"

/*
* Vector Addition Example
Member


Maybe we can rename it to "RAJA::messages example" and update the description
