Why does cuda::atomic::store(memory_order_seq_cst) generate a relaxed store instead of a release store? #3827

admbbs · 2025-02-16T05:48:52Z

I noticed that CCCL implements cuda::atomic::store(memory_order_seq_cst) as a fence.sc followed by a relaxed store. This can be confirmed with this code snippet.

But the ASPLOS_2019 PTX Memory Model paper states in section 4.2 that a release store is necessary:

One particular mapping required extra attention: .release annotations are not redundant with a leading fence.sc, even though they may seem to be.

Are there any new developments on this, or is it an implementation error?

Thanks a lot!

——————————
Update: I just found that a subsequent release store in the same thread is no longer treated as part of the release sequence.

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0982r1.html

Does this have anything to do with this topic?

gonzalobg · 2025-02-18T13:43:49Z

gonzalobg
Feb 18, 2025
Collaborator

Very good question!

First, libcu++ atomics currently rely on implementation details which, in currently supported platforms, enable libcu++ to lower:

- sequentially-consistent stores to fence.sc; st.relaxed; instead of fence.sc; st.release;
- sequentially-consistent RMWs to fence.sc; atom.acquire; instead of fence.sc; atom.acq_rel;

libcu++ is closely tied to the implementation (CUDA Toolkit, compiler, driver, HW), and if the above changes, we'll update it accordingly.
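
For readers who want to see the shape of the two store lowerings listed above, here is a minimal host-side sketch. It uses std::atomic in plain C++ as a stand-in, not cuda::atomic or PTX, so it says nothing about what libcu++ emits internally or about the PTX memory model. The helper names (store_fence_relaxed, store_fence_release, message_passing_once) are hypothetical and exist only for illustration. The one thing it does show is that, under the ISO C++ rules, a seq_cst fence followed by a relaxed store still publishes earlier writes to a thread that acquire-loads the stored value.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: "full fence, then relaxed store".
// Host-side analogue of the fence.sc; st.relaxed; shape.
inline void store_fence_relaxed(std::atomic<int>& a, int v) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    a.store(v, std::memory_order_relaxed);
}

// Hypothetical helper: "full fence, then release store".
// Host-side analogue of the fence.sc; st.release; shape.
inline void store_fence_release(std::atomic<int>& a, int v) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    a.store(v, std::memory_order_release);
}

// Message-passing check: the producer writes a plain payload and then
// raises a flag with the fence + relaxed-store variant; the consumer
// spins on an acquire load of the flag and then reads the payload.
inline int message_passing_once() {
    int payload = 0;            // plain, non-atomic data
    std::atomic<int> flag{0};
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        store_fence_relaxed(flag, 1);
    });
    std::thread consumer([&] {
        while (flag.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        // In ISO C++, the fence sequenced before the flag store
        // synchronizes with the acquire load that observed it,
        // so this read is ordered after the payload write.
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Swapping store_fence_relaxed for store_fence_release in the producer gives the other shape; in this simple flag-passing pattern the C++ model orders the payload read after the write in both cases.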

Second, you are totally right that the current expansion is not correct according to the model published in the ASPLOS ’19 paper, or the PTX Atomics ABI which is what we require external SW to follow. We’ve actually considered this a bug in the ASPLOS ’19 memory model and the ABI for a while, and although we haven’t gotten to it yet, we intend to update the model formalism to reflect the fact that the mapping with the relaxed store is sound in practice.

admbbs Feb 18, 2025
Author

Thanks for this marvelous answer. It looks like the GPU has an atomic-operations mapping that quite closely resembles the one POWER has.

https://www.cl.cam.ac.uk/%7Epes20/cpp/cpp0xmappings.html
