Question
Does using NVSHMEMI_THREADGROUP_THREAD as the default scope cause redundant work? When nvshmem_fence() is called by every thread of a warp or block, each thread executes nvshmemi_ibgda_fence() with index_in_scope == 0 and scope_size == 1. As a result, every thread independently iterates over all DCIs and RC QPs and issues its own ibgda_quiet(qp) call for each one. Could this lead to significant performance overhead?
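
To make the concern concrete, here is a small host-side C++ model that only counts how many quiet operations would be issued. It is a sketch, not NVSHMEM code: the names `Scope`, `quiets_for_caller`, and `total_quiets` are hypothetical, and it assumes the fence walks the QPs with a strided loop that starts at index_in_scope and steps by scope_size. It says nothing about how expensive each ibgda_quiet(qp) call actually is.

```cpp
#include <cstddef>

// Hypothetical model of the scopes discussed above; not the NVSHMEM enum.
enum class Scope { Thread, Warp };

// Counts the quiet operations one caller issues, ASSUMING a strided walk
// over the QPs: start at index_in_scope, step by scope_size.
std::size_t quiets_for_caller(std::size_t num_qps,
                              std::size_t index_in_scope,
                              std::size_t scope_size) {
    std::size_t issued = 0;
    for (std::size_t qp = index_in_scope; qp < num_qps; qp += scope_size) {
        ++issued;  // stands in for one ibgda_quiet(qp) call
    }
    return issued;
}

// Sums the quiet operations issued by num_threads callers of the fence.
std::size_t total_quiets(std::size_t num_threads,
                         std::size_t num_qps,
                         Scope scope) {
    std::size_t total = 0;
    for (std::size_t t = 0; t < num_threads; ++t) {
        // Thread scope: every caller sees index 0 and size 1, so it covers
        // every QP by itself. Warp scope (modeled): callers split the QPs.
        const std::size_t index = (scope == Scope::Thread) ? 0 : t;
        const std::size_t size = (scope == Scope::Thread) ? 1 : num_threads;
        total += quiets_for_caller(num_qps, index, size);
    }
    return total;
}
```

Under these assumptions, 32 threads and 8 QPs give `total_quiets(32, 8, Scope::Thread) == 256`, versus 8 if the same threads shared the work at warp scope. The count grows as threads × QPs in the thread-scope case, which is the redundancy I am asking about.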