Conversation

Contributor

@Artemy-Mellanox Artemy-Mellanox commented Oct 20, 2025

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced WQE reservation mechanism with improved error recovery
  • Refactor

    • Optimized TX completion queue configuration
    • Streamlined device endpoint initialization
    • Introduced utilities for improved completion event handling
    • Simplified flow control checks

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR implements a "Collapsed CQ" (Completion Queue) architecture for the UCT GDA (GPU Direct Async) transport layer. The refactoring consolidates completion queue management by moving from a per-endpoint, thread-based CQE consumption model to a direct, on-demand CQE polling approach. Key changes include: (1) reducing the CQ length from tx_qp_len * UCT_IB_MLX5_MAX_BB to 1 and marking it with UCT_IB_MLX5_CQ_IGNORE_OVERRUN, (2) removing per-endpoint state tracking fields (cqe_ci, sq_wqe_pi, producer_index, avail_count) from device endpoint structures, (3) replacing the dedicated progress thread with inline CQE reading functions (uct_rc_mlx5_gda_read_cqe, uct_rc_mlx5_gda_calc_pi), and (4) introducing an optimistic WQE reservation mechanism with atomicCAS-based rollback to handle resource contention. The test infrastructure is updated to validate only ready_index instead of the removed fields. This architectural shift simplifies the code by eliminating separate consumer index tracking and reducing memory footprint, while enabling all threads to independently determine resource availability by directly querying the collapsed CQ.
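The optimistic reservation described in point (4) can be pictured with a host-side C++ analogue. This is a sketch only: the struct and function names and the capacity formula are illustrative assumptions, and std::atomic stands in for the CUDA atomicAdd/atomicCAS used in the actual device code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side analogue of the optimistic WQE reservation with rollback.
struct sim_ep {
    std::atomic<uint64_t> sq_rsvd_index{0}; // next free slot (monotonic)
    uint64_t sq_wqe_num = 8;                // ring capacity in WQEs
    uint64_t hw_pi      = 0;                // WQEs completed by hardware
};

// Highest base at which 'count' slots still fit in the ring.
static uint64_t max_alloc_base(const sim_ep &ep, unsigned count) {
    return ep.hw_pi + ep.sq_wqe_num - count;
}

// Returns true and sets 'base' on success; false when out of resources.
static bool reserve_wqe(sim_ep &ep, unsigned count, uint64_t &base) {
    base = ep.sq_rsvd_index.fetch_add(count);   // optimistic grab
    uint64_t max_base = max_alloc_base(ep, count);
    while (base > max_base) {
        uint64_t next = base + count;
        // Roll back only if no later thread advanced the counter further.
        if (ep.sq_rsvd_index.compare_exchange_strong(next, base))
            return false;                       // rollback succeeded
        max_base = max_alloc_base(ep, count);   // hardware may have progressed
    }
    return true;
}
```

Note that if the CAS fails and the hardware producer index never advances, the loop spins, which is exactly the livelock concern raised in the review below.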

Important Files Changed

Filename Score Overview
src/uct/ib/mlx5/gdaki/gdaki.cuh 2/5 Replaces avail_count tracking with direct CQE polling and adds optimistic atomicCAS-based WQE reservation with rollback; contains potential race condition and unbounded retry loop
src/uct/ib/mlx5/gdaki/gdaki.c 3/5 Reduces CQ length to 1, sets IGNORE_OVERRUN flag, and removes per-endpoint SQ management fields from device endpoint initialization
src/uct/ib/mlx5/gdaki/gdaki_dev.h 3/5 Removes per-endpoint cqe_ci, sq_wqe_pi, and avail_count fields; adds cq_lock for shared CQ synchronization
test/gtest/ucp/cuda/test_kernels.cu 4/5 Removes tracking of producer_index and avail_count from kernel state collection logic, keeping only ready_index
test/gtest/ucp/test_ucp_device.cc 4/5 Removes assertions for producer_index and avail_count in test validation, keeping only ready_index check
test/gtest/ucp/cuda/test_kernels.h 4/5 Removes producer_index and avail_count fields from result struct, simplifying to status and ready_index only

Confidence score: 2/5

  • This PR requires careful review due to potential race conditions and synchronization issues in the collapsed CQ implementation
  • Score reflects concerns about the non-atomic read at line 149 of gdaki.cuh (a potential race between the read and atomicAdd), the unbounded retry loop in the rollback path (lines 176-184) that could cause livelock under high contention, and the removal of explicit flow-control mechanisms without clear replacement guarantees. The IGNORE_OVERRUN flag indicates an intentional relaxation of CQ overflow protection that needs thorough validation
  • Pay close attention to src/uct/ib/mlx5/gdaki/gdaki.cuh for the WQE reservation logic and synchronization correctness, and verify that the single-CQE design in src/uct/ib/mlx5/gdaki/gdaki.c handles all completion scenarios without data loss

Sequence Diagram

sequenceDiagram
    participant User
    participant Host
    participant UCT_GDAKI
    participant CUDA_Driver
    participant GPU_Kernel
    participant HW_QP as Hardware QP
    participant HW_CQ as Hardware CQ (Collapsed)

    User->>Host: Initialize GDAKI Interface
    Host->>UCT_GDAKI: uct_rc_gdaki_iface_init()
    UCT_GDAKI->>CUDA_Driver: cuDeviceGet()
    UCT_GDAKI->>CUDA_Driver: cuDevicePrimaryCtxRetain()
    UCT_GDAKI->>CUDA_Driver: cuMemAlloc() for atomic buffer
    UCT_GDAKI->>UCT_GDAKI: ibv_reg_mr() for atomic buffer
    UCT_GDAKI-->>Host: Interface ready

    User->>Host: Create Endpoint
    Host->>UCT_GDAKI: uct_rc_gdaki_ep_init()
    UCT_GDAKI->>CUDA_Driver: cuCtxPushCurrent()
    UCT_GDAKI->>CUDA_Driver: cuMemAlloc() for dev_ep (counters, CQ, WQ)
    UCT_GDAKI->>UCT_GDAKI: mlx5dv_devx_umem_reg()
    UCT_GDAKI->>UCT_GDAKI: uct_ib_mlx5_devx_create_cq_common()
    Note over UCT_GDAKI,HW_CQ: Create collapsed CQ in GPU memory
    UCT_GDAKI->>UCT_GDAKI: uct_ib_mlx5_devx_create_qp_common()
    UCT_GDAKI->>CUDA_Driver: cuMemHostRegister() for UAR
    UCT_GDAKI->>CUDA_Driver: cuMemHostGetDevicePointer() for DB
    UCT_GDAKI->>CUDA_Driver: cuMemcpyHtoD() to initialize dev_ep
    UCT_GDAKI->>CUDA_Driver: cuCtxPopCurrent()
    UCT_GDAKI-->>Host: Endpoint ready

    User->>Host: Connect Endpoint
    Host->>UCT_GDAKI: uct_rc_gdaki_ep_connect_to_ep_v2()
    UCT_GDAKI->>UCT_GDAKI: uct_rc_mlx5_iface_common_devx_connect_qp()
    UCT_GDAKI-->>Host: Connection established

    User->>Host: Launch GPU Kernel for PUT operation
    Host->>GPU_Kernel: ucp_test_kernel<level>()
    GPU_Kernel->>GPU_Kernel: uct_rc_mlx5_gda_reserv_wqe()
    Note over GPU_Kernel: Atomically reserve WQE slots
    GPU_Kernel->>GPU_Kernel: uct_rc_mlx5_gda_wqe_prepare_put_or_atomic()
    GPU_Kernel->>GPU_Kernel: doca_gpu_dev_verbs_store_wqe_seg()
    Note over GPU_Kernel: Write WQE to GPU memory
    GPU_Kernel->>GPU_Kernel: uct_rc_mlx5_gda_db()
    GPU_Kernel->>GPU_Kernel: __threadfence()
    GPU_Kernel->>GPU_Kernel: atomicCAS to update sq_ready_index
    GPU_Kernel->>HW_QP: uct_rc_mlx5_gda_ring_db()
    Note over GPU_Kernel,HW_QP: Ring doorbell via GPU-mapped UAR
    GPU_Kernel->>HW_QP: uct_rc_mlx5_gda_update_dbr()
    Note over GPU_Kernel,HW_QP: Update doorbell record

    HW_QP->>HW_QP: Process RDMA WRITE
    HW_QP->>HW_CQ: Write CQE to GPU memory
    Note over HW_CQ: CQ is in GPU-accessible memory

    GPU_Kernel->>GPU_Kernel: uct_rc_mlx5_gda_ep_check_completion()
    GPU_Kernel->>HW_CQ: uct_rc_mlx5_gda_read_cqe()
    Note over GPU_Kernel,HW_CQ: Read CQE directly from GPU
    GPU_Kernel->>GPU_Kernel: uct_rc_mlx5_gda_calc_pi()
    GPU_Kernel->>GPU_Kernel: Compare wqe_idx with completion
    GPU_Kernel-->>Host: Return completion status

    User->>Host: Cleanup Endpoint
    Host->>UCT_GDAKI: uct_rc_gdaki_ep_cleanup()
    UCT_GDAKI->>CUDA_Driver: cuMemHostUnregister()
    UCT_GDAKI->>UCT_GDAKI: uct_ib_mlx5_devx_destroy_qp_common()
    UCT_GDAKI->>UCT_GDAKI: uct_ib_mlx5_devx_destroy_cq_common()
    UCT_GDAKI->>UCT_GDAKI: mlx5dv_devx_umem_dereg()
    UCT_GDAKI->>CUDA_Driver: cuMemFree()
    UCT_GDAKI-->>Host: Cleanup complete

6 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +149 to 151
if (ep->sq_rsvd_index > max_wqe_base) {
return UCT_RC_GDA_RESV_WQE_NO_RESOURCE;
}

logic: race condition: ep->sq_rsvd_index read without atomic protection could see stale value, leading to incorrect reservation checks

Contributor Author

Maybe add READ_ONCE to make sure the value is not cached?
A lock would be too expensive in this case.

Contributor

Is it fixed?
I used a two-counter approach to fix these two issues and make the code simpler. Maybe we can still use that?
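For reference, one common shape of a two-counter scheme is sketched below. This is purely a hypothetical interpretation of the suggestion above, not code from this PR: a monotonic reservation counter plus a completion counter advanced from CQEs, with no rollback loop because a failed attempt never moves the reservation counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct two_ctr_ep {
    std::atomic<uint64_t> reserved{0};   // total slots ever handed out
    std::atomic<uint64_t> completed{0};  // total slots retired by hardware
    uint64_t capacity = 8;               // ring size in WQEs
};

static bool try_reserve(two_ctr_ep &ep, unsigned count, uint64_t &base) {
    uint64_t r = ep.reserved.load();
    do {
        if (r + count > ep.completed.load() + ep.capacity)
            return false;                // would overrun the ring
        // CAS advances 'reserved' only if no other thread raced us.
    } while (!ep.reserved.compare_exchange_weak(r, r + count));
    base = r;
    return true;
}
```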

Comment on lines +176 to +184
while (wqe_base > max_wqe_base) {
uint64_t wqe_next = wqe_base + count;
if (atomicCAS(reinterpret_cast<unsigned long long*>(&ep->sq_rsvd_index),
wqe_next, wqe_base) == wqe_next) {
return UCT_RC_GDA_RESV_WQE_NO_RESOURCE;
}

max_wqe_base = uct_rc_mlx5_gda_max_alloc_wqe_base(ep, count);
}

logic: potential livelock: if max_wqe_base keeps getting updated by concurrent operations, threads could spin indefinitely in this rollback loop, which has no bounded retry limit

Contributor Author

Eventually the updated max_wqe_base will make the condition wqe_base > max_wqe_base false, and the loop exits.

Comment on lines 137 to 139
uct_rc_mlx5_gda_read_cqe(ep, &wqe_cnt, NULL);
pi = uct_rc_mlx5_gda_calc_pi(ep, wqe_cnt);
return pi + ep->sq_wqe_num + 1 - count;

style: recalculating CQE data on every iteration adds overhead; consider caching if called frequently in tight loops

Contributor Author

Each iteration should see an updated wqe_cnt - how could we cache it?

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR implements a "collapsed completion queue" (CQ) architecture for the UCX GDA (GPU Direct Async) InfiniBand transport layer. The changes eliminate the separate progress thread and shared atomic counters (avail_count, producer_index) that previously tracked send queue state on the host side. Instead, GPU device kernels now directly poll completion queue entries (CQEs) during work queue entry (WQE) reservation to compute available slots on-demand. The endpoint device structure (uct_rc_gdaki_dev_ep_t) has been streamlined by removing redundant hardware indices (sq_wqe_pi, cqe_ci, avail_count), consolidating state into reservation indices (sq_rsvd_index, sq_ready_index) and a minimal TX CQ (size 1). Test infrastructure was updated to match the new result structure that now returns only status and ready_index. This refactor reduces host-GPU synchronization overhead and memory footprint but shifts complexity into device-side atomic reservation logic with rollback loops.

Important Files Changed

Filename Score Overview
test/gtest/ucp/cuda/test_kernels.h 4/5 Removed producer_index and avail_count fields from kernel result structure, simplifying test API
test/gtest/ucp/test_ucp_device.cc 4/5 Removed validation of producer_index and avail_count, retaining only ready_index check
test/gtest/ucp/cuda/test_kernels.cu 4/5 Removed intermediate state tracking for producer_index and avail_count in kernel state capture
src/uct/ib/mlx5/gdaki/gdaki.c 3/5 Collapsed TX CQ to size 1 and removed initialization of queue tracking fields in device endpoint
src/uct/ib/mlx5/gdaki/gdaki_dev.h 3/5 Removed sq_wqe_pi, cqe_ci, and avail_count fields from device endpoint structure
src/uct/ib/mlx5/gdaki/gdaki.cuh 2/5 Replaced avail_count tracking with direct CQE polling and added atomic rollback loop for reservations

Confidence score: 2/5

  • This PR introduces significant architectural changes to critical GPU-device path logic that could cause silent queue corruption or livelock under race conditions
  • Score reflects multiple concurrency issues: non-atomic reads of sq_rsvd_index creating race conditions, unbounded rollback loops that can livelock when threads contend, and fixed CQE address reads that may yield stale data in circular buffers
  • Pay close attention to src/uct/ib/mlx5/gdaki/gdaki.cuh where the reservation and CQE polling logic has been rewritten, and verify the atomic semantics around sq_rsvd_index access patterns across all device-side code paths


6 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 110 to 121
UCS_F_DEVICE void uct_rc_mlx5_gda_read_cqe(uct_rc_gdaki_dev_ep_t *ep,
uint16_t *wqe_cnt, uint8_t *opcode)
{
auto *cqe64 = reinterpret_cast<mlx5_cqe64*>(ep->cqe_daddr);
uint32_t *data_ptr = (uint32_t*)&cqe64->wqe_counter;
uint32_t data = READ_ONCE(*data_ptr);

*wqe_cnt = uct_rc_mlx5_gda_bswap16(data);
if (opcode != NULL) {
*opcode = data >> 28;
}
}

logic: CQE always read from fixed location ep->cqe_daddr; if this is a circular buffer (typical for CQs), the code must index by cqe_ci % cqe_num, otherwise stale or incorrect CQEs will be read. Does ep->cqe_daddr always point to the current CQE (updated elsewhere), or is it a base address requiring indexing?

Contributor Author

In the collapsed CQ case the queue consists of a single CQE element, so it should be OK.
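The single 32-bit read in uct_rc_mlx5_gda_read_cqe can be mimicked on the host. The byte layout below is an illustrative assumption mirroring the tail of mlx5_cqe64: bytes 0-1 hold the big-endian wqe_counter, byte 2 the signature, byte 3 op_own (opcode in the high nibble); a little-endian host is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side sketch of the single-read CQE parse (not the UCX source).
static void parse_cqe(const uint8_t bytes[4], uint16_t *wqe_cnt,
                      uint8_t *opcode)
{
    uint32_t data;
    std::memcpy(&data, bytes, sizeof(data));        // one 32-bit read
    uint16_t lo = static_cast<uint16_t>(data);      // big-endian wqe_counter
    *wqe_cnt = static_cast<uint16_t>((lo >> 8) | (lo << 8)); // swap to host
    if (opcode != nullptr)
        *opcode = static_cast<uint8_t>(data >> 28); // high nibble of op_own
}
```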

uct_rc_mlx5_gda_fc(const uct_rc_gdaki_dev_ep_t *ep, uint16_t wqe_idx)
{
-    return (wqe_idx & ep->sq_fc_mask) == 1;
+    return !(wqe_idx & ep->sq_fc_mask);

logic: flow-control condition inverted from (wqe_idx & mask) == 1 to !(wqe_idx & mask); these produce different true sets unless mask is crafted so that & mask yields only 0 or 1; confirm mask semantics. What is the intended range of ep->sq_fc_mask? If it's a multi-bit mask, the new logic will trigger FC on any zero result, not just when the masked value equals 1.

Contributor Author

Flow control requests a completion at least once per half of the work queue size: each time sq_fc_mask (= wqe_size/2 - 1) & wqe_idx equals the same number. Comparing to 0 is supposed to save an explicit comparison instruction.
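Concretely, assuming sq_fc_mask = sq_wqe_num/2 - 1 for a power-of-two ring as described, the negated-mask form fires exactly when the low bits of wqe_idx are zero, i.e. once per half queue (a sketch, not the UCX source):

```cpp
#include <cassert>
#include <cstdint>

// True exactly when (wqe_idx & sq_fc_mask) == 0; compiles to a test
// against zero rather than a compare against 1.
static bool gda_fc(uint16_t wqe_idx, uint16_t sq_fc_mask) {
    return !(wqe_idx & sq_fc_mask);
}
```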

@iyastreb
Contributor

Maybe you can document the performance implications (just % improvement) of this change with 1, 32, and 128 threads?

uint32_t data = READ_ONCE(*data_ptr);

*wqe_cnt = uct_rc_mlx5_gda_bswap16(data);
if (opcode != NULL) {
Contributor

IMO it's better to read it unconditionally than to have a branch.
The caller can just ignore the result if it's not needed.

Contributor Author

I expect this condition to be optimized out at compile time.

uct_rc_mlx5_gda_reserv_wqe(uct_rc_gdaki_dev_ep_t *ep, unsigned count,
unsigned lane_id, uint64_t &wqe_base)
{
wqe_base = 0;
Contributor

I intentionally added zero initialization to avoid a crash with syndrome 68

Contributor Author

Why does it cause this crash, and how does this initialization prevent it?
The code looks like the value is just overwritten by the shuffle.

Contributor

Yes, this one was quite tricky, and I also struggled to understand it.
So I asked ChatGPT and Gemini, and both pointed out that an uninitialized wqe_base leads to UB in some cases:

The CUDA execution model might still produce the correct result most of the time because the __shfl_sync instruction will force the other lanes to wait for lane 0 to arrive. When lane 0 finally executes the shuffle, its value will be correctly broadcast.

However, relying on this implicit synchronization is dangerous and can lead to undefined behavior. The code is not robust because it makes assumptions about instruction scheduling and thread divergence that may not hold true on all GPU architectures or with future compiler versions. The most significant risk is that the compiler might perform optimizations based on the uninitialized value of wqe_base in the non-zero lanes before the shuffle call, leading to incorrect code generation.

This issue was not always reproducible on rock, but quite frequently failed in CI with syndrome 68.
So it's better to keep this change.
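The pattern under discussion looks roughly like the following sketch (illustrative, not the PR's actual function body; the reservation on lane 0 is elided):

```cuda
/* Without the first assignment, lanes != 0 would feed an uninitialized
 * wqe_base into __shfl_sync, which the compiler may treat as undefined
 * behavior even though the shuffle overwrites it with lane 0's value. */
UCS_F_DEVICE void reserv_wqe_sketch(unsigned lane_id, uint64_t &wqe_base)
{
    wqe_base = 0;               /* defined on every lane before the shuffle */
    if (lane_id == 0) {
        /* ... lane 0 performs the atomic reservation into wqe_base ... */
    }
    /* broadcast lane 0's result to the whole warp */
    wqe_base = __shfl_sync(0xffffffff, wqe_base, 0);
}
```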


struct test_ucp_device_kernel_result_t {
ucs_status_t status;
uint64_t producer_index;
Contributor

Can we keep the producer index and retrieve it from sq_rsvd_index maybe?

Contributor Author

we could call uct_rc_mlx5_gda_read_cqe/calc_pi here

Comment on lines 86 to 93
-    init_attr.cq_len[UCT_IB_DIR_TX] = iface->super.super.config.tx_qp_len *
-                                      UCT_IB_MLX5_MAX_BB;
+    init_attr.cq_len[UCT_IB_DIR_TX] = 1;
uct_ib_mlx5_cq_calc_sizes(&iface->super.super.super, UCT_IB_DIR_TX,
&init_attr, 0, &cq_attr);
uct_rc_iface_fill_attr(&iface->super.super, &qp_attr.super,
iface->super.super.config.tx_qp_len, NULL);
uct_ib_mlx5_wq_calc_sizes(&qp_attr);

cq_attr.flags |= UCT_IB_MLX5_CQ_IGNORE_OVERRUN;

UCT_IB_MLX5_CQ_IGNORE_OVERRUN causes the mlx5_ifc_cqc_bits::oi bit to be set in ibv_mlx5_dv.c. The most recently-available public documentation [Rev 0.40] indicates in §7.12.8, Tables 75–76 that the cc bit should be set to enable CQE collapsing and other applications that use the feature seem to set the bit. Is this a bug with UCX, or does UCX do something different that obviates the necessity of setting the bit?

@coderabbitai

coderabbitai bot commented Nov 5, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The changes refactor the GPU-accelerated InfiniBand RDMA transport implementation by introducing CQ context field initialization, simplifying TX completion queue calculations, replacing WQE reservation logic with a CQE-based parsing mechanism, and removing legacy state tracking fields across device endpoint structures and related tests.

Changes

Cohort / File(s) Change Summary
CQ Context Initialization
src/uct/ib/mlx5/dv/ib_mlx5_dv.c
Added conditional initialization of CQ context cc flag based on cq_size value (set to 1 if cq_size equals 1, otherwise 0).
Device EP Refactoring
src/uct/ib/mlx5/gdaki/gdaki.c, src/uct/ib/mlx5/gdaki/gdaki_dev.h
Reduced TX completion queue length to fixed value; removed legacy state fields (sq_rsvd_index, sq_ready_index, sq_wqe_pi, cqe_ci, avail_count) from device endpoint struct; consolidated and simplified field initialization logic.
WQE Reservation & CQE Parsing
src/uct/ib/mlx5/gdaki/gdaki.cuh
Introduced new helpers: uct_rc_mlx5_gda_bswap16, uct_rc_mlx5_gda_parse_cqe, uct_rc_mlx5_gda_max_alloc_wqe_base; replaced reservation logic with atomicAdd/atomicCAS rollback loop; updated FC check to use negated mask; removed progress thread body; refactored completion checks to use CQE-based parsing with error handling.
Test Alignment
test/gtest/ucp/cuda/test_kernels.cu, test/gtest/ucp/cuda/test_kernels.h, test/gtest/ucp/test_ucp_device.cc
Updated producer_index calculation to use uct_rc_mlx5_gda_parse_cqe; removed avail_count field from test result struct and related assertions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Key areas requiring careful attention:

  • Atomic reservation logic in gdaki.cuh: The new atomicAdd/atomicCAS rollback loop in uct_rc_mlx5_gda_reserv_wqe_thread must be verified for correctness, race condition handling, and deadlock prevention.
  • CQE parsing implementation in gdaki.cuh: Validate uct_rc_mlx5_gda_parse_cqe correctly extracts wqe_cnt and opcode; error case handling via opcode inspection.
  • Field removal cascade: Ensure all removals from uct_rc_gdaki_dev_ep_t struct (sq_wqe_pi, cqe_ci, avail_count) don't leave orphaned references outside the changed files.
  • Completion check flow: The transition from state field tracking to CQE-based parsing requires verification that completion decisions remain correct.
  • Flow control check reversal: The FC mask negation logic change should be validated against expected behavior.

Poem

🐰 Queues optimized with parsing so clever,
Atomic dances, reservations better,
Legacy fields fade to history's store,
Device endpoints leaner than before!
✨ GDAKI bounds the GPU compute, once more.

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'UCT/GDA: Collapsed CQ' refers to a specific component change (Collapsed CQ in UCT/GDA), which is a real part of the changeset involving CQ (completion queue) modifications across multiple files.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6969a83 and 452dbfc.

📒 Files selected for processing (7)
  • src/uct/ib/mlx5/dv/ib_mlx5_dv.c (1 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki.c (2 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki.cuh (3 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki_dev.h (0 hunks)
  • test/gtest/ucp/cuda/test_kernels.cu (1 hunks)
  • test/gtest/ucp/cuda/test_kernels.h (0 hunks)
  • test/gtest/ucp/test_ucp_device.cc (0 hunks)
💤 Files with no reviewable changes (3)
  • src/uct/ib/mlx5/gdaki/gdaki_dev.h
  • test/gtest/ucp/cuda/test_kernels.h
  • test/gtest/ucp/test_ucp_device.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)

Comment on lines +133 to +135
pi = uct_rc_mlx5_gda_parse_cqe(ep, &wqe_cnt, nullptr);
return pi + ep->sq_wqe_num + 1 - count;
}

⚠️ Potential issue | 🔴 Critical

Fix off-by-one in max alloc calculation.
uct_rc_mlx5_gda_max_alloc_wqe_base() currently returns pi + sq_wqe_num + 1 - count. When the SQ is full (sq_rsvd_index == pi + sq_wqe_num) and a thread asks for one more WQE, this formula lets the fast-path guard pass, atomicAdd() succeeds, and we end up with sq_rsvd_index == pi + sq_wqe_num + 1. That means we wrap and reuse a slot that still holds an in-flight WQE, corrupting the ring and breaking progress. Drop the extra + 1 so the reservation limit never exceeds the queue capacity.

-    return pi + ep->sq_wqe_num + 1 - count;
+    return pi + ep->sq_wqe_num - count;
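The disagreement can be made concrete with a tiny model. Note the hedge: whether the + 1 is actually wrong hinges on what pi denotes (last completed WQE index vs. number of completed WQEs), which the snippet alone does not settle; the model below follows the reviewer's assumption that the queue is full when sq_rsvd_index == pi + sq_wqe_num.

```cpp
#include <cassert>
#include <cstdint>

// A reservation of 'count' slots at 'base' passes the fast-path guard
// iff base <= max_alloc_wqe_base. Both formulas modeled for comparison.
static uint64_t max_with_plus1(uint64_t pi, uint64_t wqe_num, unsigned count) {
    return pi + wqe_num + 1 - count;   // formula flagged in the review
}
static uint64_t max_fixed(uint64_t pi, uint64_t wqe_num, unsigned count) {
    return pi + wqe_num - count;       // reviewer's suggested fix
}
```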

@ofirfarjun7 ofirfarjun7 enabled auto-merge (squash) November 6, 2025 13:25
4 participants