
fix(cuda): l_skip bug #292

Open

gaxiom wants to merge 1 commit into develop-v2 from fix/l-skip-bug

Conversation


@gaxiom gaxiom commented Mar 10, 2026

Closes INT-6149

@gaxiom gaxiom requested a review from jonathanpwang March 10, 2026 23:12

gaxiom commented Mar 10, 2026

PR Summary: Fix Blackwell regression in zerocheck round 0 coset-parallel kernel

Overview

Fixes a CUDA memory ordering bug in zerocheck_ntt_evaluate_constraints_coset_parallel_kernel that caused the test_fib_air_roundtrip::where_l_skip_exceeds_log_warp_size test to fail on RTX 5090 (Blackwell) while passing on H100 (Hopper). The root cause was stale reads from a mutable NttEvalContext<1> struct that was initialized outside the per-x_int loop and mutated inside it. Blackwell's more aggressive out-of-order execution exposed the undefined behavior that Hopper masked. The fix replaces the context struct with two new direct-parameter functions, eliminating all mutable shared state between loop iterations.


crates/cuda-backend/cuda/src/logup_zerocheck/zerocheck_round0.cu

Root cause

zerocheck_ntt_evaluate_constraints_coset_parallel_kernel initialized a single NttEvalContext<1> struct before the main x_int loop, then updated eval_ctx.is_first[0], eval_ctx.is_last[0], and eval_ctx.x_int inside each iteration. The generic acc_constraints<1, NEEDS_SHMEM>() then read these array fields. On Blackwell, the compiler or hardware did not guarantee that each thread's read of is_first[0] / is_last[0] saw the write from the current iteration rather than a prior one. Passing CUDA_DEBUG=1 (which inserts additional synchronization) made the test pass, confirming the nature of the race.

Fix

Two new device functions replace the generic path for this kernel:

  • eval_dag_entry_single<NEEDS_SHMEM>() — mirrors ntt_eval_dag_entry but takes is_first, is_last, and x_int as scalar parameters rather than reading them from a context struct.
  • acc_constraints_single<NEEDS_SHMEM>() — mirrors acc_constraints<1, ...> but takes all parameters directly, with no context struct involved.

The kernel loop now computes is_first and is_last as fresh local registers each iteration and passes them directly:

// Before
NttEvalContext<1> eval_ctx{ ..., {Fp::zero()}, {Fp::zero()}, ... };
for (uint32_t x_int = ...) {
    eval_ctx.x_int     = x_int;
    eval_ctx.is_first[0] = is_first_mult * selectors_cube[x_int];      // write to struct
    eval_ctx.is_last[0]  = is_last_mult  * selectors_cube[2*num_x + x_int];
    acc_constraints<1, NEEDS_SHMEM>(constraint_sums, eval_ctx, ...);   // reads struct fields
}

// After
for (uint32_t x_int = ...) {
    Fp is_first = is_first_mult * selectors_cube[x_int];               // local register
    Fp is_last  = is_last_mult  * selectors_cube[2*num_x + x_int];
    acc_constraints_single<NEEDS_SHMEM>(constraint_sums, ..., is_first, is_last, x_int, ...);
}

Notes for reviewers

  • Intentional fork, not deduplication. eval_dag_entry_single and acc_constraints_single are a deliberate codegen fork of the generic NttEvalContext/acc_constraints path for the NUM_COSETS == 1 case. A comment in the source explicitly warns against merging them back without re-validating on Blackwell.
  • __syncthreads() correctness. The barrier inside eval_dag_entry_single (before ntt_natural_to_bitrev) is only reached on trace-column source types (ENTRY_PREPROCESSED, ENTRY_MAIN). This is safe because src.type is derived from the same DAG rule for all threads in the block, making the branch warp-uniform — either all threads hit the barrier or none do.
  • Sanitizer scope. This fix eliminates the Blackwell regression and the fib test failure. Two unrelated racecheck warnings in _CT_NTT / supra/ntt.cu remain; those are pre-existing and tracked separately.
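The warp-uniformity argument in the __syncthreads() note can be modeled in plain host C++: because the branch condition is derived only from data shared by the whole block, every thread computes the same boolean, so either all threads reach the barrier or none do. This is an illustrative sketch, not the real kernel; the enum values mirror the PR text, but `reaches_barrier` and `branch_is_uniform` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Source types as named in the PR; ENTRY_CONSTANT / ENTRY_PUBLIC are
// illustrative stand-ins for the non-trace-column cases.
enum EntryType { ENTRY_CONSTANT, ENTRY_PUBLIC, ENTRY_PREPROCESSED, ENTRY_MAIN };

static bool reaches_barrier(EntryType src_type) {
    // Only trace-column sources stage data through shared memory and
    // therefore hit the __syncthreads() before ntt_natural_to_bitrev.
    return src_type == ENTRY_PREPROCESSED || src_type == ENTRY_MAIN;
}

// Models a block of threads each evaluating the branch condition from the
// same block-shared input; returns true iff the decision is uniform.
bool branch_is_uniform(EntryType block_shared_type, int num_threads) {
    std::vector<bool> decision(num_threads);
    for (int t = 0; t < num_threads; ++t) {
        decision[t] = reaches_barrier(block_shared_type);
    }
    for (int t = 1; t < num_threads; ++t) {
        if (decision[t] != decision[0]) return false;
    }
    return true;
}
```

Because the input is uniform, the decision is uniform for every source type, including the ones that skip the barrier entirely.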

@jonathanpwang

@claude is there a way we can update the code to fix both the bug and avoid the code duplication?


claude bot commented Mar 11, 2026

Claude finished @jonathanpwang's task in 1m 54s


Analyzing code duplication avoidance options

  • Read the changed file and related generic code
  • Analyze the root cause and possible refactoring approaches
  • Post analysis and recommendation

Analysis

I've read through the full code — the changed file, the NttEvalContext struct and ntt_eval_dag_entry in dag_entry.cuh, and the logup coset-parallel kernel (which has the same pattern and is likely also vulnerable on Blackwell).

The actual root cause

The PR description says the bug is stale reads from NttEvalContext<1> fields (is_first[0], is_last[0], x_int) that are mutated inside the loop. But looking at the code more carefully, this is a single-thread data flow — each thread writes to its own local eval_ctx struct and reads from it in the same thread. There is no cross-thread data race on these fields. CUDA threads have well-defined sequential semantics for their own register/local memory.

The more likely explanation is that the NVCC compiler on Blackwell (sm_120) is incorrectly optimizing the struct mutation pattern — e.g., hoisting reads of eval_ctx.is_first[0] out of the loop because it doesn't see the write through the const & reference parameter as aliasing the same storage, or the array-in-struct pattern is confusing the alias analysis. This is essentially a compiler codegen bug triggered by passing a mutable struct by const & after mutating it in the same scope.
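The pattern in question is legal C++, and per-thread CUDA semantics give it the same sequential guarantee: a read through a `const &` must observe the write from the current iteration. This hypothetical host-side miniature of the pattern (struct and field names are illustrative, loosely echoing `NttEvalContext<1>`) shows the defined behavior any conforming compiler must produce; the claim above is that NVCC's sm_120 codegen violates it for the real kernel.

```cpp
#include <cassert>
#include <cstdint>

// Miniature of the suspect pattern: a context struct whose array fields are
// mutated each loop iteration, then read by a callee through a const
// reference -- the same shape as acc_constraints<1>() reading
// NttEvalContext<1>. All names here are illustrative.
struct Ctx {
    int is_first[1];
    int is_last[1];
    uint32_t x_int;
};

// Callee reads the per-iteration fields through a const reference.
static int consume(const Ctx &ctx) {
    return ctx.is_first[0] * 10 + ctx.is_last[0] + (int)ctx.x_int;
}

int accumulate(int n) {
    Ctx ctx{{0}, {0}, 0};              // initialized once, outside the loop
    int sum = 0;
    for (uint32_t x = 0; x < (uint32_t)n; ++x) {
        ctx.x_int = x;                 // mutated every iteration...
        ctx.is_first[0] = (x == 0);
        ctx.is_last[0]  = (x + 1 == (uint32_t)n);
        sum += consume(ctx);           // ...then read via const &
    }
    return sum;
}
```

If a compiler hoisted the `is_first[0]` / `is_last[0]` reads out of the loop (treating the `const &` as non-aliasing), the accumulated result would silently change, which is consistent with the observed Blackwell-only failure.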

Can we fix the bug without duplicating code?

Yes, there are several approaches that maintain a single code path:

Option 1: Pass is_first, is_last, x_int as separate parameters to the generic functions (recommended)

Refactor NttEvalContext to remove the per-iteration mutable fields (is_first[], is_last[], x_int) and pass them as explicit parameters to ntt_eval_dag_entry and acc_constraints. This makes the data flow explicit to both the compiler and human readers:

// In dag_entry.cuh — remove mutable per-iteration fields from context:
template <uint32_t NUM_COSETS> struct NttEvalContext {
    const Fp *__restrict__ preprocessed;
    const Fp *const *__restrict__ main_parts;
    const Fp *__restrict__ public_values;
    Fp *__restrict__ inter_buffer;
    Fp *__restrict__ ntt_buffer;
    // REMOVED: is_first, is_last (now passed as params)
    Fp omega_shifts[NUM_COSETS];
    uint32_t skip_domain;
    uint32_t height;
    uint32_t buffer_stride;
    uint32_t buffer_size;
    uint32_t ntt_idx;
    // REMOVED: x_int (now passed as param)
};

// Updated signatures:
template <uint32_t NUM_COSETS, bool NEEDS_SHMEM, bool FIRST_COSET_IS_IDENTITY = false>
__device__ __forceinline__ void ntt_eval_dag_entry(
    Fp *__restrict__ results,
    const SourceInfo &src,
    const NttEvalContext<NUM_COSETS> &ctx,
    const Fp *is_first,    // [NUM_COSETS] - passed by caller
    const Fp *is_last,     // [NUM_COSETS] - passed by caller  
    uint32_t x_int,        // passed by caller
    bool skip_ntt = false
);

// Similarly for acc_constraints / acc_interactions

This is the cleanest fix because:

  • Zero code duplication — one code path for all callers
  • Fixes the Blackwell bug at the source — per-iteration values live in caller's registers, passed directly as function args (which NVCC handles correctly)
  • Also fixes the same latent bug in logup_round0.cu (lines 400-421), which has the exact same mutable-struct-in-loop pattern and will likely break on Blackwell too
  • The compiler sees scalar/array params instead of struct field reads through a reference, eliminating the aliasing confusion
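A host-side sketch of what the Option 1 call site could look like: the per-iteration values live in the caller's locals (registers on device) and are passed as plain parameters, so the callee never reads them through a mutated struct. The names `LoopInvariantCtx`, `acc_constraints_like`, and `kernel_loop` are hypothetical, standing in for the slimmed-down `NttEvalContext` and its callers.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the refactored context: only truly loop-invariant state.
struct LoopInvariantCtx {
    uint32_t height;   // represents the remaining invariant fields
};

// Stand-in for acc_constraints with explicit per-iteration parameters.
static int acc_constraints_like(const LoopInvariantCtx &ctx,
                                int is_first, int is_last, uint32_t x_int) {
    return (int)ctx.height + is_first * 10 + is_last * 100 + (int)x_int;
}

int kernel_loop(uint32_t num_x) {
    LoopInvariantCtx ctx{7};   // built once; never mutated in the loop
    int sum = 0;
    for (uint32_t x_int = 0; x_int < num_x; ++x_int) {
        int is_first = (x_int == 0);          // fresh locals each iteration
        int is_last  = (x_int + 1 == num_x);
        sum += acc_constraints_like(ctx, is_first, is_last, x_int);
    }
    return sum;
}
```

With this shape there is nothing mutable for alias analysis to mishandle: the context reference is genuinely read-only for the loop's lifetime, and the per-iteration values are ordinary function arguments.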

Option 2: Mark mutable fields as volatile

volatile Fp is_first[NUM_COSETS];
volatile Fp is_last[NUM_COSETS];
volatile uint32_t x_int;

This would prevent the compiler from optimizing away the reads, but it's a band-aid that pessimizes codegen for all architectures and doesn't fix the root design issue.

Option 3: Construct a fresh NttEvalContext each iteration

for (uint32_t x_int = x_int_base; x_int < num_x; x_int += x_int_stride) {
    Fp is_first = is_first_mult * selectors_cube[x_int];
    Fp is_last = is_last_mult * selectors_cube[2 * num_x + x_int];
    NttEvalContext<1> eval_ctx{ ..., {is_first}, {is_last}, ..., x_int };
    acc_constraints<1, NEEDS_SHMEM>(constraint_sums, eval_ctx, ...);
}

This might work but still relies on the compiler correctly handling the struct-through-reference pattern. Less confidence it fixes the Blackwell issue.

Recommendation

Option 1 is the best path forward. It:

  1. Eliminates the duplicated eval_dag_entry_single and acc_constraints_single (~130 lines)
  2. Fixes the bug with certainty (register params, not struct field reads)
  3. Proactively fixes the same pattern in logup_round0.cu (which currently uses NttEvalContext<1> with the same mutable-struct-in-loop at line 400-421)
  4. Makes the API more honest — the "context" only contains truly loop-invariant state

The trade-off is that it touches dag_entry.cuh (shared header) and requires updating all call sites in both zerocheck_round0.cu and logup_round0.cu, but it's a mechanical refactor with no semantic changes.

If you'd like me to implement Option 1, let me know and I can push a commit.


@chatgpt-codex-connector

This comment was marked as resolved.

@jonathanpwang
Copy link
Contributor

@codex is there a way we can update the code to fix both the bug and avoid the code duplication?

@chatgpt-codex-connector

Summary

Testing

  • ⚠️ python - <<'PY' ... urllib.request.urlopen('https://duckduckgo.com/html/?q=...') ... PY (web search attempt failed in this environment due to proxy/tunnel 403 Forbidden).
  • ⚠️ cargo check -p openvm-cuda-backend (fails in this container because CUDA is unavailable: build script reports CUDA is not available).
  • cargo check -p openvm-stark-backend
  • cargo test -p openvm-stark-backend --lib -- --nocapture


leonardoalt pushed a commit to powdr-labs/stark-backend that referenced this pull request Mar 11, 2026
…PleMatrix` (openvm-org#292)

This PR effectively removes any materialization of the stacked matrix
except in the Reed-Solomon codeword matrix. It was more complicated than
expected because it means any time we do something that should
theoretically be about the stacked matrix, we must stack it as we go in
the cuda kernels. Unfortunately this breaks any abstraction boundary we
could keep with respect to stacking (i.e., now every kernel needs to
replicate the stacking in exactly the same way) -- I'm not sure how to
fix that.

Also re-organized barycentric evaluation utilities since we'll use them
more.

closes INT-5637
