Conversation
PR Summary: Fix Blackwell regression in zerocheck round 0 coset-parallel kernel

Overview

Fixes a CUDA memory ordering bug in
@claude is there a way we can update the code to fix both the bug and avoid the code duplication?
Claude finished @jonathanpwang's task in 1m 54s
Analysis

I've read through the full code — the changed file, the

The actual root cause

The PR description says the bug is stale reads from the mutable per-iteration context fields. The more likely explanation is that the NVCC compiler on Blackwell (sm_120) is incorrectly optimizing the struct mutation pattern — e.g., hoisting reads of those fields out of the loop.

Can we fix the bug without duplicating code?

Yes, there are several approaches that maintain a single code path:

Option 1: Pass the per-iteration values as explicit parameters

Refactor the context and the entry-point signatures:

// In dag_entry.cuh — remove mutable per-iteration fields from context:
template <uint32_t NUM_COSETS> struct NttEvalContext {
const Fp *__restrict__ preprocessed;
const Fp *const *__restrict__ main_parts;
const Fp *__restrict__ public_values;
Fp *__restrict__ inter_buffer;
Fp *__restrict__ ntt_buffer;
// REMOVED: is_first, is_last (now passed as params)
Fp omega_shifts[NUM_COSETS];
uint32_t skip_domain;
uint32_t height;
uint32_t buffer_stride;
uint32_t buffer_size;
uint32_t ntt_idx;
// REMOVED: x_int (now passed as param)
};
// Updated signatures:
template <uint32_t NUM_COSETS, bool NEEDS_SHMEM, bool FIRST_COSET_IS_IDENTITY = false>
__device__ __forceinline__ void ntt_eval_dag_entry(
Fp *__restrict__ results,
const SourceInfo &src,
const NttEvalContext<NUM_COSETS> &ctx,
const Fp *is_first, // [NUM_COSETS] - passed by caller
const Fp *is_last, // [NUM_COSETS] - passed by caller
uint32_t x_int, // passed by caller
bool skip_ntt = false
);
// Similarly for acc_constraints / acc_interactions

This is the cleanest fix because:
Option 2: Mark mutable fields as volatile

volatile Fp is_first[NUM_COSETS];
volatile Fp is_last[NUM_COSETS];
volatile uint32_t x_int;

This would prevent the compiler from optimizing away the reads, but it's a band-aid that pessimizes codegen for all architectures and doesn't fix the root design issue.

Option 3: Construct a fresh NttEvalContext per iteration

for (uint32_t x_int = x_int_base; x_int < num_x; x_int += x_int_stride) {
Fp is_first = is_first_mult * selectors_cube[x_int];
Fp is_last = is_last_mult * selectors_cube[2 * num_x + x_int];
NttEvalContext<1> eval_ctx{ ..., {is_first}, {is_last}, ..., x_int };
acc_constraints<1, NEEDS_SHMEM>(constraint_sums, eval_ctx, ...);
}

This might work but still relies on the compiler correctly handling the struct-through-reference pattern. Less confidence it fixes the Blackwell issue.

Recommendation

Option 1 is the best path forward. It:
The trade-off is that it touches

If you'd like me to implement Option 1, let me know and I can push a commit.
@codex is there a way we can update the code to fix both the bug and avoid the code duplication?
Summary
Testing
…PleMatrix` (openvm-org#292)

This PR effectively removes any materialization of the stacked matrix except in the Reed-Solomon codeword matrix. It was more complicated than expected, because any time we do something that should theoretically operate on the stacked matrix, we must stack it as we go in the CUDA kernels. Unfortunately this breaks any abstraction boundary we could keep with respect to stacking (i.e., now every kernel needs to replicate the stacking in exactly the same way); I'm not sure how to fix that. Also re-organized the barycentric evaluation utilities, since we'll use them more.

closes INT-5637
Closes INT-6149