Open
Description
I'm observing a mismatch between host-side and device-side indirect map access in generated CUDA kernels. The root cause is an incorrect stride used in the generated CUDA kernel when indexing the transposed map_data_d array.
Summary
- The function `op_decl_map()` in `op2/src/cuda/op_cuda_decl.cpp` transposes host maps to the device layout, and each column is padded to a multiple of 32 using `round32(set_size)`.
- However, the generated CUDA kernels continue to index `map_data_d` using the unpadded stride `set_size = set->size + set->exec_size`, e.g.:

      map1idx = opDat1Map[n + set_size * 0];
      map2idx = opDat1Map[n + set_size * 1];
      map3idx = opDat1Map[n + set_size * 2];

- This produces misaligned column access and the kernel reads into the padding region (zeros / uninitialized values) instead of the next column.
- By debugging, I confirmed that:
  - Column 0 data is correct (e.g., all `map1idx` in the above example).
  - At indices `[set_size .. round32(set_size)-1]` the map contains padding zeros.
  - Column 1 starts at `round32(set_size)`, not at `set_size`, resulting in wrong `map2idx` and `map3idx`.
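
The mismatch described above can be reproduced on the host with a small sketch. This is not the OP2 code: `round32`, `transpose_padded`, and `count_mismatches` are hypothetical helpers written for illustration, and the row-major host layout (`host[n * dim + d]`) is an assumption. The sketch builds a transposed map whose columns are padded to `round32(set_size)`, then compares reads done with the padded stride against reads done with the unpadded `set_size` stride.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed helper: round up to the next multiple of 32. It mirrors what the
// issue calls round32(); the actual OP2 definition may differ.
static int round32(int n) { return (n + 31) / 32 * 32; }

// Transpose a row-major host map (set_size rows x dim columns, an assumed
// layout) into a column-major array whose columns are padded to
// round32(set_size). Padding entries are left as zero.
static std::vector<int> transpose_padded(const std::vector<int>& host,
                                         int set_size, int dim) {
  assert(host.size() == static_cast<std::size_t>(set_size) * dim);
  const int stride = round32(set_size);
  std::vector<int> dev(static_cast<std::size_t>(stride) * dim, 0);
  for (int n = 0; n < set_size; ++n)
    for (int d = 0; d < dim; ++d)
      dev[n + stride * d] = host[n * dim + d];
  return dev;
}

// Count the (n, d) entries for which indexing the transposed array with the
// given stride returns something other than the host value.
static int count_mismatches(const std::vector<int>& host,
                            const std::vector<int>& dev,
                            int set_size, int dim, int stride) {
  int bad = 0;
  for (int n = 0; n < set_size; ++n)
    for (int d = 0; d < dim; ++d)
      if (dev[n + stride * d] != host[n * dim + d]) ++bad;
  return bad;
}
```

With `set_size = 40` and `dim = 3` (so the padded stride is 64) and distinct nonzero host values, calling `count_mismatches` with stride `round32(set_size)` reports no mismatches, while stride `set_size` leaves column 0 intact and gets every entry of columns 1 and 2 wrong. When `set_size` is already a multiple of 32 the two strides coincide, which may explain why the problem does not show up for every set.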