Legacy translator: Incorrect stride for map indexing in CUDA kernels #255

I'm observing a mismatch between host-side and device-side indirect map access in generated CUDA kernels. The root cause is an incorrect stride used in the generated CUDA kernel when indexing the transposed `map_data_d` array.

**Summary**

* The function `op_decl_map()` in `op2/src/cuda/op_cuda_decl.cpp` transposes host maps to the device layout and pads each column to a multiple of 32, so the column stride on the device is `round32(set_size)`.
* However, the generated CUDA kernels continue to index `map_data_d` using the **unpadded stride** `set_size = set->size + set->exec_size`, e.g.:

```cpp
map1idx = opDat1Map[n + set_size * 0];
map2idx = opDat1Map[n + set_size * 1];
map3idx = opDat1Map[n + set_size * 2];
```

* This produces misaligned column access and the kernel reads into the padding region (zeros / uninitialized values) instead of the next column.
* By debugging, I confirmed that:
  * Column 0 data is correct (e.g., all `map1idx` in the above example).
  * At indices `[set_size .. round32(set_size)-1]` the map contains padding zeros.
  * Column 1 starts at `round32(set_size)`, not at `set_size`, resulting in wrong `map2idx` and `map3idx`.
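
To make the mismatch concrete, here is a minimal host-side sketch of the layout described above. It is not the OP2 code: `transpose_padded` and `read_column` are hypothetical helpers written for this illustration, and the body of `round32` is an assumption (round up to the next multiple of 32) rather than a copy of the library's definition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed behaviour: round x up to the next multiple of 32.
static int round32(int x) { return (x + 31) & ~31; }

// Hypothetical stand-in for the transposition: the host map is row-major
// (host[n * dim + d]); the device copy is column-major with each column
// padded to round32(set_size) entries. Padding is zero-filled here.
static std::vector<int> transpose_padded(const std::vector<int>& host,
                                         int set_size, int dim) {
  const int stride = round32(set_size);
  std::vector<int> dev(static_cast<std::size_t>(stride) * dim, 0);
  for (int n = 0; n < set_size; ++n)
    for (int d = 0; d < dim; ++d)
      dev[d * stride + n] = host[n * dim + d];
  return dev;
}

// Hypothetical accessor: reads entry n of column d using the given stride.
// Passing set_size reproduces the generated kernels' indexing; passing
// round32(set_size) matches the padded layout built above.
static int read_column(const std::vector<int>& dev, int n, int d, int stride) {
  return dev[n + stride * d];
}
```

With `set_size = 40` and `dim = 3`, the padded stride is 64. `read_column(dev, 5, 1, set_size)` reads index 45, which lies in column 0's padding, while `read_column(dev, 5, 1, round32(set_size))` reads index 69, which is entry 5 of column 1. Whether the kernels should adopt the padded stride or the transposition should drop the padding is a separate design question; the sketch only shows that the two strides disagree.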


