Legacy translator: Incorrect stride for map indexing in CUDA kernels #255

@mattbuergler

Description

I'm observing a mismatch between host-side and device-side indirect map access in generated CUDA kernels. The root cause is an incorrect stride used in the generated CUDA kernel when indexing the transposed map_data_d array.

Summary

  • The function op_decl_map() in op2/src/cuda/op_cuda_decl.cpp transposes host maps to the device layout, padding each column to a multiple of 32 using round32(set_size).
  • However, the generated CUDA kernels continue to index map_data_d using the unpadded stride set_size = set->size + set->exec_size, e.g.:
      map1idx = opDat1Map[n + set_size * 0];
      map2idx = opDat1Map[n + set_size * 1];
      map3idx = opDat1Map[n + set_size * 2];
  • This misaligns column access: the kernel reads into the padding region (zeros / uninitialized values) instead of the next column.
  • By debugging, I confirmed that:
    • Column 0 data is correct (e.g., all map1idx in the above example).
    • At indices [set_size .. round32(set_size)-1] the map contains padding zeros.
    • Column 1 starts at round32(set_size), not at set_size, so map2idx and map3idx are wrong.
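
To make the mismatch concrete, here is a minimal host-side sketch in plain C++ (no CUDA needed) that mimics the padded transpose and compares the two strides. It is not the OP2 code: transpose_padded, lookup_unpadded and lookup_padded are hypothetical helper names, the round32 body is my assumption of "round up to the next multiple of 32", and filling the padding with zeros is assumed from what I saw while debugging.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed behaviour of round32(): round up to the next multiple of 32.
inline int round32(int n) { return (n + 31) & ~31; }

// Hypothetical stand-in for the transpose done when the map is declared:
// input is row-major (set_size rows, dim entries per row); output stores one
// column after another, each column padded to round32(set_size) entries.
// Padding is zero-filled here (an assumption for illustration).
std::vector<int> transpose_padded(const std::vector<int>& host_map,
                                  int set_size, int dim) {
    assert(host_map.size() == static_cast<std::size_t>(set_size) * dim);
    const int stride = round32(set_size);
    std::vector<int> dev(static_cast<std::size_t>(stride) * dim, 0);
    for (int n = 0; n < set_size; ++n)
        for (int c = 0; c < dim; ++c)
            dev[n + stride * c] = host_map[n * dim + c];
    return dev;
}

// Indexing as in the generated kernel quoted above: unpadded stride.
inline int lookup_unpadded(const std::vector<int>& dev, int n, int col,
                           int set_size) {
    return dev[n + set_size * col];
}

// Indexing with the stride the padded layout actually uses.
inline int lookup_padded(const std::vector<int>& dev, int n, int col,
                         int set_size) {
    return dev[n + round32(set_size) * col];
}
```

For example, with set_size = 5 the padded stride is 32, so column 1 begins at offset 32. Element n = 2 of column 1 is therefore at offset 34, while n + set_size * 1 = 7 lands in the padding of column 0. Column 0 is unaffected because both strides are multiplied by zero, which matches the observation that only map2idx and map3idx are wrong. Whether the right fix is to change the generated stride or the transpose is a separate question; the sketch only shows that the two must agree.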
