
[SYCL][ext] Define and Implement sycl_ext_tensor_map #16247

Open · wants to merge 1 commit into sycl from luke/ext_cuda_tensor_map

Conversation

@ldrumm (Contributor) commented Dec 3, 2024

This is a fairly mechanical implementation of the basic infrastructure
required to access CUDA TMA descriptors from within SYCL kernels, while
initializing them on the host. The new feature exposes two new classes
and associated support structure in
sycl::ext::codeplay::experimental::cuda.

There's some ugliness involved to make this work on account of the way
NVIDIA implemented this basic feature, but it's all in the name of
{legitimate-field-of-endeavour}.
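
For context: on the NVIDIA side, a TMA descriptor is an opaque CUtensorMap encoded on the host with the driver API's cuTensorMapEncodeTiled and then handed to the kernel. Below is a rough sketch of that host-side call; the tensor extent, tile size, and option values are illustrative assumptions, and this is the raw driver API, not the extension's own interface.

// Sketch only: encoding a 2D TMA descriptor with the CUDA driver API.
// The extent, tile size, and option values here are assumptions.
#include <cuda.h>

CUtensorMap encode_2d_tensor_map(void *global_ptr /* 16-byte aligned */) {
  CUtensorMap tmap;  // opaque 128-byte descriptor, later passed to the kernel
  const cuuint64_t global_dim[2]     = {1024, 1024};            // extent in elements
  const cuuint64_t global_strides[1] = {1024 * sizeof(float)};  // row pitch in bytes
  const cuuint32_t box_dim[2]        = {64, 64};                // tile per TMA copy
  const cuuint32_t elem_strides[2]   = {1, 1};
  CUresult res = cuTensorMapEncodeTiled(
      &tmap, CU_TENSOR_MAP_DATA_TYPE_FLOAT32, /*tensorRank=*/2, global_ptr,
      global_dim, global_strides, box_dim, elem_strides,
      CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  (void)res;  // error handling omitted in this sketch
  return tmap;
}

The extension essentially wraps this kind of host-side initialization so that ports such as CUTLASS don't have to include CUDA headers directly.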

@ldrumm requested review from a team as code owners on December 3, 2024 17:04
@ldrumm requested a review from againull on December 3, 2024 17:04
@ldrumm (Contributor, Author) commented Dec 3, 2024

This depends on Hugh's work for Unified Runtime here

@AlexeySachkov (Contributor) left a comment

DeviceConfigFile changes LGTM

@ldrumm force-pushed the luke/ext_cuda_tensor_map branch from 3d2c85f to 16f22f6 on December 3, 2024 17:24
@ldrumm (Contributor, Author) commented Dec 3, 2024

https://github.com/intel/llvm/actions/runs/12144906122/job/33865361253?pr=16247

I saw this build error locally due to a stale build tree. Are we not doing clean checkouts for CI?

@sarnex (Contributor) commented Dec 3, 2024

No, we used cached checkouts.

@ldrumm (Contributor, Author) commented Dec 3, 2024

> No, we used cached checkouts.

How do I clear them?

@sarnex (Contributor) commented Dec 3, 2024

I have to log into the runners and do it manually, but I don't know if other PRs will end up in the cache and cause the same problem. I'll try, give me a sec.

@ldrumm (Contributor, Author) commented Dec 3, 2024

> I have to log into the runners and do it manually, but I don't know if other PRs will end up in the cache and cause the same problem. I'll try, give me a sec.

Thanks. I'll see if I can track down why ur_api_funcs.def isn't considered out of date when the UR repo FetchContent changes.

@ldrumm (Contributor, Author) commented Dec 3, 2024

add_custom_command(
  OUTPUT  ${OUT_HEADERS_IN_SYCL_DIR}
          ${OUT_HEADERS_IN_CL_DIR}
          ${OUT_HEADERS_IN_STD_DIR}
          ${OUT_HEADERS_IN_SYCLCOMPAT_DIR}
  DEPENDS ${HEADERS_IN_SYCL_DIR}
          ${HEADERS_IN_CL_DIR}
          ${HEADERS_IN_STD_DIR}
          ${HEADERS_IN_SYCLCOMPAT_DIR}
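          # Note: the Unified Runtime headers copied below (ur_api.h,
          # ur_api_funcs.def, ur_print.hpp) are not listed in DEPENDS,
          # so this rule is not re-run when they change.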
  COMMAND ${CMAKE_COMMAND} -E copy_directory ${sycl_inc_dir}/sycl ${SYCL_INCLUDE_BUILD_DIR}/sycl
  COMMAND ${CMAKE_COMMAND} -E copy_directory ${sycl_inc_dir}/CL ${SYCL_INCLUDE_BUILD_DIR}/CL
  COMMAND ${CMAKE_COMMAND} -E copy_directory ${sycl_inc_dir}/std ${SYCL_INCLUDE_BUILD_DIR}/std
  COMMAND ${CMAKE_COMMAND} -E copy_directory ${sycl_inc_dir}/syclcompat ${SYCL_INCLUDE_BUILD_DIR}/syclcompat
  COMMAND ${CMAKE_COMMAND} -E copy ${sycl_inc_dir}/syclcompat.hpp ${SYCL_INCLUDE_BUILD_DIR}/syclcompat.hpp
  COMMAND ${CMAKE_COMMAND} -E copy ${UNIFIED_RUNTIME_INCLUDE_DIR}/ur_api.h ${SYCL_INCLUDE_BUILD_DIR}
  COMMAND ${CMAKE_COMMAND} -E copy ${UNIFIED_RUNTIME_INCLUDE_DIR}/ur_api_funcs.def ${SYCL_INCLUDE_BUILD_DIR}
  COMMAND ${CMAKE_COMMAND} -E copy ${UNIFIED_RUNTIME_INCLUDE_DIR}/ur_print.hpp ${SYCL_INCLUDE_BUILD_DIR}
  COMMENT "Copying SYCL headers ...")

Yeah there's no dependency on the input files for the UR headers. I'll submit a patch

@sarnex (Contributor) commented Dec 3, 2024

Cool. I tried clearing the cache but it didn't work because we check out intel/llvm HEAD first, so we hit the problem again. Ping me on the PR for the CMake fix and I'll try to fast track it

@ldrumm (Contributor, Author) commented Dec 4, 2024

> Yeah there's no dependency on the input files for the UR headers. I'll submit a patch

#16261

@ldrumm force-pushed the luke/ext_cuda_tensor_map branch from 43d7ed1 to c8dee17 on December 5, 2024 16:41
@againull (Contributor) left a comment

Could you please add tests for this feature?

> should not rely on APIs defined in this specification.* It is likely to be
> generalized and significantly change in later revisions as more backend vendors
> implement analogous features of more or less expressivity and generality than
> shown here.
A reviewer (Contributor) commented:
I'm not wild about adding this new category of "oneapi-only" extension. I'm guessing this is for XeTLA? If that is the only library we expect to use this, why not add the support directly in that library, rather than adding a SYCL API for it? I imagine this would end up calling CUDA APIs or inline asm statements, but I think XeTLA does this already for other devices. I'd feel differently if we were adding a general API that other applications could make use of, but that's not what we're doing in this PR.

@ldrumm (Contributor, Author) replied:

For CUTLASS, actually. But yeah, it's not great to be so limiting.

We discussed a couple of ways to do this (one of my suggestions being interop), but it seems there's little appetite for including CUDA-specific headers in these ports.

To be clear, there's no reason we really need to be so limiting with our wording here; I only added it because I'm not completely confident the feature has uses outside the CUTLASS case and wanted to limit the maintenance burden.

Would relaxing the language here be appropriate?

@ldrumm (Contributor, Author) replied:

I've used the standard boilerplate language for the Status section

@ldrumm force-pushed the luke/ext_cuda_tensor_map branch from c8dee17 to 1fba3c9 on December 10, 2024 19:07
@ldrumm (Contributor, Author) commented Dec 10, 2024

> Could you please add tests for this feature?

@againull Good catch. I've added aspect and macro tests. Using this feature requires an sm_90+ GPU and inline assembly, so I've skipped testing that part. Hope that's enough.
