[AMDGPU] error running program when compiled with asan #127241

Closed
zyx-billy opened this issue Feb 14, 2025 · 14 comments
Labels
compiler-rt:asan Address sanitizer

Comments

@zyx-billy
Contributor

I'm working on adding asan when emitting for amd gpus using the upstream llvm backend. Right now I'm doing these things:

  • Add +xnack to target features
  • Enable AddressSanitizerPass in the llvm pipeline

And when running the program, I explicitly link the asan libraries, and enable xnack:

LD_PRELOAD=/opt/rocm/lib/llvm/lib/clang/18/lib/linux/libclang_rt.asan-x86_64.so:/opt/rocm/lib/asan/libamdhip64.so
ASAN_OPTIONS=detect_leaks=0
HSA_XNACK=1

But I run into this error when trying to invoke a kernel:

hipErrorNoBinaryForGpu (no kernel image is available for execution on the device)

Removing the asan llvm pass makes the program run fine (but of course, it won't detect any errors), indicating that everything else seems to work. I also tried linking asanrtl.bc, ocml.bc, & ockl.bc into the IR before running llvm passes (following the impl here), but got the same error.

My questions are:

  • Are there any other configs that should be set when compiling?
  • Is there any way to tell what exactly is incompatible when running? That'll help narrow down what to do.

Happy to provide more context. Thank you!

@EugeneZelenko EugeneZelenko added compiler-rt:asan Address sanitizer and removed new issue labels Feb 14, 2025
@EugeneZelenko
Contributor

Could you please try 19, 20 release candidate or main branch?

@searlmc1
Collaborator

@b-sumner

@zyx-billy Can you dump the e_flags from the code object you are producing? What GPU and code object version are you targeting? What kind of GPU do you have on the system? You can see how to decode the flags at https://llvm.org/docs/AMDGPUUsage.html#header.

@zyx-billy
Contributor Author

zyx-billy commented Feb 14, 2025

Thanks for the response. Right now we're using a pretty up to date version of main (ea6827c as of 4 days ago).

Here's the gpu on our system from HSA_XNACK=1 rocminfo | grep Name:

  ...
  Name:                    gfx942                             
  Marketing Name:          AMD Instinct MI300X                
  Vendor Name:             AMD                                
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+

Our code is targeting gfx942 with code object version 5. This is the top of our amdgcn:

	.amdgcn_target "amdgcn-amd-amdhsa--gfx942:xnack+"
	.amdhsa_code_object_version 5
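
The agent reports `gfx942:sramecc+:xnack+` while the code object declares only `gfx942:xnack+`. As a rough sketch of why that pairing should be acceptable, the snippet below models the target ID rule as I understand it from the LLVM AMDGPU usage guide: a feature the code object leaves unspecified matches any agent setting, while a specified feature must match exactly. The function names are hypothetical and the model is simplified; it is not the loader's actual implementation.

```python
def parse_target_id(target_id):
    """Split 'amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+' into (processor, features)."""
    # Everything after the '--' is '<processor>[:<feature><+|->]...'
    processor, *settings = target_id.split("--", 1)[-1].split(":")
    # Map feature name -> True for '+', False for '-'
    features = {s[:-1]: s[-1] == "+" for s in settings}
    return processor, features


def code_object_runs_on(code_object_id, agent_id):
    """Simplified compatibility check (assumption: mirrors the documented rule)."""
    co_proc, co_feats = parse_target_id(code_object_id)
    ag_proc, ag_feats = parse_target_id(agent_id)
    if co_proc != ag_proc:
        return False
    # Features the code object does not mention are treated as "any".
    return all(ag_feats.get(name) == value for name, value in co_feats.items())


agent = "amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+"
print(code_object_runs_on("amdgcn-amd-amdhsa--gfx942:xnack+", agent))
print(code_object_runs_on("amdgcn-amd-amdhsa--gfx942:xnack-", agent))
```

Under this model the `xnack+` code object is accepted and an `xnack-` one would be rejected, which suggests the target ID itself is not the cause of `hipErrorNoBinaryForGpu` here.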

And the e_flags from our code object is 0x74c, which seems to be correct?

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 40 03 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            <unknown: 40>
  ABI Version:                       3
  Type:                              DYN (Shared object file)
  Machine:                           AMD GPU
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          120192 (bytes into file)
  Flags:                             0x74c
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         20
  Section header string table index: 18

BTW, in case sramecc mattered, I added it to our target features when compiling (this resulted in e_flags of 0xf4c), but it also ran into the same error.
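
For reference, the two e_flags values above can be decoded by hand. This is a minimal sketch using the code object V4+ bit layout as I read it in the AMDGPUUsage "Header" section (machine in the low byte, XNACK in mask 0x300, SRAMECC in mask 0xc00). The table holds only the machine entry needed here, and the function name is hypothetical.

```python
MACH_MASK = 0x0FF
XNACK_MASK = 0x300
SRAMECC_MASK = 0xC00

# Only the machine value relevant to this thread; the full table is in the LLVM docs.
MACH_NAMES = {0x04C: "gfx942"}
XNACK = {0x000: "unsupported", 0x100: "any", 0x200: "off", 0x300: "on"}
SRAMECC = {0x000: "unsupported", 0x400: "any", 0x800: "off", 0xC00: "on"}


def decode_e_flags(flags):
    """Decode AMDGPU ELF e_flags (code object V4 and later layout)."""
    mach = flags & MACH_MASK
    return {
        "mach": MACH_NAMES.get(mach, hex(mach)),
        "xnack": XNACK[flags & XNACK_MASK],
        "sramecc": SRAMECC[flags & SRAMECC_MASK],
    }


print(decode_e_flags(0x74C))  # {'mach': 'gfx942', 'xnack': 'on', 'sramecc': 'any'}
print(decode_e_flags(0xF4C))  # {'mach': 'gfx942', 'xnack': 'on', 'sramecc': 'on'}
```

Both values decode to gfx942 with XNACK on, so the flags look consistent with the `xnack+` target and with the agent reported by rocminfo.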

One random thought, is it possible the .bc files I linked are somehow incompatible? FWIW, linking them but not running the asan pass also results in the same error (or maybe that's expected behavior). If I don't link them, and also don't run the asan pass, the program runs fine.

@CRobeck
Member

CRobeck commented Feb 14, 2025

A couple thoughts about what it might be:

  1. asanrtl.bc, if it was pulled directly from that Triton PR, was not compiled for the correct target or was built with the wrong version of LLVM. If you did pull those .bc files directly from my Triton PR, can you instead grab them from the ROCM/LLVM asan dir?
  2. asanrtl.bc is not being linked correctly into the generated binary (i.e. it's not being passed down to the linker correctly). You'd probably have to build and run a debug version of the runtime to see this.

Are you JIT compiling this? Do you have a .bc for the actual kernel and can you try merging/linking those bc files and see if you get a clash from llvm-as?

@zyx-billy
Contributor Author

Oh yes I got the .bc files from the rocm sdk on our system. Is it possible that's too old still? I can try a newer release.

/opt/rocm-6.2.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/asanrtl.bc

I'm compiling this ahead of time. Basically I have the in-memory LLVM IR of our kernel, and I used the exact same linking logic from the triton impl here on my IR module. Then I ran it through llvm optimizations and backend lowering passes to get the object file.

@b-sumner

Can you see any more information about the failure when setting environment AMD_LOG_LEVEL=2?

Is the LLVM version you're using to create code objects 18 or something newer?

Can you move to ROCm 6.3.2?

@CRobeck
Member

CRobeck commented Feb 14, 2025

Can you also double check the fields in the attributes of the .ll IR file? You should see something like:
attributes #0 = { ..."target-features"="+xnack" }
attributes #1 = {...sanitize_address }
You should be able to just run llvm-dis on your combined .bc kernel file to get the .ll.
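
A small helper can run this check on an `attributes` line copied from the disassembled IR. It is only a text-level sketch with a hypothetical function name; it does not parse LLVM IR properly, and it assumes the attribute group is written on a single line.

```python
import re


def check_kernel_attributes(attr_line):
    """Report whether an IR 'attributes #N = { ... }' line enables ASan and xnack."""
    match = re.search(r'"target-features"="([^"]*)"', attr_line)
    feature_list = match.group(1).split(",") if match else []
    return {
        "sanitize_address": re.search(r"\bsanitize_address\b", attr_line) is not None,
        "xnack": "+xnack" in feature_list,
    }


line = 'attributes #0 = { sanitize_address "target-cpu"="gfx942" "target-features"="+xnack" }'
print(check_kernel_attributes(line))
```

If either entry comes back `False` for the kernel's attribute group, the sanitizer pass or the target feature was probably not applied to that function.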

@zyx-billy
Contributor Author

hmm, the only additional output I get with AMD_LOG_LEVEL=2 is

:1:hip_fatbin.cpp           :91  : 5535792822252 us: [pid:164883 tid:0x7fe5fba447c0] All Unique FDs are closed

And yes I see these attributes on our kernel in the combined IR:

attributes #0 = { ... sanitize_address "target-cpu"="gfx942" "target-features"="+xnack"}

I tried linking in the .bc files after optimization passes instead, but it didn't make a difference.

The LLVM I'm using is very recent (< 1 week old on main). I'll retry with the latest ROCm release.

@b-sumner

The closer you can get the device library to the compiler you're using the better. @CRobeck and I have seen this before elsewhere, but I'm not clear on exactly what cleared it up then.

@zyx-billy
Contributor Author

Unfortunately I get the same error with 6.3.2 (and there's also no additional output under AMD_LOG_LEVEL=2 anymore). It looks like the updated asanrtl.bc library is also created with clang 18, though its contents do differ.

@b-sumner

Does environment LOADER_ENABLE_LOGGING=1 give any additional output?

@zyx-billy
Contributor Author

oh amazing! Just what I was looking for. It gives:

LoaderError: symbol "__oclc_ABI_version" is undefined

And indeed all I see is this in our combined IR:

@__oclc_ABI_version = external local_unnamed_addr addrspace(4) constant i32, align 4

The same goes for __oclc_ISA_version and __oclc_wavefrontsize64.
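
This kind of problem can be caught before loading by scanning the disassembled IR for `__oclc_*` globals that are still `external`. The sketch below is a plain text scan with hypothetical names, assuming each global is declared at the start of its own line as in the snippet above; it is not a real IR parser.

```python
import re

# Matches e.g. '@__oclc_ABI_version = external local_unnamed_addr addrspace(4) constant i32'
OCLC_EXTERNAL = re.compile(r"^@(__oclc_\w+)\s*=\s*external\b", re.MULTILINE)


def undefined_oclc_symbols(ir_text):
    """Return the sorted names of __oclc_* globals that are declared but not defined."""
    return sorted(set(OCLC_EXTERNAL.findall(ir_text)))


sample_ir = """\
@__oclc_ABI_version = external local_unnamed_addr addrspace(4) constant i32, align 4
@some_other_global = internal addrspace(1) global i32 0
"""
print(undefined_oclc_symbols(sample_ir))  # ['__oclc_ABI_version']
```

An empty list after linking the device libraries would mean every control variable has a definition in the module.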

I looked around and found that these values need to be set onto the IR. When I manually added them by linking to the relevant .bc files that came with the install (e.g. oclc_abi_version_500.bc), I no longer get any errors loading the kernel! (For posterity, when lowering through the standard MLIR path, I found the LLVM/ROCDL target lowering logic that does this).
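
For anyone reproducing this from the command line rather than through the LLVM API, the extra inputs can be sketched roughly as follows. Only `oclc_abi_version_500.bc` is named in this thread; the ISA-version and wavefront-size file names below are assumptions that follow the same naming pattern, and the helper names and kernel file names are hypothetical. The bitcode directory is the one from the ROCm 6.2.0 install mentioned earlier.

```python
from pathlib import Path


def oclc_control_files(bitcode_dir, abi_version, isa_version, wave64):
    """Build paths to the control-variable bitcode files (names partly assumed)."""
    names = [
        f"oclc_abi_version_{abi_version}.bc",   # named in this thread for ABI 500
        f"oclc_isa_version_{isa_version}.bc",   # assumed name, same pattern
        f"oclc_wavefrontsize64_{'on' if wave64 else 'off'}.bc",  # assumed name
    ]
    return [str(Path(bitcode_dir) / name) for name in names]


def llvm_link_command(kernel_bc, extra_bc, output_bc):
    """Return an llvm-link argv that merges the kernel with the extra bitcode files."""
    return ["llvm-link", kernel_bc, *extra_bc, "-o", output_bc]


bitcode_dir = "/opt/rocm-6.2.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode"
extras = oclc_control_files(bitcode_dir, abi_version=500, isa_version=942, wave64=True)
print(" ".join(llvm_link_command("kernel.bc", extras, "kernel.linked.bc")))
```

The command is only printed here, not executed. Whether these three files are sufficient for a given kernel depends on which `__oclc_*` symbols the linked device libraries actually reference.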

Testing with a correct program, it runs to completion without errors. Testing with an out-of-bounds array access, I get an asan report correctly (with debuginfo interpreted correctly). Thank you for all of your help 🙏 ! This has been immensely helpful.

@zyx-billy
Contributor Author

Oh and btw, was able to confirm this works on 6.2.0 too. Closing the issue then. Thank you! 🙏
