[AMDGPU] error running program when compiled with asan #127241

zyx-billy · 2025-02-14T18:40:34Z

I'm working on adding ASan support when emitting code for AMD GPUs using the upstream LLVM backend. Right now I'm doing two things:

Add +xnack to target features
Enable AddressSanitizerPass in the llvm pipeline

And when running the program, I explicitly link the asan libraries, and enable xnack:

LD_PRELOAD=/opt/rocm/lib/llvm/lib/clang/18/lib/linux/libclang_rt.asan-x86_64.so:/opt/rocm/lib/asan/libamdhip64.so
ASAN_OPTIONS=detect_leaks=0
HSA_XNACK=1

But I run into this error when trying to invoke a kernel:

hipErrorNoBinaryForGpu (no kernel image is available for execution on the device)

Removing the asan llvm pass makes the program run fine (but of course, it won't detect any errors), indicating that everything else seems to work. I also tried linking asanrtl.bc, ocml.bc, & ockl.bc into the IR before running llvm passes (following the impl here), but got the same error.

My questions are:

Are there any other configs that should be set when compiling?
Is there any way to tell what exactly is incompatible when running? That'll help narrow down what to do.

Happy to provide more context. Thank you!

EugeneZelenko · 2025-02-14T18:44:12Z

Could you please try 19, the 20 release candidate, or the main branch?

searlmc1 · 2025-02-14T18:47:15Z

@b-sumner

b-sumner · 2025-02-14T18:57:55Z

@zyx-billy Can you dump the e_flags from the code object you are producing? What GPU and code object version are you targeting? What kind of GPU do you have on the system? You can see how to decode the flags at https://llvm.org/docs/AMDGPUUsage.html#header .

zyx-billy · 2025-02-14T19:32:02Z

Thanks for the response. Right now we're using a pretty up to date version of main (ea6827c as of 4 days ago).

Here's the gpu on our system from HSA_XNACK=1 rocminfo | grep Name:

  ...
  Name:                    gfx942                             
  Marketing Name:          AMD Instinct MI300X                
  Vendor Name:             AMD                                
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+

Our code is targeting gfx942 with code object version 5. This is the top of our amdgcn:

	.amdgcn_target "amdgcn-amd-amdhsa--gfx942:xnack+"
	.amdhsa_code_object_version 5

And the e_flags from our code object is 0x74c, which seems to be correct?

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 40 03 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            <unknown: 40>
  ABI Version:                       3
  Type:                              DYN (Shared object file)
  Machine:                           AMD GPU
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          120192 (bytes into file)
  Flags:                             0x74c
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         20
  Section header string table index: 18

BTW, in case sramecc mattered, I added it to our target features when compiling (this resulted in e_flags of 0xf4c), but it also ran into the same error.
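For anyone decoding these values by hand, here is a minimal Python sketch of the e_flags layout for code object V4 and later, based on the ELF Header section of AMDGPUUsage. The mask names follow the LLVM docs; the helper `decode_e_flags` and the one-entry `MACH_NAMES` table are hypothetical and only cover gfx942, so treat this as an illustration rather than a complete decoder.

```python
# Bit layout for code object V4 and later, as documented in the
# "ELF Header" section of AMDGPUUsage.
EF_AMDGPU_MACH = 0x0FF
EF_AMDGPU_FEATURE_XNACK_V4 = 0x300
EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xC00

# Assumption: only the processor from this thread is listed; extend as needed.
MACH_NAMES = {0x04C: "gfx942"}

# Both two-bit feature fields use the same four states.
FEATURE_STATES = ["unsupported", "any", "off", "on"]


def decode_e_flags(e_flags: int) -> dict:
    """Split an AMDGPU e_flags value into mach, xnack and sramecc fields."""
    mach = e_flags & EF_AMDGPU_MACH
    xnack = (e_flags & EF_AMDGPU_FEATURE_XNACK_V4) >> 8
    sramecc = (e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4) >> 10
    return {
        "mach": MACH_NAMES.get(mach, hex(mach)),
        "xnack": FEATURE_STATES[xnack],
        "sramecc": FEATURE_STATES[sramecc],
    }


if __name__ == "__main__":
    for flags in (0x74C, 0xF4C):
        print(hex(flags), decode_e_flags(flags))
```

Under this layout, 0x74c reads as gfx942 with xnack on and sramecc "any", and 0xf4c differs only in sramecc being on, which matches what the target features above asked for.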

One random thought, is it possible the .bc files I linked are somehow incompatible? FWIW, linking them but not running the asan pass also results in the same error (or maybe that's expected behavior). If I don't link them, and also don't run the asan pass, the program runs fine.

CRobeck · 2025-02-14T19:36:24Z

A couple thoughts about what it might be:

asanrtl.bc, since it is being pulled directly from that Triton PR, is not compiled for the correct target or comes from the wrong version of LLVM. If you did pull those .bc files directly from my Triton PR, can you instead grab them from the ROCM/LLVM asan dir?
asanrtl.bc is not being linked into the generated binary correctly (i.e. it's not being passed down to the linker correctly). You'd probably have to build and run a debug version of the runtime to see this.

Are you JIT compiling this? Do you have a .bc for the actual kernel and can you try merging/linking those bc files and see if you get a clash from llvm-as?

zyx-billy · 2025-02-14T20:15:09Z

Oh yes I got the .bc files from the rocm sdk on our system. Is it possible that's too old still? I can try a newer release.

/opt/rocm-6.2.0/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/asanrtl.bc

I'm compiling this ahead of time. Basically I have the in-memory LLVM IR of our kernel, and I used the exact same linking logic from the triton impl here on my IR module. Then I ran it through llvm optimizations and backend lowering passes to get the object file.

b-sumner · 2025-02-14T20:27:52Z

Can you see any more information about the failure when setting environment AMD_LOG_LEVEL=2?

Is the LLVM version you're using to create code objects 18 or something newer?

Can you move to ROCm 6.3.2?

CRobeck · 2025-02-14T20:33:02Z

Can you also double check the fields in the attributes of the .ll IR file? You should see something like:
attributes #0 = { ..."target-features"="+xnack" }
attributes #1 = {...sanitize_address }
you should be able to just llvm-dis your combined .bc kernel file.
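
If it helps, this check can be scripted over the disassembled .ll text. The following is a rough Python sketch, not part of any LLVM tooling: the function name `check_asan_attributes` is hypothetical, and it assumes each attribute group sits on a single line, as in the textual IR shown above.

```python
import re

# Matches one textual attribute group, e.g. attributes #0 = { ... }
ATTR_GROUP = re.compile(r"^attributes #(\d+) = \{(.*)\}\s*$", re.MULTILINE)


def check_asan_attributes(ir_text: str) -> dict:
    """Map each attribute group id to (has sanitize_address, has +xnack)."""
    result = {}
    for match in ATTR_GROUP.finditer(ir_text):
        body = match.group(2)
        has_asan = re.search(r"\bsanitize_address\b", body) is not None
        features = re.search(r'"target-features"="([^"]*)"', body)
        has_xnack = (
            features is not None and "+xnack" in features.group(1).split(",")
        )
        result[int(match.group(1))] = (has_asan, has_xnack)
    return result


if __name__ == "__main__":
    # Illustrative input only; real attribute groups carry many more entries.
    sample = (
        'attributes #0 = { nounwind sanitize_address '
        '"target-cpu"="gfx942" "target-features"="+xnack" }\n'
        "attributes #1 = { nounwind }\n"
    )
    print(check_asan_attributes(sample))
```

A kernel whose attribute group reports False for either field would be worth a closer look before suspecting the device libraries.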

zyx-billy · 2025-02-14T22:36:55Z

hmm, the only additional output I get with AMD_LOG_LEVEL=2 is

:1:hip_fatbin.cpp           :91  : 5535792822252 us: [pid:164883 tid:0x7fe5fba447c0] All Unique FDs are closed

And yes I see these attributes on our kernel in the combined IR:

attributes #0 = { ... sanitize_address "target-cpu"="gfx942" "target-features"="+xnack"}

I tried linking in the .bc files after optimization passes instead, but it didn't make a difference.

The LLVM I'm using is very recent (< 1 week old on main). I'll retry with the latest ROCm release.

b-sumner · 2025-02-14T23:47:50Z

The closer you can get the device library to the compiler you're using the better. @CRobeck and I have seen this before elsewhere, but I'm not clear on exactly what cleared it up then.

zyx-billy · 2025-02-14T23:56:57Z

Unfortunately I get the same error with 6.3.2 (and there's also no additional output under AMD_LOG_LEVEL=2 anymore). It looks like the updated asanrtl.bc library is also created with clang 18, though the contents of the library do differ.

b-sumner · 2025-02-15T00:55:26Z

Does environment LOADER_ENABLE_LOGGING=1 give any additional output?

zyx-billy · 2025-02-18T20:21:43Z

oh amazing! Just what I was looking for. It gives:

LoaderError: symbol "__oclc_ABI_version" is undefined

And indeed all I see is this in our combined IR:

@__oclc_ABI_version = external local_unnamed_addr addrspace(4) constant i32, align 4

The same goes for __oclc_ISA_version and __oclc_wavefrontsize64.

I looked around and found that these values need to be set onto the IR. When I manually added them by linking to the relevant .bc files that came with the install (e.g. oclc_abi_version_500.bc), I no longer get any errors loading the kernel! (For posterity, when lowering through the standard MLIR path, I found the LLVM/ROCDL target lowering logic that does this).
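
For anyone hitting the same loader error, a quick way to spot this before running is to scan the combined IR text for `__oclc_*` globals that are still declared `external`. Below is a minimal Python sketch; the function name `unresolved_oclc_symbols` is hypothetical, and the defined global in the sample is only an illustration of what a resolved control variable might look like, not copied from a real device library.

```python
import re

# A global that is only declared (not defined) is printed as "= external ...".
EXTERNAL_OCLC = re.compile(r"^@(__oclc_\w+)\s*=\s*external\b", re.MULTILINE)


def unresolved_oclc_symbols(ir_text: str) -> list:
    """Return the sorted names of __oclc_* globals with no definition."""
    return sorted(set(EXTERNAL_OCLC.findall(ir_text)))


if __name__ == "__main__":
    sample = (
        "@__oclc_ABI_version = external local_unnamed_addr "
        "addrspace(4) constant i32, align 4\n"
        # Illustrative definition: this one would not be reported.
        "@__oclc_wavefrontsize64 = linkonce_odr hidden local_unnamed_addr "
        "addrspace(4) constant i8 1, align 1\n"
    )
    for name in unresolved_oclc_symbols(sample):
        print("still undefined:", name)
```

Any name this reports needs its matching control bitcode file linked in (such as oclc_abi_version_500.bc for `__oclc_ABI_version`) before the code object will load.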

Testing with a correct program, it runs to completion without errors. Testing with an out-of-bounds array access, I get an asan report correctly (with debuginfo interpreted correctly). Thank you for all of your help 🙏 ! This has been immensely helpful.

zyx-billy · 2025-02-18T20:26:41Z

Oh and btw, was able to confirm this works on 6.2.0 too. Closing the issue then. Thank you! 🙏

llvmbot added the new issue label Feb 14, 2025

EugeneZelenko added the compiler-rt:asan (Address sanitizer) label and removed the new issue label Feb 14, 2025

zyx-billy closed this as completed Feb 18, 2025
