Bug when crawling the entire process #1

@mawad-amd

If we don't add code objects on demand and instead let KernelDB crawl the entire process, PyTorch fails during device initialization with the following error:

Traceback (most recent call last):
  File "/work1/amd/muhaawad/git/amd/audacious/maestro/examples/python/add.py", line 13, in <module>
    A = torch.randn(N, device=device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: std::bad_alloc

Code

#!/usr/bin/env python3

import torch

# Ensure we're using ROCm and the GPU
assert torch.version.hip is not None, "This script requires ROCm."
device = torch.device("cuda")

# Define vector size
N = 1024

# Initialize vectors A and B
A = torch.randn(N, device=device)
B = torch.randn(N, device=device)

# Perform vector addition: C = A + B
C = A + B

# Optional: verify on CPU
A_cpu = A.cpu()
B_cpu = B.cpu()
C_ref = A_cpu + B_cpu

assert torch.allclose(C.cpu(), C_ref, atol=1e-5)

print("Vector addition completed successfully on ROCm GPU.")

Log

[INFO]: [src/nexus.cpp:81] NEXUS_PIPE_NAME is not set. Set it to communicate with driver script.
Adding /usr/bin/python3.10
Adding linux-vdso.so.1
Adding /lib/x86_64-linux-gnu/libm.so.6
Adding /lib/x86_64-linux-gnu/libexpat.so.1
Adding /lib/x86_64-linux-gnu/libz.so.1
Adding /lib/x86_64-linux-gnu/libc.so.6
Adding /lib64/ld-linux-x86-64.so.2
Adding /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so
Adding /lib/x86_64-linux-gnu/libffi.so.8
Adding /usr/lib/python3.10/lib-dynload/_opcode.cpython-310-x86_64-linux-gnu.so
Adding /usr/lib/python3.10/lib-dynload/_bz2.cpython-310-x86_64-linux-gnu.so
Adding /lib/x86_64-linux-gnu/libbz2.so.1.0
Adding /usr/lib/python3.10/lib-dynload/_lzma.cpython-310-x86_64-linux-gnu.so
Adding /lib/x86_64-linux-gnu/liblzma.so.5
Adding /usr/lib/python3.10/lib-dynload/_json.cpython-310-x86_64-linux-gnu.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_global_deps.so
Adding /lib/x86_64-linux-gnu/libpthread.so.0
Adding /lib/x86_64-linux-gnu/libdl.so.2
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libgomp.so
Adding /usr/local/lib/python3.10/dist-packages/torch/_C.cpython-310-x86_64-linux-gnu.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libshm.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libroctx64.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_hip.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libMIOpen.so
You're adding kernel ".text" which we've seen before. Something may be wrong.
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libhipblaslt.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libamdhip64.so
Adding /lib/x86_64-linux-gnu/libstdc++.so.6
Adding /lib/x86_64-linux-gnu/libgcc_s.so.1
Adding /lib/x86_64-linux-gnu/librt.so.1
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libroctracer64.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libhiprtc.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libhipblas.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libhipfft.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libhiprand.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libhipsparse.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libhipsolver.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libaotriton_v2.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librccl.so
You're adding kernel ".text" which we've seen before. Something may be wrong.
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libmagma.so
Adding /lib/x86_64-linux-gnu/libzstd.so.1
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libamd_comgr.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librocm-core.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librocblas.so
You're adding kernel ".text" which we've seen before. Something may be wrong.
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libnuma.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librocprofiler-register.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libhsa-runtime64.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librocsolver.so
You're adding kernel ".text" which we've seen before. Something may be wrong.
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librocfft.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librocrand.so
You're adding kernel ".text" which we've seen before. Something may be wrong.
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librocsparse.so
You're adding kernel ".text" which we've seen before. Something may be wrong.
You're adding kernel "init_kernel()" which we've seen before. Something may be wrong.
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libsuitesparseconfig.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libcholmod.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librocm_smi64.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libtinfo.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libelf.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libdrm.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libdrm_amdgpu.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libsatlas.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libamd.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libcamd.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libcolamd.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libccolamd.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libgfortran.so
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libquadmath.so
Adding /usr/local/lib/python3.10/dist-packages/numpy/_core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
Adding /usr/local/lib/python3.10/dist-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-6bb31eeb.so
Adding /usr/local/lib/python3.10/dist-packages/numpy/_core/../../numpy.libs/libgfortran-040039e1-0352e75f.so.5.0.0
Adding /usr/local/lib/python3.10/dist-packages/numpy/_core/../../numpy.libs/libquadmath-96973f99-934c22de.so.0.0.0
Adding /usr/lib/python3.10/lib-dynload/_contextvars.cpython-310-x86_64-linux-gnu.so
Adding /usr/local/lib/python3.10/dist-packages/numpy/linalg/_umath_linalg.cpython-310-x86_64-linux-gnu.so
Adding /usr/lib/python3.10/lib-dynload/mmap.cpython-310-x86_64-linux-gnu.so
Adding /usr/local/lib/python3.10/dist-packages/amdsmi/libamd_smi.so
Adding /usr/lib/python3.10/lib-dynload/_ssl.cpython-310-x86_64-linux-gnu.so
Adding /lib/x86_64-linux-gnu/libssl.so.3
Adding /lib/x86_64-linux-gnu/libcrypto.so.3
Adding /usr/lib/python3.10/lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so
Adding /usr/lib/python3.10/lib-dynload/_queue.cpython-310-x86_64-linux-gnu.so
Adding /usr/lib/python3.10/lib-dynload/_hashlib.cpython-310-x86_64-linux-gnu.so
Adding /usr/lib/python3.10/lib-dynload/_uuid.cpython-310-x86_64-linux-gnu.so
Adding /lib/x86_64-linux-gnu/libuuid.so.1
Adding /usr/lib/python3.10/lib-dynload/_multiprocessing.cpython-310-x86_64-linux-gnu.so
Adding /work1/amd/muhaawad/git/amd/audacious/maestro/external/nexus/build/lib/libnexus.so
Adding /work1/amd/muhaawad/git/amd/audacious/maestro/external/nexus/build/_deps/kerneldb-build/libkernelDB64.so.1
[INFO]: [src/nexus.cpp:101] Found 2599 kernels
[INFO]: [src/nexus.cpp:107] Kernel: .text
[INFO]: [src/nexus.cpp:108] Number of lines: 0
[INFO]: [src/nexus.cpp:107] Kernel: BytePack<4> Apply_Reduce<FuncMinMax<rccl_bfloat8>, 4>::reduce<4>(FuncMinMax<rccl_bfloat8>, BytePack<4>, BytePack<4>)
[INFO]: [src/nexus.cpp:108] Number of lines: 2
[INFO]: [src/nexus.cpp:113] hipify/src/device/reduce_kernel.h:149 ->   buffer_store_dword v11, off, s[0:3], s32                   // 0000002758C4: E0700000 20000B00 
[INFO]: [src/nexus.cpp:113] hipify/src/device/reduce_kernel.h:152 ->   buffer_load_dword v11, off, s[0:3], s32                    // 000000277118: E0500000 20000B00 
Ending kernelDB
Found 2599 kernels.
Traceback (most recent call last):
  File "/work1/amd/muhaawad/git/amd/audacious/maestro/examples/python/add.py", line 13, in <module>
    A = torch.randn(N, device=device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: std::bad_alloc
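
For triage, the duplicate-kernel warnings in the log above can be grouped by the library that was being added when they appeared. The following is a minimal sketch, assuming each warning belongs to the most recent `Adding ...` line (the log itself does not confirm this ordering). `duplicate_kernels_by_library` is a hypothetical helper written for this issue, not part of Nexus or KernelDB, and the sample text is a short excerpt copied from the log.

```python
import re
from collections import defaultdict

# Line shapes copied from the log in this issue.
ADDING = re.compile(r"^Adding (?P<path>\S+)$")
DUPLICATE = re.compile(
    r'''^You're adding kernel "(?P<kernel>.+)" which we've seen before'''
)


def duplicate_kernels_by_library(log_text):
    """Group duplicate-kernel warnings under the preceding 'Adding' line.

    Assumption: a warning refers to the library named by the most recent
    'Adding <path>' line. Warnings seen before any 'Adding' line are ignored.
    """
    duplicates = defaultdict(list)
    current_library = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        added = ADDING.match(line)
        if added:
            current_library = added.group("path")
            continue
        warned = DUPLICATE.match(line)
        if warned and current_library is not None:
            duplicates[current_library].append(warned.group("kernel"))
    return dict(duplicates)


# Excerpt from the log above.
sample_log = """\
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librocrand.so
You're adding kernel ".text" which we've seen before. Something may be wrong.
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/librocsparse.so
You're adding kernel ".text" which we've seen before. Something may be wrong.
You're adding kernel "init_kernel()" which we've seen before. Something may be wrong.
Adding /usr/local/lib/python3.10/dist-packages/torch/lib/libsuitesparseconfig.so
"""

if __name__ == "__main__":
    for library, kernels in duplicate_kernels_by_library(sample_log).items():
        print(f"{library}: {kernels}")
```

Running this over the full log would show which libraries contribute the repeated `.text` entries, which may help decide whether those entries are related to the `std::bad_alloc` or are a separate problem.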
 
