ucx_perftest binary missing linking information #12

Open
jameslamb opened this issue Oct 18, 2024 · 1 comment

jameslamb commented Oct 18, 2024

Description

UCX provides a CLI, ucx_perftest, for running performance tests (example from UCX docs).

While investigating rapidsai/ucx-py#1072, @pentschev attempted to use that tool bundled in the wheels produced here, and found that it segfaulted immediately. The root cause looked to be missing linking information.

In #11, removing this invocation of auditwheel repair appeared to leave that linking in place:

python -m auditwheel repair \
    -w ${package_dir}/final_dist \
    --exclude "libcuda.so.1" \
    --exclude "libnvidia-ml.so.1" \
    --exclude "libucm.so.0" \
    --exclude "libuct.so.0" \
    --exclude "libucs.so.0" \
    --exclude "libucp.so.0" \
    ${package_dir}/dist/*

And that change alone allowed ucx_perftest to execute successfully 🎉

That should be investigated, and changes might be required for the build here.

Reproducible Example

On an x86_64 system with CUDA 12.2:

pip install 'libucx-cu12==1.17.0'
SITE_PACKAGES=$(python -c "import site; print(site.getsitepackages()[0])")

${SITE_PACKAGES}/libucx/bin/ucx_perftest
# Segmentation fault (core dumped)

ldd "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# (empty)

Notes

Some relevant notes in the OpenUCX docs:

jameslamb added the bug label Oct 18, 2024
jameslamb self-assigned this Oct 18, 2024
@jameslamb

Alright, I don't have a full explanation and suggested fix yet, but stopping to put up some notes.

So I can see here, interactively, that the linking information looks correct at the end of the wheel build and ucx_perftest appears to be doing what we want. Something auditwheel repair is doing is leaving it in a bad state (just as @pentschev found on #11).

Full code to build and unpack the wheel:
# get auditwheel source (used later in debugging)
git clone \
    git@github.com:pypa/auditwheel.git \
    ./auditwheel-src

docker run \
    --rm \
    -v $(pwd):/opt/work \
    -w /opt/work \
    -it rapidsai/ci-wheel:cuda12.2.2-rockylinux8-py3.11 \
    bash

rm -rf ./dist
rm -rf ./final_dist
rm -rf ./unzipped-contents
rm -rf ./unzipped-post-auditwheel

pip uninstall --yes auditwheel
pip install -e ./auditwheel-src

# move to a different directory not mounted in, to avoid those annoying docker 'permission denied'
# issues when files are changed by the build process
cp -R $(pwd) /tmp/ucx-wheels
cd /tmp/ucx-wheels/python/libucx

python -m pip wheel \
    -w dist \
    -v \
    --no-deps \
    --disable-pip-version-check \
    .

mkdir -p ./unzipped-contents
unzip \
    ./dist/libucx*.whl \
    -d ./unzipped-contents

ldd ./unzipped-contents/libucx/bin/ucx_perftest
ldd output:
        linux-vdso.so.1 (0x00007f922817b000)
        libucp.so.0 => /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib/libucp.so.0 (0x00007f922807f000)
        libuct.so.0 => /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib/libuct.so.0 (0x00007f9228036000)
        libucs.so.0 => /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib/libucs.so.0 (0x00007f9227fbd000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f9227bcb000)
        libucm.so.0 => /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib/libucm.so.0 (0x00007f9227f97000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f92279c7000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f92277a7000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f922759f000)
        libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f9227367000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f9226f91000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9227f4d000)
objdump -x  ./unzipped-contents/libucx/bin/ucx_perftest | grep PATH
#   RUNPATH              /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib
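
The RPATH / RUNPATH check above can also be scripted, which helps when comparing the binary before and after auditwheel repair. A minimal sketch, assuming `objdump -x` prints dynamic-section entries as whitespace-separated `NAME value` rows like the output above; `parse_search_paths` is a hypothetical helper name, not part of any tool mentioned here.

```python
def parse_search_paths(objdump_output: str) -> dict[str, list[str]]:
    """Collect RPATH / RUNPATH entries from the text printed by `objdump -x`."""
    found: dict[str, list[str]] = {}
    for line in objdump_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in ("RPATH", "RUNPATH"):
            # Entries are colon-separated, like PATH.
            found.setdefault(parts[0], []).extend(parts[1].strip().split(":"))
    return found


# Sample text shaped like the objdump output shown above.
sample = (
    "  NEEDED               libucp.so.0\n"
    "  RUNPATH              /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib\n"
)
print(parse_search_paths(sample))
```

In practice the input would come from something like `subprocess.check_output(["objdump", "-x", path], text=True)`.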

And ucx_perftest appears to load and run successfully.

./unzipped-contents/libucx/bin/ucx_perftest

# [1729282637.710255] [ea8c4a832eab:23962:0]        perftest.c:793  UCX  WARN  CPU affinity is not set (bound to 80 cpus). 
# Performance may be impacted.
# Waiting for connection...

I installed auditwheel in editable mode so I could fiddle around with it (dropping into a debugger, adding print statements, etc.). I patched it to print out every system command it's running.

That patch:
diff --git a/src/auditwheel/patcher.py b/src/auditwheel/patcher.py
index 67367c9..1baca3c 100644
--- a/src/auditwheel/patcher.py
+++ b/src/auditwheel/patcher.py
@@ -3,7 +3,13 @@ from __future__ import annotations
 import re
 from itertools import chain
 from shutil import which
-from subprocess import CalledProcessError, check_call, check_output
+from subprocess import CalledProcessError, check_call as subpr_check_call, check_output
+
+
+def check_call(args: list):
+    arg_str = " ".join(args)
+    print(f"(command) '{arg_str}'")
+    subpr_check_call(args)
 
 
 class ElfPatcher:
diff --git a/src/auditwheel/repair.py b/src/auditwheel/repair.py
index 85e3ca3..0723c6b 100644
--- a/src/auditwheel/repair.py
+++ b/src/auditwheel/repair.py
@@ -10,7 +10,7 @@ import stat
 from os.path import abspath, basename, dirname, exists, isabs
 from os.path import join as pjoin
 from pathlib import Path
-from subprocess import check_call
+from subprocess import check_call as subpr_check_call
 from typing import Iterable
 
 from auditwheel.patcher import ElfPatcher
@@ -23,6 +23,10 @@ from .wheeltools import InWheelCtx, add_platforms
 
 logger = logging.getLogger(__name__)
 
+def check_call(args: list):
+    arg_str = " ".join(args)
+    print(f"(command) '{arg_str}'")
+    subpr_check_call(args)
 
 # Copied from wheel 0.31.1
 WHEEL_INFO_RE = re.compile(
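
For anyone repeating this, the tracing wrapper in that patch can be written once and reused. A minimal sketch of the same idea, not the patch itself: `traced_check_call` is a hypothetical name, and unlike the patch it quotes arguments with `shlex.join` (so a value like `$ORIGIN/...` is printed in copy-pasteable form), forwards keyword arguments, and returns the exit code.

```python
import shlex
import subprocess
import sys


def traced_check_call(args: list[str], **kwargs) -> int:
    """Print a command in shell-quoted form, then run it with check_call."""
    print(f"(command) {shlex.join(args)}", flush=True)
    return subprocess.check_call(args, **kwargs)


# Harmless demo command: run the current interpreter and do nothing.
traced_check_call([sys.executable, "-c", "pass"])
```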

Then I ran it just as it's run in CI, but redirected the output to a file.

Code to do that:
python -m auditwheel -vvv repair \
    -w final_dist \
    --exclude "libcuda.so.1" \
    --exclude "libnvidia-ml.so.1" \
    --exclude "libucm.so.0" \
    --exclude "libuct.so.0" \
    --exclude "libucs.so.0" \
    --exclude "libucp.so.0" \
    dist/* \
> /opt/work/auditwheel.txt 2>&1

From that, I see that auditwheel repair is running the following:

patchelf --set-soname libgomp-24e2ab19.so.1.0.0 libucx_cu12.libs/libgomp-24e2ab19.so.1.0.0
patchelf --replace-needed libgomp.so.1 libgomp-24e2ab19.so.1.0.0 libucx/bin/ucx_perftest
patchelf --remove-rpath /tmp/tmp8v5ujsmi/libucx/bin/ucx_perftest
patchelf --force-rpath --set-rpath $ORIGIN/../../libucx_cu12.libs /tmp/tmp8v5ujsmi/libucx/bin/ucx_perftest

Which then leaves ucx_perftest looking like this:

pip install ./final_dist/*.whl
SITE_PACKAGES=$(python -c "import site; print(site.getsitepackages()[0])")
objdump -x "${SITE_PACKAGES}/libucx/bin/ucx_perftest" | grep PATH
#   RPATH                $ORIGIN/../../libucx_cu12.libs

Notice that libucx_cu12.libs? That's a problem: that directory doesn't exist.
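
To make the mismatch concrete, here is where that RPATH lands once `$ORIGIN` is expanded to the directory containing the binary. A minimal sketch: the `/opt/venv/...` install location is hypothetical, `resolve_origin` is an illustrative helper, and the expansion is purely textual (no symlink handling).

```python
import posixpath


def resolve_origin(rpath_entry: str, binary_path: str) -> str:
    """Expand $ORIGIN to the directory holding the binary, then normalize."""
    origin = posixpath.dirname(binary_path)
    return posixpath.normpath(rpath_entry.replace("$ORIGIN", origin))


# Hypothetical install location, for illustration only.
site_packages = "/opt/venv/lib/python3.11/site-packages"
binary = f"{site_packages}/libucx/bin/ucx_perftest"

# What auditwheel wrote:
print(resolve_origin("$ORIGIN/../../libucx_cu12.libs", binary))
# Where this build installs the UCX libraries (libucx/lib):
print(resolve_origin("$ORIGIN/../lib", binary))
```

The first call resolves to a sibling of libucx/ in site-packages/, not to libucx/lib.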

That's a default from auditwheel. The default settings for auditwheel repair assume that all the shared libraries included in the wheel will be at {distribution_name}.libs.

It's possible to change the .libs part to something else via the -L / --lib-sdir argument, but not the {distribution_name}: that's parsed directly from the wheel's filename.

Like this:

match = WHEEL_INFO_RE(wheel_fname)
dest_dir = match.group("name") + lib_sdir

(auditwheel code link)
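
A runnable illustration of that naming rule. This is a sketch under assumptions: `SIMPLE_WHEEL_RE` is a simplified stand-in written for this example (not auditwheel's actual `WHEEL_INFO_RE`), `libs_dir_for` is a hypothetical helper, and the wheel filename is made up.

```python
import re

# Simplified stand-in for auditwheel's regex (an assumption, not a copy):
# it only separates the distribution name from the version and remaining tags.
SIMPLE_WHEEL_RE = re.compile(r"^(?P<name>.+?)-(?P<ver>\d[^-]*)-.+\.whl$")


def libs_dir_for(wheel_fname: str, lib_sdir: str = ".libs") -> str:
    """Return the directory name derived from the wheel filename plus lib_sdir."""
    match = SIMPLE_WHEEL_RE.match(wheel_fname)
    if match is None:
        raise ValueError(f"not a wheel filename: {wheel_fname!r}")
    return match.group("name") + lib_sdir


# Hypothetical filename; the real platform tag may differ.
wheel = "libucx_cu12-1.17.0-py3-none-manylinux_2_28_x86_64.whl"
print(libs_dir_for(wheel))
print(libs_dir_for(wheel, lib_sdir="_vendored"))
```

Changing `lib_sdir` only changes the suffix; the `libucx_cu12` prefix always comes from the filename.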

We want the directory in site-packages/ to always be libucx/ (no CUDA suffix) regardless of whether libucx-cu11 or libucx-cu12 was installed, so downstream users like ucxx can unconditionally do something like this:

import libucx
libucx.load_library()

(ucxx code link)

That's customized here:

install_prefix = os.path.abspath(os.path.join(self.build_lib, "libucx"))

subprocess.run(["./autogen.sh"])
subprocess.run(
    [
        "./contrib/configure-release",
        f"--prefix={install_prefix}",

We have, for example, wheels here called libucx-cu12 (normalized to libucx_cu12 in site-packages/) which populate site-packages/libucx when installed.

I tried patching that installed CLI after the fact, but that did not work.

patchelf --print-rpath "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# $ORIGIN/../../libucx_cu12.libs

patchelf --remove-rpath "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
patchelf --print-rpath "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# (empty)

patchelf --force-rpath --set-rpath '$ORIGIN/../lib' "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# Assertion failed: splitIndex != -1 (patchelf.cc: shiftFile: 504)
# Aborted (core dumped)

patchelf --set-rpath '$ORIGIN/../lib' "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# Assertion failed: splitIndex != -1 (patchelf.cc: shiftFile: 504)
# Aborted (core dumped)

patchelf --add-rpath '$ORIGIN/../../lib' "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# Assertion failed: splitIndex != -1 (patchelf.cc: shiftFile: 504)
# Aborted (core dumped)
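
Separately from the patchelf crash, the `$ORIGIN`-relative value itself can be derived from the layout instead of typed by hand. A minimal sketch, assuming the binary sits in libucx/bin/ and the libraries in libucx/lib/; the `/sp` prefix is a hypothetical stand-in for site-packages and `origin_relative_rpath` is an illustrative helper. It does nothing about the assertion failure above.

```python
import posixpath


def origin_relative_rpath(binary_path: str, lib_dir: str) -> str:
    """Build an $ORIGIN-relative RPATH entry pointing from the binary to lib_dir."""
    bin_dir = posixpath.dirname(binary_path)
    return "$ORIGIN/" + posixpath.relpath(lib_dir, start=bin_dir)


# "/sp" stands in for site-packages (hypothetical).
print(origin_relative_rpath("/sp/libucx/bin/ucx_perftest", "/sp/libucx/lib"))
```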

And that's where I'm stuck right now. ucx_perftest absolutely is an ELF-format binary, so I'm not sure why even patchelf is crashing 😬

file  "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
/pyenv/versions/3.11.10/lib/python3.11/site-packages/libucx/bin/ucx_perftest:
ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, 
BuildID[sha1]=95a1fecd0e621296d4ab577c75fd66c34c8138d5,
for GNU/Linux 3.2.0, with debug_info, not stripped
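
The same sanity check can be done without `file` by looking at the first four bytes. A minimal sketch that tests only for the ELF magic number; it says nothing about whether the section or segment layout is something patchelf can handle. `is_elf` is a hypothetical helper name.

```python
import sys


def is_elf(path: str) -> bool:
    """Return True if the file starts with the ELF magic bytes 0x7f 'E' 'L' 'F'."""
    with open(path, "rb") as f:
        return f.read(4) == b"\x7fELF"


# Demo on the running interpreter; the result depends on the platform.
print(is_elf(sys.executable))
```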

I've attached the full auditwheel logs here (as a file attachment, because it's large):

auditwheel-logs.txt
