
Adapting to the system hwloc situation #73

Open

inducer opened this issue Aug 17, 2020 · 20 comments · Fixed by illinois-ceesd/mirgecom#169

inducer commented Aug 17, 2020

It seems that there are two possible situations that can give us this annoying failure:

Choose platform:
[0] <pyopencl.Platform 'Portable Computing Language' at 0x7f0979c86008>
Choice [0]:
Not enough memory to run on this device.
[1]    150038 abort (core dumped)  python wave-eager.py
  • System MPI uses libhwloc 1 (Livermore), pocl uses libhwloc 2
  • System MPI uses libhwloc 2 (Debian, e.g. Andreas's machine, scicomp cluster), pocl uses libhwloc 1

As of #71, we default to installing libhwloc 1, but that's also not a safe default. Is there something we can do to automate installing the correct libhwloc?

This snippet will get the current hwloc version, using either Python 2 or 3:

import ctypes
hwloc = ctypes.cdll.LoadLibrary("libhwloc.so.15")  # note: this soname is specific to hwloc 2.x
# https://github.com/open-mpi/hwloc/blob/master/include/hwloc.h
hwloc.hwloc_get_api_version.restype = ctypes.c_uint
hwloc.hwloc_get_api_version.argtypes = []
print(hwloc.hwloc_get_api_version() >> 16)

I'm just not sure that using this is a great idea...
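
For concreteness, here is roughly what the automated pick could look like. (This is only a sketch: the hwloc 1.x soname fallback and the final "libhwloc=N" pin are my assumptions, not anything emirge does today.)

import ctypes

def ambient_hwloc_major():
    # Try the soname used by hwloc 2.x first, then the one used by hwloc 1.x.
    for soname in ("libhwloc.so.15", "libhwloc.so.5"):
        try:
            hwloc = ctypes.cdll.LoadLibrary(soname)
        except OSError:
            continue
        hwloc.hwloc_get_api_version.restype = ctypes.c_uint
        hwloc.hwloc_get_api_version.argtypes = []
        # The major version lives in bits 16 and up of the API version.
        return hwloc.hwloc_get_api_version() >> 16
    return None

major = ambient_hwloc_major()
if major is not None:
    print("conda install libhwloc=%d" % major)  # hypothetical pin for the install script to apply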

cc @matthiasdiener

inducer commented Dec 4, 2020

I just hit this again. @matthiasdiener, what do you think of the detection solution I proposed?

@matthiasdiener

See illinois-ceesd/mirgecom#169 for another workaround.

inducer commented Mar 10, 2021

Hm. I just saw this again, and realized that my system does not benefit from illinois-ceesd/mirgecom#169. Among other things, this breaks running the meshmode tests. I don't think it's feasible to apply that workaround universally, which makes me like the hwloc-version-picking hackery idea more.

@matthiasdiener What do you think?

majosm commented Apr 8, 2021

inducer commented Apr 9, 2021

Based on @majosm's experience, this has the potential to cause spurious breakage on both Macs and Linux, except the Mac failures are even worse: all you get is a segfault, not even an error message you can search for. I don't think I'd consider this fixed, even if the import order can be used to work around it. (@majosm, could you conda install libhwloc=2 and see whether the import-order thing also applies on Macs?)

IMO, the emirge install script should make an effort to set things up so as to avoid this.

The main wrinkle with the script snippet above is that we need some way of finding the hwloc shared library.
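
One possibility might be ctypes.util.find_library (just a sketch; whether it reliably locates the MPI-provided hwloc on every platform is an open question):

import ctypes
import ctypes.util

# On Linux this typically returns a soname such as "libhwloc.so.15";
# on macOS it returns a path to libhwloc.dylib, or None if nothing is found.
name = ctypes.util.find_library("hwloc")
if name is not None:
    hwloc = ctypes.cdll.LoadLibrary(name)
    hwloc.hwloc_get_api_version.restype = ctypes.c_uint
    hwloc.hwloc_get_api_version.argtypes = []
    print(name, hwloc.hwloc_get_api_version() >> 16)
else:
    print("no hwloc found on the default search paths")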

majosm commented Apr 9, 2021

(@majosm, could you conda install libhwloc=2 and see if the import order thing even also applies on Macs?)

With conda install libhwloc=2, I get a segfault if I load pyopencl then mpi4py, but not if I load mpi4py then pyopencl.
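
Schematically (assuming "load" here just means a plain import), the order that did not crash in this test:

from mpi4py import MPI   # importing mpi4py first: no crash here
import pyopencl as cl    # importing pyopencl first and mpi4py second segfaulted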

inducer commented Apr 10, 2021

Thanks for checking! So it's probably the same bug, just with different crashes. I don't think it's plausible that we'll catch all the code where things get imported in the "wrong" order. And I can't reasonably add a warning to PyOpenCL either...

@majosm Does the above snippet (or some version of it) successfully detect the "ambient" (MPI) version of hwloc?

majosm commented Apr 19, 2021

@majosm Does the above snippet (or some version of it) successfully detect the "ambient" (MPI) version of hwloc?

Seems like it detects the conda hwloc. When I run:

import ctypes
hwloc = ctypes.cdll.LoadLibrary("libhwloc.dylib")
# https://github.com/open-mpi/hwloc/blob/master/include/hwloc.h
hwloc.hwloc_get_api_version.restype = ctypes.c_uint
hwloc.hwloc_get_api_version.argtypes = []
print(hwloc.hwloc_get_api_version() >> 16)

it prints 2. After conda install libhwloc=1, it prints 1.

inducer commented Apr 19, 2021

But what does it do prior to any conda env being active? The whole point would be for it to detect what hwloc exists in the environment and adapt to that, to avoid conflicts.

majosm commented Apr 19, 2021

After initializing conda (base env only) but prior to conda activate <env-name>, I still get 2. If I don't activate conda at all, I get:

  File "find_hwloc.py", line 2, in <module>
    hwloc = ctypes.cdll.LoadLibrary("libhwloc.dylib")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ctypes/__init__.py", line 444, in LoadLibrary
    return self._dlltype(name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ctypes/__init__.py", line 366, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(libhwloc.dylib, 6): no suitable image found.  Did find:
	file system relative paths not allowed in hardened programs

inducer commented Apr 19, 2021

Huh. I'm guessing that's the system Python not allowing you to load dynamic libraries? That's annoying.

inducer commented Apr 19, 2021

I guess that's not a big issue: we could conceivably just run that code in the conda base environment (right after the miniconda install completes). FWIW, "2" is the answer we want in that case, right?

majosm commented Apr 19, 2021

I guess that's not a big issue: We could conceivably just run that code in the conda base environment (right after miniconda completes). FWIW, "2" is the answer we want in that case, right?

Unfortunately not. 🙁 System hwloc is v1.

Edit: It's actually the spack hwloc. Let me try the last experiment again; I might not have had my spack packages loaded when I tried.

Edit 2: Same result with the spack packages loaded. Same result for the system Python, but with the base conda env I now get 1. Interesting.

inducer commented Apr 19, 2021

but with the base conda env I now get 1. Interesting.

Despite the spack hwloc being v2?

majosm commented Apr 19, 2021

but with the base conda env I now get 1. Interesting.

Despite the spack hwloc being v2?

Spack hwloc is v1 (my MPI is installed via spack and uses that version). Conda's hwloc is v2.

@matthiasdiener

As another (easier?) workaround, could we default to installing mpich instead of openmpi? mpich does not depend on hwloc, and it might be preferable in any case (see e.g. illinois-ceesd/mirgecom@7a07799)

inducer commented Apr 19, 2021

AFAIK, we're not installing any MPI implementation ATM, and that seems like the right approach. (i.e. use whatever mpicc is)

matthiasdiener commented Apr 19, 2021

AFAIK, we're not installing any MPI implementation ATM, and that seems like the right approach. (i.e. use whatever mpicc is)

I mean recommending that people install mpich instead of openmpi on their machines (as well as in CI etc.).

Edit: This is based on my belief that the root of the issue is that mpi4py gets built against some system MPI and pulls in whatever hwloc version openmpi (or its derivatives) was built against.
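
For what it's worth, a quick way to check that on Linux (on macOS, otool -L would replace ldd):

import subprocess
import mpi4py.MPI

# List the shared libraries the mpi4py extension actually pulls in;
# look for a libhwloc entry (it may come in transitively via the MPI libraries).
print(subprocess.run(["ldd", mpi4py.MPI.__file__],
                     capture_output=True, text=True).stdout)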

isuruf commented May 24, 2021

I think this issue only happens if libhwloc=2 gets loaded before libhwloc=1. If you always install pocl with libhwloc=1 and load it first (by running cl.get_platforms()), this should go away.
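
i.e., roughly this at the very top of the program, before anything touches MPI (a sketch, assuming pocl comes from conda and is built against libhwloc=1):

import pyopencl as cl
cl.get_platforms()       # force pocl, and hence its libhwloc 1, to be loaded first

from mpi4py import MPI   # the MPI-linked hwloc only comes in afterwards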

inducer commented May 24, 2021

I'm not sure we have reliable control over what gets loaded first though.
