add option to use precompiled headers for faster builds #100

Merged · 1 commit merged into block-hczhai:master on Jul 5, 2024

Conversation

@chillenb (Contributor) commented on Jul 4, 2024

This PR does three things:

  1. The executable, shared library, and Python module are now different CMake targets. Name conflicts are avoided by setting the OUTPUT_NAME property to the correct value (a rough sketch follows this list). Currently, this should not change anything for users (if it does, that's a mistake). However, with a few more changes to CMakeLists.txt, it should be possible to build multiple targets at once (e.g. the Python module and executable together).

  2. An option (-DUSE_PCH=ON) is added to precompile the headers in src/instantiation/*.hpp. Precompiling these headers can make builds faster if the user has a compiler that supports it. It's disabled by default.

  3. In setup.py, the default number of build processes is still 2, but the user can override it by setting the environment variable CMAKE_BUILD_PARALLEL_LEVEL.
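
A rough sketch of the multi-target idea in item 1. The target names, the placeholder source list, and the pybind11 call are illustrative assumptions, not the exact contents of this PR's CMakeLists.txt:

```cmake
# Illustrative sketch only -- names and sources are placeholders.
cmake_minimum_required(VERSION 3.15)
project(block2_sketch CXX)
find_package(pybind11 REQUIRED)

set(BLOCK2_SOURCES src/core/example1.cpp src/dmrg/example2.cpp)  # placeholder list

# Executable target; the binary is still called "block2".
add_executable(block2_exe src/main.cpp ${BLOCK2_SOURCES})
set_target_properties(block2_exe PROPERTIES OUTPUT_NAME "block2")

# Shared library target; produces libblock2.so, so it does not clash with the executable.
add_library(block2_shared SHARED ${BLOCK2_SOURCES})
set_target_properties(block2_shared PROPERTIES OUTPUT_NAME "block2")

# Python module target (assuming pybind11 bindings); produces block2.cpython-*.so.
pybind11_add_module(block2_py src/pybind/main.cpp ${BLOCK2_SOURCES})
set_target_properties(block2_py PROPERTIES OUTPUT_NAME "block2")
```

With separate targets like these, `cmake --build . --target block2_py` builds only the Python module, while building all three in the same tree shares a single configure step.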

Let me know if there are points I overlooked which need to be addressed, or if you have other suggestions / concerns!

@hczhai (Contributor) commented on Jul 4, 2024

Hi, thanks for the contribution.

block2 is a relatively mature code, so it is important to make changes only when necessary and to keep backward compatibility, even when our dependencies break it.

> However, with a few more changes to CMakeLists.txt, it should be possible to build multiple targets at once

The changes look OK. Is this really necessary (namely, do we have a user who will need two different executables/libraries for their applications)? Will this be more advantageous than just running cmake twice? The current solution works perfectly for my purpose.

> Precompiling these headers can make builds faster if the user has a compiler that supports it. It's disabled by default.

  1. Precompiling the headers may affect the parallelization of the jobs.
  2. A long time ago, my tests showed no big speed improvement from precompiling headers. If you observed any improvement due to this change, please let me know.
  3. target_precompile_headers requires CMake 3.16 (a possible version guard is sketched below).
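
A minimal sketch of how such an option could be guarded, assuming a placeholder target name and glob (not necessarily how this PR implements it):

```cmake
# Sketch of a guarded USE_PCH option; "block2_py" and the glob are placeholders.
option(USE_PCH "Precompile headers in src/instantiation to speed up builds" OFF)

if (USE_PCH)
    if (CMAKE_VERSION VERSION_LESS 3.16)
        message(FATAL_ERROR "USE_PCH requires CMake >= 3.16 (target_precompile_headers).")
    endif()
    file(GLOB PCH_HEADERS "${CMAKE_CURRENT_SOURCE_DIR}/src/instantiation/*.hpp")
    # The PCH compiles as a single job up front; the remaining translation
    # units still build in parallel once it exists.
    target_precompile_headers(block2_py PRIVATE ${PCH_HEADERS})
endif()
```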

> In setup.py, the default number of build processes is still 2, but the user can override it by setting the environment variable CMAKE_BUILD_PARALLEL_LEVEL.

This seems to be okay at the moment, but are we promoting a good practice or a bad practice by doing this? From the users' perspective, pip install block2 can take seconds (if the OS and Python version are supported) or hours (if it grabs the source code and tries to compile it). The second situation is disfavored, because block2 is a C++-heavy project with many compile options, and there is no easy way to adjust these options through setup.py. Yes, you can add this environment variable for the parallel level, but what about other CMake options? From the "saving compiling time" perspective, disabling options like USE_SINGLE_PREC and USE_COMPLEX should always be more practical than changing the parallel level.

If you look at the documentation of block2, the suggested way for "manual installation" is to create a "build" directory and run cmake manually; setup.py is deliberately circumvented. In that part of the documentation, unnecessary options like USE_SINGLE_PREC and USE_COMPLEX are simply left off. Should this approach (calling cmake directly) be recommended to users, to save them time debugging cmake output and hunting for the location of cmake temporary objects?

These are just based on my experience. I am also interested in your experience and the motivation behind these changes.

@chillenb (Contributor, Author) commented on Jul 5, 2024

> block2 is a relatively mature code, so it is important to make changes only when necessary and to keep backward compatibility, even when our dependencies break it.

🫡

> The changes look OK. Is this really necessary (namely, do we have a user who will need two different executables/libraries for their applications)? Will this be more advantageous than just running cmake twice? The current solution works perfectly for my purpose.

It's not really necessary, but it would improve build efficiency by reusing object files. If you build the Python module, you can also build the executable and shared library almost for free. I've experimented with this on this branch. Such a feature would be mainly useful for developers and downstream packagers, etc., who might want to build an executable and a shared library at the same time. If you don't see many concrete advantages coming from this, I can take it out.

> 1. Precompiling the headers may affect the parallelization of the jobs.
> 2. A long time ago, my tests showed no big speed improvement from precompiling headers. If you observed any improvement due to this change, please let me know.

Then I have some encouraging numbers for you! The following benchmarks were collected on a node with 48 cores (Cascade Lake), using GCC 12.3.0. Perhaps GCC has made some improvements recently.

Building with default settings and -j 48:

| Build target | Precompiled headers | user (s) | system (s) | elapsed | CPU |
| --- | --- | --- | --- | --- | --- |
| Python extension | no | 21704.06 | 670.18 | 8:15.60 | 4514% |
| Python extension | yes | 16437.14 | 530.60 | 6:43.73 | 4202% |
| Executable | no | 11746.45 | 371.19 | 4:37.11 | 4372% |
| Executable | yes | 8045.66 | 286.01 | 3:38.92 | 3805% |

Pre-compiling the headers takes under 10 seconds at the beginning of the build, but the total build time for the Python extension is reduced by 1 min 30 s (about 18%).

There was also a noticeable reduction in the peak memory usage of roughly 10-20%. These improvements are not huge, but they could still be considered significant.

> This seems to be okay at the moment, but are we promoting a good practice or a bad practice by doing this? From the users' perspective, pip install block2 can take seconds (if the OS and Python version are supported) or hours (if it grabs the source code and tries to compile it). The second situation is disfavored, because block2 is a C++-heavy project with many compile options, and there is no easy way to adjust these options through setup.py. Yes, you can add this environment variable for the parallel level, but what about other CMake options? From the "saving compiling time" perspective, disabling options like USE_SINGLE_PREC and USE_COMPLEX should always be more practical than changing the parallel level.

To be a bit clearer, here's the situation I am actually facing: Let's say I want to build a wheel against the school cluster's MPI installation so that some colleagues can install it in their Python environments. If I make a wheel or a conda package, I can use auditwheel or conda-build to make sure it's relocatable. It's not as easy to distribute a custom block2 if I build it manually with CMake: I have to run patchelf so that the Python extension can link to whatever MKL is in my colleagues' environments, give instructions about PYTHONPATH, and so on.

In a case like this, it would be more convenient to edit setup.py and then build the wheel. The benefit of the proposed change is that it'd save a tiny amount of typing if the --jobs=2 could be left in place 😆. As you pointed out, other CMake configurations would still have to be added manually; there's no real easy way out of that.

A more straightforward example: if somebody wants to build block2 as a conda package with a number of build processes other than 2 (perhaps the number is set by some CI provider), they would be able to do so without patching setup.py.

I guess it is worth mentioning that adding scikit-build-core as an alternate build backend could help with this. It supports passing additional CMake options on the command line: -Ccmake.define.<OPTION_NAME>=<value>. I have a rough implementation of the backend-switching code and can share it on the other PR after cleaning it up.

@hczhai (Contributor) commented on Jul 5, 2024

Thanks for sharing your ideas and motivations. These are all very helpful.

> Such a feature would be mainly useful for developers and downstream packagers, etc.

Most of the time, as a developer, I will only debug with one kind of interface (because it does not make sense to write the test file in two languages for debugging, compile two libraries, change two lines in two test files, and then recompile the two libraries). Since your new branch has to introduce a new CMake option, which breaks backward compatibility, and the length of CMakeLists.txt increases due to this change, the complexity increases. I prefer not to add this feature.

> There was also a noticeable reduction in the peak memory usage of roughly 10-20%. These improvements are not huge, but they could still be considered significant.

These numbers are indeed impressive. As this change does not have the backward-compatibility problem, is well decoupled from other things, and alleviates a real problem, I am happy to accept it (as long as it does not create problems for non-GCC compilers).

> In a case like this, it would be more convenient to edit setup.py and then build the wheel.

Yes, what you described is a valid use case, and it is the same as what I do in GitHub Actions: https://github.com/block-hczhai/block2-preview/blob/master/.github/workflows/build.yml#L270-L288. What I do there is exactly that: directly editing setup.py.

> The benefit of the proposed change is that it'd save a tiny amount of typing if the --jobs=2 could be left in place 😆.

For this I do not get your point. With the current setup.py, I just need to run `sed -i "s/jobs=2/jobs=4/" setup.py`, which is 34 characters. With your change you have to do `export CMAKE_BUILD_PARALLEL_LEVEL=4`, which is 35 characters, so it does not save any typing. The user has to read a longer setup.py and remember the name of the environment variable. If there is a typo in CMAKE_BUILD_PARALLEL_LEVEL, it will be a silent bug.

> there's no real easy way out of that.

sed is just the easy way to do this. It does not make setup.py tediously long with a list of environment variables. It does not need any dependencies. And it allows you to change whatever CMake options you like without even thinking about "what should be the best environment variable name for this one".

> I guess it is worth mentioning that adding scikit-build-core as an alternate build backend could help with this.

So far, in my opinion, the "build backend" is a "not-a-problem" problem. It will not be very worthwhile if you end up with a complicated backend-coexistence solution with increased code length and dependencies. I think you are already familiar with the current build script and it has worked well for all your purposes, which is already a miracle in a modern software environment full of artificial complexities and pitfalls.

Just have a look at the PR list of scikit-build-core (https://github.com/scikit-build/scikit-build-core/pulls):

  1. The first one, PR 793, deprecates things to fix issue 747. But issue 747 is not a real issue. (And "fixing" this issue will create real issues for applications depending on the current behavior.)
  2. The second one, PR 769, drops Python 3.7. But Python 3.7 does not have any problem at this moment. (And dropping Python 3.7 will create real issues for applications that want to keep supporting it.)
  3. The third one, PR 752, adds support for Python 3.13. But no one needs this at this moment.

In the end, lots of effort is wasted on (a) jumping between alternative things; (b) removing useful things; and (c) adding imaginary things.

Let's hope we can do better.

@chillenb (Contributor, Author) commented on Jul 5, 2024

I've gotten rid of everything in the PR except the option to use precompiled headers. It's off by default and doesn't seem to break backward compatibility, so hopefully it is compatible with the current project goals. Honestly, I am relieved that I didn't 100% waste your time!

Also, while testing with clang on linux64, I found that clang was not able to utilize libgomp. Changing line 445 so that clang could use libiomp5 or libomp seemed to fix this in a natural way. If you know a better way to deal with this issue, perhaps without changing CMakeLists.txt, I can use that instead.

> For this I do not get your point. With the current setup.py, I just need to run `sed -i "s/jobs=2/jobs=4/" setup.py`, which is 34 characters. With your change you have to do `export CMAKE_BUILD_PARALLEL_LEVEL=4`, which is 35 characters, so it does not save any typing. The user has to read a longer setup.py and remember the name of the environment variable. If there is a typo in CMAKE_BUILD_PARALLEL_LEVEL, it will be a silent bug.

I apologize; the part about saving typing was supposed to be a dumb joke. Indeed, no effort is really saved. I just need to get better at sed 😆

> Just have a look at the PR list of scikit-build-core (https://github.com/scikit-build/scikit-build-core/pulls):

It does look like a mess. I can see why you feel that depending on scikit-build-core (and keeping up to date with all the changes) is unwise for block2 at the moment. It is useful for small projects that want to get started quickly; time will tell if it can become a stable and reliable tool.

Thanks for all the time and effort you spent responding to my proposals so thoroughly. I really appreciate it!

@hczhai (Contributor) commented on Jul 5, 2024

Thanks for your understanding.

> Honestly, I am relieved that I didn't 100% waste your time!

It is not a waste of time. For me it is good to know other people's experiences and requests. Every ugly line in CMakeLists.txt has its reason. If no one asks, maybe one day I will forget how to justify those choices!

> Also, while testing with clang on linux64, I found that clang was not able to utilize libgomp. Changing line 445 so that clang could use libiomp5 or libomp seemed to fix this in a natural way. If you know a better way to deal with this issue, perhaps without changing CMakeLists.txt, I can use that instead.

If you used clang+libiomp5, then you can set -DOMP_LIB=INTEL explicitly and it will go down this branch: https://github.com/block-hczhai/block2-preview/blob/master/CMakeLists.txt#L417-L427. So if you verified that this works, then it means the original CMakeLists.txt can already handle clang+libiomp5, right?

If you used clang+libomp, then it is a case not covered in the current CMakeLists.txt. Please let me know whether this is the case you need. If yes, we need to make sure that linking this together with MKL's libmkl_intel_thread works, by running a DMRG script (with n_threads=8, n_mkl_threads=4, for example), because there can be runtime errors due to the wrong combination. If this is not the case you are interested in, then we can skip the test and revert that line.

Also note that if you use block2 and pyscf in the same Python script, and they are linked to different OpenMP libraries, this can cause runtime errors or produce wrong numbers. Actually I do not know how to adjust the OpenMP library for pyscf. This is one of the reasons why we do not recommend that users compile block2 manually. The best solution would be to get rid of the OpenMP dependence in block2, but we recognized this problem too late.

> Let's say I want to build a wheel against the school cluster's MPI installation so that some colleagues can install it in their Python environments.

So you probably need three wheels: a block2 wheel with a specific OpenMP and MPI combination, a pyscf wheel with the same OpenMP lib, and an mpi4py wheel with the same MPI lib. Note that Open MPI libraries change their API (and break compatibility) with each change of the minor version number. All of this has to be done for a few different Python versions (if you want your colleagues to be able to choose different Python versions).

@chillenb (Contributor, Author) commented on Jul 5, 2024

> If you used clang+libiomp5, then you can set -DOMP_LIB=INTEL explicitly and it will go down this branch: https://github.com/block-hczhai/block2-preview/blob/master/CMakeLists.txt#L417-L427. So if you verified that this works, then it means the original CMakeLists.txt can already handle clang+libiomp5, right?
>
> If you used clang+libomp, then it is a case not covered in the current CMakeLists.txt. Please let me know whether this is the case you need. If yes, we need to make sure that linking this together with MKL's libmkl_intel_thread works, by running a DMRG script (with n_threads=8, n_mkl_threads=4, for example), because there can be runtime errors due to the wrong combination. If this is not the case you are interested in, then we can skip the test and revert that line.

OK, I see what you mean! The change was meant to help clang link libiomp5 by default, but I will use -DOMP_LIB=INTEL and revert the change. The threadpoolctl developers have reported bad things happening when libiomp5 and libomp are mixed. It seems that mixing libomp and mkl_intel_thread is therefore likely to fail, so I won't try it.

> Also note that if you use block2 and pyscf in the same Python script, and they are linked to different OpenMP libraries, this can cause runtime errors or produce wrong numbers. Actually I do not know how to adjust the OpenMP library for pyscf. This is one of the reasons why we do not recommend that users compile block2 manually. The best solution would be to get rid of the OpenMP dependence in block2, but we recognized this problem too late.

Thank you for the warning. I have not noticed any problems from this yet, but I'll keep it in mind.

It seems like PySCF chooses its OpenMP library this way. It probably uses the OpenMP runtime that comes with the compiler. I don't think there is a way to change this aside from editing the CMakeLists.txt.

The wheels available on PyPI ship libgomp-a34b3233.so.1.0.0. Since libiomp5 is binary compatible with libgomp, I guess you could replace PySCF's libgomp.so with a symlink to block2's libgomp or libiomp5. I haven't tested this but it's worth a shot.

> So you probably need three wheels: a block2 wheel with a specific OpenMP and MPI combination, a pyscf wheel with the same OpenMP lib, and an mpi4py wheel with the same MPI lib.

Thanks a lot for these notes and instructions. They're very helpful!

Most of my colleagues are using conda environments with the conda-forge channel. Conda-forge has a special mechanism to ensure that all packages use the same OpenMP. So if I build conda packages of pyscf and block2 with the conda-forge toolchain (or use the prebuilt pyscf and only build block2), that should take care of OpenMP consistency. But the MPI / mpi4py and Python versions will definitely be tedious.

(speaking of which, do you think adding block2 to conda-forge is a good idea?)

@hczhai (Contributor) commented on Jul 5, 2024

Thanks for letting me know all these details.

> It seems that mixing libomp and mkl_intel_thread is therefore likely to fail, so I won't try it.

So in order to make the user aware of this pitfall, I think it makes sense to explicitly require the user to set -DOMP_LIB=INTEL, so that they know they are using the Intel OpenMP library.
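
A hypothetical sketch of what such an explicit requirement could look like (this is not the actual logic in block2's CMakeLists.txt, which is more involved):

```cmake
# Hypothetical sketch only; block2's real OMP_LIB handling is more involved.
if (CMAKE_CXX_COMPILER_ID MATCHES "Clang" AND NOT DEFINED OMP_LIB)
    message(FATAL_ERROR
        "With clang, please choose the OpenMP runtime explicitly, e.g. -DOMP_LIB=INTEL "
        "(libiomp5), so that it matches MKL's libmkl_intel_thread.")
endif()
```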

> I don't think there is a way to change this aside from editing the CMakeLists.txt.

So you see why the block2 CMakeLists.txt has to be more flexible (and complicated, but not artificially complicated).

> (speaking of which, do you think adding block2 to conda-forge is a good idea?)

If you are able to handle and maintain that, it would be great. If some updates in GitHub Actions or setup.py are needed, I can provide some help. If no changes are required on the block2 side, that will be even better. I can also add this to the README of block2 once it is available (but pip install will not be deleted, and the conda route will not be recommended as the default). conda itself is another source of complexity (see, for example, #81 (comment)). But since many people rely on it, it can be an option.

@hczhai marked this pull request as ready for review on July 5, 2024 at 23:11
@hczhai merged commit dbe1ff6 into block-hczhai:master on Jul 5, 2024
30 of 31 checks passed
@chillenb (Contributor, Author) commented on Jul 5, 2024

@hczhai

> So you see why the block2 CMakeLists.txt has to be more flexible (and complicated, but not artificially complicated).

Yes!

> If you are able to handle and maintain that, it would be great. If some updates in GitHub Actions or setup.py are needed, I can provide some help. If no changes are required on the block2 side, that will be even better. I can also add this to the README of block2 once it is available (but pip install will not be deleted, and the conda route will not be recommended as the default). conda itself is another source of complexity (see, for example, #81 (comment)). But since many people rely on it, it can be an option.

Sure, I don't mind doing this. Although conda has issues of its own, I think it will be convenient for many users, and it should be possible to avoid the worst hazards (like conflicting OpenMP libraries).

> (see, for example, #81 (comment))

I didn't realize that Intel's numpy had such a bug. Hopefully, it's in their numpy package and not in MKL itself!
