BUG: g++: fatal error: Killed signal terminated program cc1plus #6833

Closed
ThomasHoppe opened this issue Jul 18, 2023 · 30 comments
Labels
installation (issues about dependencies or installation), pytensor

Comments

@ThomasHoppe

ThomasHoppe commented Jul 18, 2023

Describe the issue:

During model compilation, the compiler receives a kill signal (reason unknown).
Can be reproduced with two different models.

Reproducible code example:

Code example is longer, see attached notebook and data file below

Error message:

CompileError: Compilation failed (return status=1):
/usr/bin/g++ -shared -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -Wno-c++11-narrowing -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -march=broadwell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku -mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni -mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-avx512vpopcntdq -mno-movdiri -mno-movdir64b -mno-waitpkg -mno-cldemote -mno-ptwrite --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=4096 -mtune=broadwell -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -I/home/thomas/.local/lib/python3.8/site-packages/numpy/core/include -I/usr/include/python3.8 -I/home/thomas/.local/lib/python3.8/site-packages/pytensor/link/c/c_code -L/usr/lib -fvisibility=hidden -o /home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.29-x86_64-3.8.10-64/tmpqc37cguk/m80e56c88364a8e9d8553f659577acc090143a8d54f9ef83c7c12ac7eb91aecfa.so /home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.29-x86_64-3.8.10-64/tmpqc37cguk/mod.cpp -lpython3.8
/home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.29-x86_64-3.8.10-64/tmpqc37cguk/mod.cpp: In member function ‘int {anonymous}::__struct_compiled_op_m80e56c88364a8e9d8553f659577acc090143a8d54f9ef83c7c12ac7eb91aecfa::run()’:
/home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.29-x86_64-3.8.10-64/tmpqc37cguk/mod.cpp:5249:13: note: variable tracking size limit exceeded with ‘-fvar-tracking-assignments’, retrying without
 5249 |         int run(void) {
      |             ^~~
g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

PyMC version information:

Occurred in 5.5.0 and 5.6.1

Detailed watermark:

Last updated: Tue Jul 18 2023

Python implementation: CPython
Python version : 3.8.10
IPython version : 8.0.1

arviz : 0.15.1
pandas : 2.0.2
daft : 0.1.2
pymc : 5.6.1
matplotlib: 3.7.1
numpy : 1.22.1
scipy : 1.7.3
pytensor: 2.12.3

Watermark: 2.3.0

Operating System: Ubuntu 20.04.6 LTS Subsystem under Windows 10 WSL-2
PyMC installation via pip

Context for the issue:

Blocks further evaluation of the model with sample_posterior_predictive

D1.csv
compiler-bug.zip

@twiecki
Member

twiecki commented Jul 18, 2023

Installation with pip is not supported (because the compiler situation is too difficult), you need to use mamba or conda.

@twiecki twiecki closed this as completed Jul 18, 2023
@ThomasHoppe
Author

@twiecki:

I have now reinstalled pymc under conda, but the problem remains :-(

Operating System: Ubuntu 20.04.6 LTS Subsystem under Windows 10 WSL-2
PyMC installation via conda (miniconda)

Last updated: Tue Jul 25 2023

Python implementation: CPython
Python version : 3.8.17
IPython version : 8.0.1

arviz : 0.15.1
numpy : 1.22.1
matplotlib: 3.7.1
scipy : 1.7.3
pandas : 2.0.2
pymc : 5.6.1

Watermark: 2.3.0

CompileError: Compilation failed (return status=1):
/usr/bin/g++ -shared -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -Wno-c++11-narrowing -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -march=broadwell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku -mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni -mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-avx512vpopcntdq -mno-movdiri -mno-movdir64b -mno-waitpkg -mno-cldemote -mno-ptwrite --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=4096 -mtune=broadwell -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -I/home/thomas/.local/lib/python3.8/site-packages/numpy/core/include -I/home/thomas/miniconda3/envs/pymc/include/python3.8 -I/home/thomas/.local/lib/python3.8/site-packages/pytensor/link/c/c_code -L/home/thomas/miniconda3/envs/pymc/lib -fvisibility=hidden -o /home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.17-x86_64-3.8.17-64/tmph2y66i87/m80e56c88364a8e9d8553f659577acc090143a8d54f9ef83c7c12ac7eb91aecfa.so /home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.17-x86_64-3.8.17-64/tmph2y66i87/mod.cpp -lpython3.8

g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

@ricardoV94 ricardoV94 reopened this Jul 25, 2023
@ricardoV94 ricardoV94 added installation issues about dependencies or installation and removed bug labels Jul 25, 2023
@twiecki
Member

twiecki commented Jul 25, 2023

Hm, it seems it's still using the system compiler (/usr/bin/g++), whereas it should use the compilers from the environment. Are you sure you activated the environment correctly? Also, can you post the outputs of: mamba list and which g++?
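
The same check can be made from inside the Python session that runs the model. The following is a minimal sketch using only the standard library; the helper names `is_inside` and `compiler_in_env` are made up for illustration, and it assumes that Python itself runs from the conda/mamba environment, so that `sys.prefix` points at the environment directory.

```python
import os
import shutil
import sys


def is_inside(path, prefix):
    """Return True if `path` lies inside the directory tree rooted at `prefix`."""
    path = os.path.realpath(path)
    prefix = os.path.realpath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


def compiler_in_env(name="g++"):
    """Locate `name` on PATH and report whether it belongs to the active env.

    Returns (path, inside_env); path is None if no such executable is found.
    """
    path = shutil.which(name)
    if path is None:
        return None, False
    return path, is_inside(path, sys.prefix)


if __name__ == "__main__":
    path, inside = compiler_in_env("g++")
    print(f"g++ found at: {path}")
    print(f"inside environment {sys.prefix}: {inside}")
```

With a system-wide Python (where `sys.prefix` is `/usr`), `/usr/bin/g++` would also count as "inside", so the result is only meaningful when the interpreter comes from the environment. PyTensor's `config.cxx` setting should name the compiler it will actually call, which can be compared with the path printed here.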

@ThomasHoppe
Author

I am definitely sure that the environment was activated correctly. This Python version is only used for pymc.

Here is the module list and the output of g++ -v:

conda_list.txt
g++-version.txt

@twiecki
Member

twiecki commented Jul 26, 2023

That's not the output of which g++.

@ThomasHoppe
Author

which g++ gives /usr/bin/g++

@twiecki
Member

twiecki commented Jul 27, 2023

This is what it shows for me:

>>which clang
clang is /Users/twiecki/micromamba/envs/pymc5/bin/clang
clang is /usr/bin/clang

You can see that my env has a compiler installed, which yours lacks; I'm not sure why. But you can try to install it manually.

@ThomasHoppe
Author

I installed clang outside an environment; which clang shows /usr/bin/clang.
Even if I install clang inside an env, which clang still shows /usr/bin/clang.

But I still got

/home/thomas/.local/lib/python3.8/site-packages/pytensor/tensor/rewriting/elemwise.py:1019: UserWarning: Loop fusion failed because the resulting node would exceed the kernel argument limit.
warn(
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...

CompileError: Compilation failed (return status=1):
/usr/bin/g++ -shared -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -Wno-c++11-narrowing -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -march=broadwell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku -mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni -mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-avx512vpopcntdq -mno-movdiri -mno-movdir64b -mno-waitpkg -mno-cldemote -mno-ptwrite --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=4096 -mtune=broadwell -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -I/home/thomas/.local/lib/python3.8/site-packages/numpy/core/include -I/home/thomas/miniconda3/envs/pymc5/include/python3.8 -I/home/thomas/.local/lib/python3.8/site-packages/pytensor/link/c/c_code -L/home/thomas/miniconda3/envs/pymc5/lib -fvisibility=hidden -o /home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.17-x86_64-3.8.17-64/tmpv9hkx7wj/m68ff8b8c4d606c4dd1f8fe6d6ebc5e974ecbc23ad2e3ca82f4d826e6f743dc44.so /home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.17-x86_64-3.8.17-64/tmpv9hkx7wj/mod.cpp -lpython3.8
g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

So /usr/bin/g++ is still called. Is there some additional configuration to do for switching to clang?

@twiecki
Member

twiecki commented Jul 31, 2023

What I meant is that you need to install g++ from mamba into your environment. clang is the compiler I'm using on OSX instead of g++. Something went wrong with your installation; you can also retry in a fresh env. Or try mamba install -c conda-forge gcc.

@ThomasHoppe
Author

Well, I made a clean install.

  • I installed mamba under Linux as described on https://mamba.readthedocs.io/en/latest/installation.html from mambaforge.
    Then:
  • mamba create -n pymc
  • mamba activate pymc
  • mamba install gcc
  • mamba install pymc (which also downgraded gcc from 13.1.0 to 12.3.0 and four other packages)
  • which gcc gives /home/thomas/mambaforge/envs/pymc/bin/gcc
  • which g++ gives /home/thomas/mambaforge/envs/pymc/bin/g++
    followed by the installation of jupyter notebook and supporting libs.

Watermark now gives:
Last updated: Wed Aug 02 2023

Python implementation: CPython
Python version : 3.11.4
IPython version : 8.14.0

arviz : 0.16.1
pandas : 2.0.3
scipy : 1.11.1
matplotlib: 3.7.2
numpy : 1.25.1
pymc : 5.7.0

Watermark: 2.4.3

Running the compiler-bug notebook again gives, after

/home/thomas/mambaforge/envs/pymc/lib/python3.11/site-packages/pytensor/tensor/rewriting/elemwise.py:1028: UserWarning: Loop fusion failed because the resulting node would exceed the kernel argument limit.
warn(
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...

the well-known compiler bug, but now with gcc from the env

CompileError: Compilation failed (return status=1):
/home/thomas/mambaforge/envs/pymc/bin/g++ -shared -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -Wno-c++11-narrowing -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -march=broadwell -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mno-clflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mhle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mrtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mno-xsavec -mxsaveopt -mno-xsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=4096 -mtune=broadwell -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -I/home/thomas/mambaforge/envs/pymc/lib/python3.11/site-packages/numpy/core/include -I/home/thomas/mambaforge/envs/pymc/include/python3.11 -I/home/thomas/mambaforge/envs/pymc/lib/python3.11/site-packages/pytensor/link/c/c_code -L/home/thomas/mambaforge/envs/pymc/lib -fvisibility=hidden -o /home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.31-x86_64-3.11.4-64/tmpyelpazdx/m68ff8b8c4d606c4dd1f8fe6d6ebc5e974ecbc23ad2e3ca82f4d826e6f743dc44.so /home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.31-x86_64-3.11.4-64/tmpyelpazdx/mod.cpp 
-lpython3.11
g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

Since this used Python 3.11 and Pymc 5.7, I made a second attempt by downgrading Python to 3.8 and Pymc 3.6.1.

The paths to gcc and g++ are the same as above, and so is the error.

So I think it is not an issue with my installation.

Did you run the compiler-bug.ipynb yourself? Could you reproduce the behaviour?

Since the warning Loop fusion failed because the resulting node would exceed the kernel argument limit. always appears, couldn't it be that the translation from the tensor network to the C code (at least as far as I understand it from the outside) produces some kind of "loop" for the compiler, and that the compiler thus runs out of space?

@ricardoV94
Member

Did you try the conda-forge channel specifically? mamba install -c conda-forge pymc in a new environment.

@ricardoV94
Member

ricardoV94 commented Aug 2, 2023

Since the warning Loop fusion failed because the resulting node would exceed the kernel argument limit. always appears, couldn't it be that the translation from the tensor network to the C code (at least as far as I understand it from the outside) produces some kind of "loop" for the compiler, and that the compiler thus runs out of space?

Can you try with a very simple model?

import pymc as pm

with pm.Model() as m:
  x = pm.Normal("x")
  pm.sample()

It is not clear to me whether you see a problem with specific models or in general.

@ThomasHoppe
Author

mamba install -c conda-forge pymc gives as output

Looking for: ['pymc']

conda-forge/noarch 13.5MB @ 4.0MB/s 3.7s
conda-forge/linux-64 33.4MB @ 4.7MB/s 7.7s

Pinned packages:

  • python 3.8.*

Transaction

Prefix: /home/thomas/mambaforge/envs/pymc

All requested packages already installed

@ricardoV94
Member

You should install from a fresh environment

@ThomasHoppe
Author

It is the specific model of the notebook. As I explained at the beginning, a colleague of mine who authored this model has no problem at all.

All of my other models worked under PyMC 5 (after some adaptations) without problems.
Even the simple model:

import pymc as pm

with pm.Model() as m:
    x = pm.Normal("test")
    pm.sample()

Runs as expected:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [test]

100.00% [4000/4000 00:02<00:00 Sampling 2 chains, 0 divergences]

Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 2 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics

@ricardoV94
Member

So back to your case. After you install with conda-forge, can you try running a single chain? Just trying to narrow down the issue space

@ThomasHoppe
Author

So back to your case. After you install with conda-forge, can you try running a single chain? Just trying to narrow down the issue space

Well, I ran mamba install -c conda-forge pymc in a fresh env named test and sampled with chains=1 as suggested:

with model_toto:
    trace_ = pm.sample(draws=nb_samples, chains=1, tune=tune)

I still got the same behavior:

CompileError: Compilation failed (return status=1):
/home/thomas/mambaforge/envs/test/bin/g++ -shared -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -Wno-c++11-narrowing -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -march=broadwell -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mno-clflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mhle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mrtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mno-xsavec -mxsaveopt -mno-xsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=4096 -mtune=broadwell -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -I/home/thomas/.local/lib/python3.8/site-packages/numpy/core/include -I/home/thomas/mambaforge/envs/test/include/python3.8 -I/home/thomas/.local/lib/python3.8/site-packages/pytensor/link/c/c_code -L/home/thomas/mambaforge/envs/test/lib -fvisibility=hidden -o /home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.10-x86_64-3.8.17-64/tmpi_iiqq0k/m68ff8b8c4d606c4dd1f8fe6d6ebc5e974ecbc23ad2e3ca82f4d826e6f743dc44.so /home/thomas/.pytensor/compiledir_Linux-5.15-microsoft-standard-WSL2-x86_64-with-glibc2.10-x86_64-3.8.17-64/tmpi_iiqq0k/mod.cpp -lpython3.8
g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

Did you run the supplied notebook? How did it behave in your environment?

@twiecki
Member

twiecki commented Aug 2, 2023

I think you might not have enough resources (RAM) so g++ is getting killed. E.g. soedinglab/hh-suite#280

@ThomasHoppe
Author

I increased the limit for the main storage to 10 GB and still the same error occurred. Actually, I can't believe that a compilation of roughly 8 MB of C code (compare the attached generated code file) cannot be done within 10 GB.
pytensor_compilation_error_1pmxatij.zip

@twiecki
Member

twiecki commented Aug 14, 2023

@maresb could this be an arch issue?

@maresb
Contributor

maresb commented Aug 15, 2023

No, this should be pure linux-64. This feels to me like a memory issue. Maybe the 10GB is not being made available somehow. I would check the output of free, and then look in /var/log/syslog for messages from the kernel's OOM-killer.
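
The two checks suggested here can also be scripted. This is a rough sketch, not a definitive tool: the regular expression assumes the "Out of memory: Killed process ..." wording that recent Linux kernels log, which may differ on other kernel versions, and the sample log line is invented for illustration rather than taken from the reporter's syslog. The helper names `find_oom_kills` and `meminfo_kib` are hypothetical.

```python
import re

# Assumed wording of the kernel's OOM-killer message; other kernel versions
# may phrase it differently, so adjust the pattern to your own logs.
OOM_PATTERN = re.compile(r"Out of memory: Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, process_name) for every OOM-killer message in `lines`."""
    kills = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


def meminfo_kib(text):
    """Parse text in /proc/meminfo format into a {field: value} dictionary.

    Sizes in /proc/meminfo are reported in kB, so values are in those units.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


# Illustrative input only (not the reporter's actual log or memory figures).
sample_log = [
    "Sep 12 10:15:01 host kernel: Out of memory: Killed process 4242 (cc1plus) total-vm:11534336kB",
    "Sep 12 10:15:02 host systemd[1]: Started Session.",
]
sample_meminfo = "MemTotal:       10240000 kB\nMemAvailable:    512000 kB\n"

print(find_oom_kills(sample_log))
print(meminfo_kib(sample_meminfo)["MemAvailable"])
```

On a real system, `find_oom_kills` would be fed the lines of `/var/log/syslog` (or the output of `dmesg`), and `meminfo_kib` the contents of `/proc/meminfo`.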

@ThomasHoppe
Author

I increased the limit for the main storage to 10 GB and still the same error occurred. Actually, I can't believe that a compilation of roughly 8 MB of C code (compare the attached generated code file) cannot be done within 10 GB.

@twiecki
Member

twiecki commented Sep 12, 2023

@ThomasHoppe Not disk space but RAM.

@ThomasHoppe
Author

When I say main memory, I am not talking about disk space. I'm talking about 10 GB of RAM!
The 10 GB are available. Take a look at the excerpt of the syslog.

I also enclose a video showing the last 6 minutes of the 31-minute call to pymc.sample, where you can see from htop and pmap that the memory usage of cc1plus increases within these 6 minutes from roughly 2 GB to more than 10 GB.

syslog-htop-pmap-video.zip

@twiecki
Member

twiecki commented Sep 12, 2023

@ThomasHoppe I misunderstood. Then it's definitely not the RAM. I'm a bit stumped, because it's not a compiler error but the compiler getting killed.

@ThomasHoppe
Author

State of the bug isolation:

  • Clean install in separate mamba environment (hence dependencies should be correct)
  • Excluded RAM restrictions
  • Found increased memory consumption of the compiler in the last 6 minutes of the 30-minute compilation process
  • Remaining candidate causes for the compiler behavior are 1) a so far unnoticed bug in the compiler itself, 2) a configuration issue in the parameters used to call the compiler, 3) the generated C code, which depends on PyTensor
  1. Considering how widely GCC is used, it is not very probable that such a compiler bug hasn't been found yet.
  2. This is a possibility I cannot exclude. If I inspect the compiler parameters in the error message of Aug 2, I see -I/home/thomas/.local/lib/python3.8/site-packages/numpy/core/include, which is definitely an include path outside the mamba environment in use. Could this be the reason?
  3. As a computer scientist, I would conclude that the trouble is more likely caused by the generated C code, which causes the compiler in one way or another to allocate more and more memory.
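
The observation in point 2 can be checked mechanically. The sketch below lists every `-I` include directory of a compiler command that lies outside a given environment prefix. The helper name `include_dirs_outside` is made up for illustration, and the command is a shortened stand-in for the Aug 2 invocation: only its three `-I` flags are copied from the error message, while the output and source file names are abbreviated.

```python
import os
import shlex


def include_dirs_outside(command, prefix):
    """List the -I include directories in `command` that are not under `prefix`."""
    prefix = os.path.abspath(prefix)
    outside = []
    for token in shlex.split(command):
        if token.startswith("-I") and len(token) > 2:
            directory = os.path.abspath(token[2:])
            if os.path.commonpath([directory, prefix]) != prefix:
                outside.append(directory)
    return outside


# Shortened stand-in for the Aug 2 command; only the -I flags matter here.
command = (
    "/home/thomas/mambaforge/envs/test/bin/g++ -shared -g -O3 "
    "-I/home/thomas/.local/lib/python3.8/site-packages/numpy/core/include "
    "-I/home/thomas/mambaforge/envs/test/include/python3.8 "
    "-I/home/thomas/.local/lib/python3.8/site-packages/pytensor/link/c/c_code "
    "-o mod.so mod.cpp -lpython3.8"
)

for directory in include_dirs_outside(command, "/home/thomas/mambaforge/envs/test"):
    print(directory)
```

Both flagged directories sit under `~/.local/lib/python3.8/site-packages`, the per-user site-packages directory that pip uses for `--user` installs. This suggests that NumPy and PyTensor may still have been imported from the earlier pip installation rather than from the environment. Whether that is related to the compiler being killed is a separate question. Setting the `PYTHONNOUSERSITE` environment variable, or starting Python with `-s`, keeps the user site directory off `sys.path`.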

@ricardoV94
Member

ricardoV94 commented Sep 13, 2023

@ThomasHoppe I didn't have time to look at your model before. I believe the source of the problem is that you have a very inefficient model. You are doing a series of operations per row of data, which builds a very large latent graph. You can probably vectorize your operations using advanced indexing, which will make the computational graph of the model much simpler and shorter to compile.
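
To illustrate why per-row operations inflate the graph, here is a toy sketch. The `Node` class is a made-up stand-in for a symbolic expression and is not PyTensor's actual graph machinery; it only counts how many nodes each style of model building creates.

```python
class Node:
    """Toy symbolic expression node (hypothetical, for counting only)."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def __sub__(self, other):
        return Node("sub", self, other)

    def __getitem__(self, index):
        return Node("index", self, index)


def count_nodes(outputs):
    """Count the distinct Node objects reachable from `outputs`."""
    seen = set()
    stack = list(outputs)
    while stack:
        item = stack.pop()
        if isinstance(item, Node) and id(item) not in seen:
            seen.add(id(item))
            stack.extend(item.inputs)
    return len(seen)


def per_row_outputs(matches):
    """Build one index/index/subtract chain per row of data."""
    score = Node("score")
    return [score[home] - score[away] for home, away in matches]


def vectorized_output(matches):
    """Build a single index/index/subtract chain for all rows at once."""
    score = Node("score")
    home = [h for h, _ in matches]
    away = [a for _, a in matches]
    return [score[home] - score[away]]


matches = [(i % 18, (i + 1) % 18) for i in range(1000)]
print(count_nodes(per_row_outputs(matches)))    # grows with the number of rows
print(count_nodes(vectorized_output(matches)))  # independent of the number of rows
```

In this toy setting the per-row version creates three nodes per match plus the shared input, while the vectorized version stays at four nodes regardless of how many matches there are.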

@ricardoV94
Member

ricardoV94 commented Sep 13, 2023

Here is how I would write your last model (probably has bugs!!!):

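import numpy as np
import pymc as pm

# nb_clubs, low, home_goals_, away_goals_, home_goals and away_goals
# are expected to be defined earlier in the notebook.
#import sklearn.preprocessing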
model_toto = pm.Model()

with model_toto:
    score = pm.Normal("score", tau=1., mu=0., shape=nb_clubs)
    advantage_defence_diff = pm.Normal("offence_defence_diff", 
                            tau=1., mu=1.5, shape=nb_clubs)
    
    # number of goals scored more at home than away
    home_advantage = pm.Normal("home_advantage", tau=10., mu=.0)
       
    # softmax regression weights for winner prediction:
    weights = pm.Normal("weights", mu=(0., .25, -0.25), tau=100., shape=(3))    
          
    heim = np.array([hg[0] for hg in home_goals_])
    gast = np.array([hg[1] for hg in home_goals_])
    h_goals = np.array([hg[2] for hg in home_goals_])
    
    heim_ = np.array([ag[0] for ag in away_goals_])
    gast_ = np.array([ag[1] for ag in away_goals_])
    a_goals = np.array([ag[2] for ag in away_goals_])
    
    s_h_, add_h = score[heim], advantage_defence_diff[heim]
    s_g, add_g = score[gast], advantage_defence_diff[gast]
    
    s_h = s_h_ + home_advantage
    
    offence_heim = s_h + add_h
    defence_heim = s_h - add_h
    offence_gast = s_g + add_g
    defence_gast = s_g - add_g
            
    home_value = offence_heim - defence_gast
    away_value = offence_gast - defence_heim
        
    score_diff = s_h-s_g # can be negative!
        
    ### no negative values
    home_value = pm.math.switch(pm.math.lt(home_value, 0.), low, home_value)
    away_value = pm.math.switch(pm.math.lt(away_value, 0.), low, away_value)

    # for prediction of the winner
    toto = np.where(
        h_goals == a_goals,
        0,
        np.where(
            h_goals > a_goals,
            1,
            2
        ),
    )
                    
    mu_home = pm.Deterministic("home_rate", home_value)
    pm.Poisson("home_goals", observed=home_goals, mu=mu_home)

    mu_away = pm.Deterministic("away_rate", away_value)
    pm.Poisson("away_goals", observed=away_goals, mu=mu_away)
    
    ha_diff = score_diff
    ha_diff = ha_diff.reshape((-1,1))
    ha_diff = ha_diff.repeat(3, axis=1)  
        
    pred = pm.math.exp(ha_diff * weights)
    pred = (pred.T/pm.math.sum(pred, axis=1)).T
    pm.Categorical('toto', p=pred, observed=toto)

Those index and numerical operations are vectorized just like in NumPy, so your model won't grow in complexity with your data size.
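
The indexing pattern used above behaves the same way in plain NumPy, which makes it easy to try out in isolation. A minimal sketch with invented data follows; the `matches` table and the random `score` values are assumptions for illustration and not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
nb_clubs = 4
score = rng.normal(size=nb_clubs)

# Hypothetical match table: (home club index, away club index) per row.
matches = [(0, 1), (2, 3), (1, 0), (3, 2), (0, 2)]

# Per-row version: one pair of lookups and one subtraction per match.
diff_loop = np.array([score[h] - score[g] for h, g in matches])

# Vectorized version: two advanced-indexing lookups and one subtraction in total.
heim = np.array([h for h, _ in matches])
gast = np.array([g for _, g in matches])
diff_vec = score[heim] - score[gast]

# Both approaches give the same values, one entry per match.
assert np.allclose(diff_loop, diff_vec)
print(diff_vec.shape)
```

The vectorized form yields one array with an entry per match, which is what `score[heim]` and `score[gast]` do symbolically in the model above.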

@ThomasHoppe
Author

@ThomasHoppe I didn't have time to look at your model before. I believe the source of the problem is that you have a very inefficient model. You are doing a series of operations per row of data, which builds a very large latent graph. You can probably vectorize your operations using advanced indexing, which will make the computational graph of the model much simpler and shorter to compile.

@ricardoV94: Thanks for the suggestion. Actually, the model was designed by a colleague, who has no problems running it. He does not encounter the compiler problem. I also thought the iterative solution wasn't ideal, but didn't have the time to dive deeper into it without a running reference solution. Your suggestion seems quite plausible to me, and we will give it a try ...

@ricardoV94
Member

Let us know if it works. If not, the right place to continue this discussion would be on discourse: https://discourse.pymc.io/

Regarding your colleague, even if he could manage to compile, I am certain the model will be considerably slower the way he wrote it down. I'll close this issue in the meantime, as it's not clear it would be worth the trouble to try and make the compiler more robust to very large graphs.


4 participants