
Error after installing CUDA extension for Cauchy multiplication #62

Open
gitbooo opened this issue Aug 26, 2022 · 13 comments

@gitbooo

gitbooo commented Aug 26, 2022

I'm trying to reproduce the experiments, but the code is returning a KeyError: 'nvrtc', and the warning [src.models.sequence.ss.kernel][WARNING] - CUDA extension for Cauchy multiplication not found is still appearing.

I'm also getting this error: Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS

@albertfgu
Contributor

Can you elaborate on the 'nvrtc' error? Can you uninstall and reinstall the extension (pip uninstall cauchy-mult and cd extensions/cauchy && python setup.py install) and copy what it prints?

Does the code run if you completely uninstall the extension? What about if you install pykeops?

@gitbooo
Author

gitbooo commented Aug 29, 2022

After doing multiple tests, I realized that the Cauchy extension is not the problem (although it is strange that, even after installing the extension, the code still reports "CUDA extension for Cauchy multiplication not found"). It is the second error that I cannot resolve:

Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7f8411eaaf06]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7f8411ea28e5]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7f8411dc7e09]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f8411eaba3d]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7f8411dc5948]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f8411eaba3d]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7f8411d80b46]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7f84117e546a]
/lib/x86_64-linux-gnu/libc.so.6(+0x43161) [0x7f84893fb161]
/lib/x86_64-linux-gnu/libc.so.6(+0x4325a) [0x7f84893fb25a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7f84893d9bfe]
python(+0x2125d4) [0x564a6c3b15d4]
/var/spool/slurmd/job2192901/slurm_script: line 18: 17719 Aborted


@albertfgu
Contributor

I haven't seen this error before. Just to confirm, this happens even with the extension uninstalled? Does your environment work with other codebases? Outside of the extension, there is nothing unusual about this repository's requirements.

@gitbooo
Author

gitbooo commented Aug 29, 2022

Yes, the extension is not installed. However, I'm getting this error at the end of training, after epoch 9 finishes.

@danassou

Hi, I have the same error at the end of training (running python -m train experiment=forecasting/s4-informer-{etth,ettm,ecl,weather}):

Epoch 9: 100%|█▉| 1510/1511 [00:24<00:00, 62.27it/s, loss=0.0216, v_num=pbmZ, val/mse=0.421, val/loss=0.421, test/mse=0.266, test/loss=0.266, train/mse=0.0242, train/loss=0.0242]
Epoch 9, global step 4809: 'val/loss' was not in top 1
Epoch 9: 100%|██| 1511/1511 [00:24<00:00, 62.14it/s, loss=0.0216, v_num=pbmZ, val/mse=0.421, val/loss=0.421, test/mse=0.266, test/loss=0.266, train/mse=0.0231, train/loss=0.0231]
Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7fbaccc77f06]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7fbaccc6f8e5]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7fbaccb94e09]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fbaccc78a3d]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7fbaccb92948]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fbaccc78a3d]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7fbaccb4db46]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7fbacc5b246a]
/lib/x86_64-linux-gnu/libc.so.6(+0x43031) [0x7fbb3f73f031]
/lib/x86_64-linux-gnu/libc.so.6(+0x4312a) [0x7fbb3f73f12a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7fbb3f71dc8e]
python(+0x2010a0) [0x56320cea20a0]
Aborted (core dumped)

So it does train, but it ends with this strange assertion error.
After looking around, it seems this error is reported by many people and relates to aws-sdk-cpp; for example, see:
huggingface/datasets#3310

@albertfgu
Contributor

Thanks for the additional info! Does this error occur if you uninstall the datasets package then? Does it only happen with AWS?

@danassou

I can't run the code without the datasets library since it's required; I get a "no module found" error if I do. To clarify, I'm not running my code on AWS; I'm using my university's cluster (to be honest, I don't really understand why AWS-related errors pop up!)

@albertfgu
Contributor

You should be able to remove the dataset dependency by deleting the "lra" import from src/dataloaders/__init__.py
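A minimal sketch of that one-line change, assuming the import in src/dataloaders/__init__.py is a plain relative import (the exact shape of the line is an assumption; check the actual file). Guarding the import instead of deleting it keeps the LRA loaders working in environments where datasets is installed:

```python
# Hypothetical sketch of the edit to src/dataloaders/__init__.py: make the
# LRA loaders (which pull in HuggingFace `datasets`, and through it pyarrow)
# optional instead of removing the import outright.
try:
    from . import lra  # assumed import shape; may differ in the real file
except ImportError:
    lra = None  # the forecasting experiments don't need the LRA dataloaders
```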

@gitbooo
Author

gitbooo commented Aug 30, 2022

You should be able to remove the dataset dependency by deleting the "lra" import from src/dataloaders/__init__.py

The code seems to work on CPU without errors. However, I get a KeyError: 'nvrtc' with pykeops installed. Can you tell us which pykeops version you are using?

@albertfgu
Contributor

  1. Does it run when pykeops is uninstalled?
  2. Are you able to install the CUDA extension instead?
  3. Can you try pip install pykeops==1.5? Later versions of pykeops sometimes cause installation errors for me.
  4. What happens if you follow the instructions on the pykeops page for testing the installation?
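For point 4, the pykeops documentation exposes built-in installation self-tests. A guarded sketch that is safe to paste into any environment (it falls back to a flag when pykeops isn't importable, rather than crashing):

```python
# Run the pykeops installation self-tests from its docs; the import guard
# means this snippet degrades gracefully where pykeops isn't installed.
try:
    import pykeops
    pykeops.test_numpy_bindings()  # compiles and runs a tiny NumPy formula
    pykeops.test_torch_bindings()  # same check through the torch bindings
    keops_ok = True
except ImportError:
    keops_ok = False
```

If either self-test fails with a CUDA-related message, that usually points at the same broken compilation the RuntimeError below complains about.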

@gitbooo
Author

gitbooo commented Aug 31, 2022

  • When pykeops is uninstalled, the code works without any error, but only on CPU.
  • I followed the steps to install the CUDA extension, but I'm still receiving [2022-08-31 15:42:03,450][src.models.sequence.ss.kernel][WARNING] - CUDA extension for Cauchy multiplication not found. Install by going to extensions/cauchy/ and running python setup.py install. This should speed up end-to-end training by 10-50%
  • I'm no longer getting the KeyError 'nvrtc', but I'm getting this instead:
RuntimeError: [KeOps] This KeOps shared object has been compiled without cuda support: 
 1) to perform computations on CPU, simply set tagHostDevice to 0
 2) to perform computations on GPU, please recompile the formula with a working version of cuda.
  • I passed the tests successfully.

@farshchian

farshchian commented Sep 4, 2022

I am also facing the exact same issue. @gitbooo, have you found a solution?

@albertfgu
Contributor

  1. Without pykeops, the code should still run on GPU. Is there a reason you can only use CPU?
  2. I don't know why the extension isn't working. One note: it has to be installed separately for every environment (e.g. for each different GPU, CUDA version, etc.). For example, it doesn't work if different machines share a conda environment; you would need to create a separate conda environment for each machine type and install the extension in each one.
  3. I've seen that message several times in the past, and I think it was always caused by an improper install. Installing in a fresh environment along with the latest version of cmake was the solution (pip install pykeops==1.5 cmake).
  4. Were you able to comment out the datasets dependency? It should involve changing one line of code in src/dataloaders/__init__.py
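For background on point 1: the "Cauchy multiplication" that the extension and pykeops accelerate is essentially the kernel sum k(z) = Σᵢ wᵢ / (z − xᵢ), and without either backend the code can fall back to a naive broadcasted version. A NumPy sketch along those lines (function name, real-valued inputs, and shapes here are illustrative, not the repo's actual API, which works with complex tensors):

```python
import numpy as np

def cauchy_naive(w, x, z):
    # w: (N,) weights, x: (N,) poles, z: (M,) evaluation points.
    # Broadcasts to an (M, N) matrix of terms w_i / (z_j - x_i), then
    # sums over i -- memory-hungry at O(M*N), which is exactly what the
    # fused CUDA/KeOps kernels avoid.
    return (w[None, :] / (z[:, None] - x[None, :])).sum(axis=-1)

w = np.array([1.0, 2.0])
x = np.array([0.0, 1.0])
z = np.array([2.0, 3.0])
print(cauchy_naive(w, x, z))  # approx [2.5, 1.3333]: 1/2 + 2/1 and 1/3 + 2/2
```

This is only to show what the kernel computes; on GPU without pykeops or the extension, the repo's own fallback handles this with the same asymptotic memory cost.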
