
Error after installing CUDA extension for Cauchy multiplication #62

Open
gitbooo opened this issue Aug 26, 2022 · 13 comments

@gitbooo

gitbooo commented Aug 26, 2022

I'm trying to reproduce the experiments, but the code is returning a KeyError: 'nvrtc', and the warning [src.models.sequence.ss.kernel][WARNING] - CUDA extension for Cauchy multiplication not found is still appearing.

I'm also getting this error: Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS

@albertfgu
Contributor

Can you elaborate on the 'nvrtc' error? Can you uninstall and reinstall the extension (pip uninstall cauchy-mult and cd extensions/cauchy && python setup.py install) and copy what it prints?

Does the code run if you completely uninstall the extension? What about if you install pykeops?

@gitbooo
Author

gitbooo commented Aug 29, 2022

After doing multiple tests, I realized that the Cauchy extension is not the problem (although it is strange that, even after installing the extension, the code still reports "CUDA extension for Cauchy multiplication not found"). It is the second error that I cannot resolve:

Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7f8411eaaf06]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7f8411ea28e5]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7f8411dc7e09]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f8411eaba3d]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7f8411dc5948]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f8411eaba3d]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7f8411d80b46]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7f84117e546a]
/lib/x86_64-linux-gnu/libc.so.6(+0x43161) [0x7f84893fb161]
/lib/x86_64-linux-gnu/libc.so.6(+0x4325a) [0x7f84893fb25a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7f84893d9bfe]
python(+0x2125d4) [0x564a6c3b15d4]
/var/spool/slurmd/job2192901/slurm_script: line 18: 17719 Aborted


@albertfgu
Contributor

I haven't seen this error before. Just to confirm, this happens even with the extension uninstalled? Does your environment work with other codebases? Outside of the extension, there is nothing unusual about this repository's requirements.

@gitbooo
Author

gitbooo commented Aug 29, 2022

Yes, the extension is not installed. However, I'm getting this error at the end of training, after epoch 9 finishes.

@danassou

Hi, I have the same error at the end of training (running python -m train experiment=forecasting/s4-informer-{etth,ettm,ecl,weather}):

Epoch 9: 100%|█▉| 1510/1511 [00:24<00:00, 62.27it/s, loss=0.0216, v_num=pbmZ, val/mse=0.421, val/loss=0.421, test/mse=0.266, test/loss=0.266, train/mse=0.0242, train/loss=0.0242]
Epoch 9, global step 4809: 'val/loss' was not in top 1
Epoch 9: 100%|██| 1511/1511 [00:24<00:00, 62.14it/s, loss=0.0216, v_num=pbmZ, val/mse=0.421, val/loss=0.421, test/mse=0.266, test/loss=0.266, train/mse=0.0231, train/loss=0.0231]
Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7fbaccc77f06]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7fbaccc6f8e5]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7fbaccb94e09]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fbaccc78a3d]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7fbaccb92948]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fbaccc78a3d]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7fbaccb4db46]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7fbacc5b246a]
/lib/x86_64-linux-gnu/libc.so.6(+0x43031) [0x7fbb3f73f031]
/lib/x86_64-linux-gnu/libc.so.6(+0x4312a) [0x7fbb3f73f12a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7fbb3f71dc8e]
python(+0x2010a0) [0x56320cea20a0]
Aborted (core dumped)

So it does train, but it ends with this strange assertion error.
After looking around, it seems this error is reported by many people and relates to aws-sdk-cpp; for example, see:
huggingface/datasets#3310

@albertfgu
Contributor

Thanks for the additional info! Does this error occur if you uninstall the datasets package then? Does it only happen with AWS?

@danassou

I can't run the code without the datasets library since it's required; I get a "no module found" error if I do. To clarify, I'm not running my code on AWS; I'm using my university's cluster (to be honest, I don't really understand why AWS-related errors pop up!)

@albertfgu
Contributor

You should be able to remove the dataset dependency by deleting the "lra" import from src/dataloaders/__init__.py
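A minimal sketch of that one-line change, assuming the import in src/dataloaders/__init__.py is a plain relative import (the exact shape of the line is an assumption; check the actual file). Guarding the import instead of deleting it keeps the LRA loaders working in environments where datasets is installed:

```python
# Hypothetical sketch of the edit to src/dataloaders/__init__.py: make the
# LRA loaders (which pull in HuggingFace `datasets`, and through it pyarrow)
# optional instead of removing the import outright.
try:
    from . import lra  # assumed import shape; may differ in the real file
except ImportError:
    lra = None  # the forecasting experiments don't need the LRA dataloaders
```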

@gitbooo
Author

gitbooo commented Aug 30, 2022

You should be able to remove the dataset dependency by deleting the "lra" import from src/dataloaders/__init__.py

The code seems to work on CPU without errors. However, I get a KeyError: 'nvrtc' with pykeops installed. Can you tell us which pykeops version you are using?

@albertfgu
Contributor

  1. Does it run when pykeops is uninstalled?
  2. Are you able to install the CUDA extension instead?
  3. Can you try pip install pykeops==1.5? Later versions of pykeops sometimes cause installation errors for me.
  4. What happens if you follow the instructions on the pykeops page for testing the installation?
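For point 4, the pykeops documentation exposes built-in installation self-tests. A guarded sketch that is safe to paste into any environment (it falls back to a flag when pykeops isn't importable, rather than crashing):

```python
# Run the pykeops installation self-tests from its docs; the import guard
# means this snippet degrades gracefully where pykeops isn't installed.
try:
    import pykeops
    pykeops.test_numpy_bindings()  # compiles and runs a tiny NumPy formula
    pykeops.test_torch_bindings()  # same check through the torch bindings
    keops_ok = True
except ImportError:
    keops_ok = False
```

If either self-test fails with a CUDA-related message, that usually points at the same broken compilation the RuntimeError below complains about.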

@gitbooo
Author

gitbooo commented Aug 31, 2022

  • When pykeops is uninstalled, the code works without any error, but only on CPU.
  • I followed the steps to install the CUDA extension, but I'm still receiving [2022-08-31 15:42:03,450][src.models.sequence.ss.kernel][WARNING] - CUDA extension for Cauchy multiplication not found. Install by going to extensions/cauchy/ and running python setup.py install. This should speed up end-to-end training by 10-50%
  • I'm no longer getting the KeyError 'nvrtc', but I'm getting this instead:
RuntimeError: [KeOps] This KeOps shared object has been compiled without cuda support: 
 1) to perform computations on CPU, simply set tagHostDevice to 0
 2) to perform computations on GPU, please recompile the formula with a working version of cuda.
  • I passed the tests successfully.

@farshchian

farshchian commented Sep 4, 2022

I am also facing the exact same issue. @gitbooo, have you found a solution?

@albertfgu
Contributor

  1. Without pykeops, the code should still run on GPU. Is there a reason you can only use CPU?
  2. I don't know why the extension isn't working. One note: it has to be installed separately for every environment (e.g. for each different GPU, CUDA version, etc.). For example, it doesn't work if different machines share a conda environment; you would need to create a separate conda environment for each machine type and install the extension in each one.
  3. I've seen that message several times in the past, and I think it was always caused by an improper install. Installing in a fresh environment along with the latest version of cmake was the solution (pip install pykeops==1.5 cmake).
  4. Were you able to comment out the datasets dependency? It should involve changing one line of code in src/dataloaders/__init__.py
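For background on point 1: the "Cauchy multiplication" that the extension and pykeops accelerate is essentially the kernel sum k(z) = Σᵢ wᵢ / (z − xᵢ), and without either backend the code can fall back to a naive broadcasted version. A NumPy sketch along those lines (function name, real-valued inputs, and shapes here are illustrative, not the repo's actual API, which works with complex tensors):

```python
import numpy as np

def cauchy_naive(w, x, z):
    # w: (N,) weights, x: (N,) poles, z: (M,) evaluation points.
    # Broadcasts to an (M, N) matrix of terms w_i / (z_j - x_i), then
    # sums over i -- memory-hungry at O(M*N), which is exactly what the
    # fused CUDA/KeOps kernels avoid.
    return (w[None, :] / (z[:, None] - x[None, :])).sum(axis=-1)

w = np.array([1.0, 2.0])
x = np.array([0.0, 1.0])
z = np.array([2.0, 3.0])
print(cauchy_naive(w, x, z))  # approx [2.5, 1.3333]: 1/2 + 2/1 and 1/3 + 2/2
```

This is only to show what the kernel computes; on GPU without pykeops or the extension, the repo's own fallback handles this with the same asymptotic memory cost.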
