
[C++] Fatal error condition occurred in aws_thread_launch #20039

Closed
asfimport opened this issue Dec 16, 2021 · 7 comments
Comments


asfimport commented Dec 16, 2021

Hi, I am randomly getting the following error when first running inference with a TensorFlow model and then writing the result to a .parquet file:

Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/<user>/miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x7ffb14235f19]
/home/<user>/miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x7ffb14227098]
/home/<user>/miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) [0x7ffb1406ea43]
/home/<user>/miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad]
/home/<user>/miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) [0x7ffb1406c35a]
/home/<user>/miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad]
/home/<user>/miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x7ffb142a2f5a]
/home/<user>/miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x7ffb147fd570]
/lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27]
/lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba]
/home/<user>/miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) [0x562576609a51]
/bin/bash: line 1: 2341494 Aborted                 (core dumped)

My colleague ran into the same issue on CentOS 8 while running the same job and environment on SLURM, so I suspect an interaction between TensorFlow and PyArrow.

I also found a GitHub issue where multiple people ran into the same problem:
huggingface/datasets#3310

Resolving this bug is very important to my lab, as we cannot work with Parquet anymore. Unfortunately, we do not have the knowledge to fix it ourselves.

Environment:

  • uname -a:
    Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

  • mamba list | grep -i "pyarrow\|tensorflow\|^python"
    pyarrow 6.0.0 py39hff6fa39_1_cpu conda-forge
    python 3.9.7 hb7a2778_3_cpython conda-forge
    python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
    python-flatbuffers 1.12 pyhd8ed1ab_1 conda-forge
    python-irodsclient 1.0.0 pyhd8ed1ab_0 conda-forge
    python-rocksdb 0.7.0 py39h7fcd5f3_4 conda-forge
    python_abi 3.9 2_cp39 conda-forge
    tensorflow 2.6.2 cuda112py39h9333c2f_0 conda-forge
    tensorflow-base 2.6.2 cuda112py39h7de589b_0 conda-forge
    tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0 conda-forge
    tensorflow-gpu 2.6.2 cuda112py39h0bbbad9_0 conda-forge

Reporter: F. H.
Assignee: Uwe Korn / @xhochy

Related issues:

Externally tracked issue: huggingface/datasets#3310

Note: This issue was originally created as ARROW-15141. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
Thanks for the report. This is very likely this issue: aws/aws-sdk-cpp#1809


Antoine Pitrou / @pitrou:
Since you are using conda/mamba, a workaround should be to switch to an older version of aws-sdk-cpp such as aws-sdk-cpp=1.8.186.
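For reference, a sketch of how that downgrade might look with mamba (package spec as published on conda-forge; the pinned-file path assumes a standard conda environment layout and is illustrative):

```shell
# Downgrade aws-sdk-cpp to the last known-good version in the active environment.
mamba install -c conda-forge "aws-sdk-cpp=1.8.186"

# Optionally persist the pin so later solver runs keep the old SDK
# (conda honors a plain-text "pinned" file inside the environment).
echo "aws-sdk-cpp ==1.8.186" >> "$CONDA_PREFIX/conda-meta/pinned"
```

The pinned file only constrains future solves in that environment; it does not change already-installed packages.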


F. H.:
Thanks for the context @pitrou , we will try your suggestion :)


F. H.:
Indeed installing aws-sdk-cpp=1.8.186 seems to have fixed the issue (y)


Uwe Korn / @xhochy:
@pitrou I would simply rebuild all pyarrow conda versions with the old SDK until we see a fix for this. It would be nice to have a Linux reproducer for this in the conda recipe; currently, the code that fails on Windows passes there.


H. Vetinari:
A couple of days ago we released Arrow 10.0.1 in conda-forge built against aws-sdk-cpp 1.9 (a version that should contain the fix for this issue). Could someone ([~hoeze]?) verify whether the problem still exists with v10? Then we could backport the 1.9 change to the maintenance branches, which would make the situation with the OpenSSL 3 migration much easier.
