Aws::Crt::Io::ClientBootstrap destructor may launch thread at process exit, and crash #1809

Closed
2 tasks done
pitrou opened this issue Nov 9, 2021 · 18 comments
Labels
bug This issue is a bug. needs-reproduction This issue needs reproduction. needs-review This issue or pull request needs review from a core team member. p2 This is a standard priority issue

Comments

@pitrou

pitrou commented Nov 9, 2021

Confirm by changing [ ] to [x] below to ensure that it's a bug:

Describe the bug
We're getting a user report of a crash, seemingly at process shutdown, on Windows:
conda-forge/arrow-cpp-feedstock#567

Apparently the ClientBootstrap destructor can indirectly trigger the launch of a new thread using aws_thread_launch. The thread launch fails at process shutdown, at least on Windows, triggering an assertion error and therefore a process crash.

SDK version number
1.9.120

Platform/OS/Hardware/Device
Windows/10.0.17763
(also reported on CentOS 8 and Ubuntu: https://issues.apache.org/jira/browse/ARROW-15141)

To Reproduce (observed behavior)
Basically conda-forge/arrow-cpp-feedstock#567 (comment), but I'm not sure what the exact steps are (I'm not the original reporter).

Expected behavior
Failing to launch a thread at process shutdown should probably not crash the process.
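
To make the expected behavior concrete, here is a minimal Python sketch of a teardown routine that prefers a helper thread but falls back to synchronous cleanup when the thread cannot be started. It is an analogy only, not the aws-c-io implementation; `destroy_event_loops`, `start_real_thread`, and `refuse_thread` are hypothetical names introduced for illustration.

```python
import threading
from typing import Callable, Optional


def destroy_event_loops(
    cleanup: Callable[[], None],
    start_thread: Callable[[Callable[[], None]], threading.Thread],
) -> Optional[threading.Thread]:
    """Run `cleanup` on a helper thread if possible, otherwise inline.

    Returns the started thread, or None if cleanup ran synchronously.
    """
    try:
        return start_thread(cleanup)
    except RuntimeError:
        # Thread creation can fail while the process is shutting down.
        # Degrade to synchronous cleanup instead of aborting the process.
        cleanup()
        return None


def start_real_thread(target: Callable[[], None]) -> threading.Thread:
    """Normal launcher: start a real thread running `target`."""
    thread = threading.Thread(target=target)
    thread.start()
    return thread


def refuse_thread(target: Callable[[], None]) -> threading.Thread:
    """Hypothetical stand-in for a launcher that fails at process exit."""
    raise RuntimeError("can't start new thread")
```

With `start_real_thread` the cleanup runs on the returned thread; with `refuse_thread` it runs inline and the function returns `None`, so a failed launch is handled rather than treated as fatal.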

@pitrou pitrou added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Nov 9, 2021
@pitrou
Author

pitrou commented Nov 9, 2021

@xhochy

@kylekeppler

kylekeppler commented Nov 10, 2021

Does not happen with SDK version 1.8.186.

@jdblischak

In case it could help narrow down the source of the bug, I tested a few different versions of aws-sdk-cpp on CentOS 7:

  • No issue: 1.8.186
  • Bug: 1.9.120, 1.9.140

@KaibaLopez
Contributor

Hi @jdblischak ,
Can you share how you are reproducing this?
In the post linked by the op there seems to be a fix provided by conda-forge so I'm wondering if this is on their side rather than the sdk?

@KaibaLopez KaibaLopez added needs-reproduction This issue needs reproduction. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. and removed needs-triage This issue or PR still needs to be triaged. labels Dec 15, 2021
@KaibaLopez KaibaLopez self-assigned this Dec 15, 2021
@xhochy

xhochy commented Dec 15, 2021

The fix on the conda-forge side was to revert back to the 1.8.186 SDK version. With the current issue, we cannot use a newer SDK on Windows.

@jdblischak

@KaibaLopez Thanks for following up

> In the post linked by the op there seems to be a fix provided by conda-forge so I'm wondering if this is on their side rather than the sdk?

As @xhochy commented, the conda-forge workaround is to pin to an older version of aws-sdk-cpp. Personally I fixed it by specifying aws-sdk-cpp=1.8.186=h9ad65fb_2 for my conda env.

> Can you share how you are reproducing this?

I was able to reproduce the bug using the code below:

mamba create -n test-aws python=3.9 pandas=1.2 pyarrow=2.0 aws-sdk-cpp=1.9.120
conda activate test-aws
python test-arrow.py

where test-arrow.py is the reproducible example script copied from conda-forge/arrow-cpp-feedstock#567

import numpy as np
import pandas as pd

def test_error():

    df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

    df.to_parquet('test.parquet')

if __name__ == '__main__':
    test_error()

Here is the full error and traceback that I observe:

% python test-arrow.py
Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x2aaac1581f19]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x2aaac1573098]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) [0x2aaac17bca43]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x2aaac1583fad]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) [0x2aaac17ba35a]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x2aaac1583fad]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x2aaac1526f5a]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x2aaac0faa570]
/lib64/libc.so.6(+0x39c99) [0x2aaaab835c99]
/lib64/libc.so.6(+0x39ce7) [0x2aaaab835ce7]
/lib64/libc.so.6(__libc_start_main+0xfc) [0x2aaaab81e50c]
python(+0x20aa51) [0x55555575ea51]
Aborted
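
The mangled symbol in the trace, `_ZN3Aws3Crt2Io15ClientBootstrapD1Ev`, is what ties the abort to the `ClientBootstrap` destructor. As a rough illustration, the sketch below decodes only the small subset of Itanium C++ ABI mangling needed for that frame (plain nested names plus constructor/destructor markers, no parameters). `demangle_simple` is a hypothetical helper; a real tool such as `c++filt` handles the full grammar.

```python
import re


def demangle_simple(symbol: str) -> str:
    """Demangle names of the form _ZN<len><id>...[C<n>|D<n>]Ev.

    Only nested identifiers and constructor/destructor markers are
    supported, and the parameter list must be 'v' (no parameters).
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos = 3
    parts = []
    while pos < len(symbol) and symbol[pos] != "E":
        match = re.match(r"\d+", symbol[pos:])
        if match:
            # <length><identifier>, e.g. "15ClientBootstrap"
            length = int(match.group())
            pos += len(match.group())
            parts.append(symbol[pos:pos + length])
            pos += length
        elif symbol[pos] in "CD" and parts and symbol[pos + 1:pos + 2].isdigit():
            # C<n> is a constructor, D<n> a destructor of the enclosing class
            prefix = "~" if symbol[pos] == "D" else ""
            parts.append(prefix + parts[-1])
            pos += 2
        else:
            raise ValueError("unsupported mangling construct")
    if symbol[pos + 1:] != "v":
        raise ValueError("only parameterless functions are supported")
    return "::".join(parts) + "()"


print(demangle_simple("_ZN3Aws3Crt2Io15ClientBootstrapD1Ev"))
# prints: Aws::Crt::Io::ClientBootstrap::~ClientBootstrap()
```

Read bottom-up, the trace therefore shows libc exit handling invoking code in `libaws-cpp-sdk-core.so`, which runs the `ClientBootstrap` destructor, which releases reference counts until `aws_fatal_assert` fires.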

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Dec 16, 2021
@pitrou
Author

pitrou commented Dec 16, 2021

Update: this error has been reported on Ubuntu and CentOS as well: https://issues.apache.org/jira/browse/ARROW-15141

@pitrou pitrou changed the title from "Aws::Crt::Io::ClientBootstrap destructor may launch thread at process exit, and crash on Windows" to "Aws::Crt::Io::ClientBootstrap destructor may launch thread at process exit, and crash" on Dec 16, 2021
@jeroen

jeroen commented Jan 6, 2022

@ihnorton have you run into something similar for tiledb?

@ihnorton

ihnorton commented Jan 6, 2022

No, we are still on 1.8, and that backtrace does not ring a bell.

@asp200

asp200 commented Jan 17, 2022

Hi. Just to say we have hit this exact same issue using v1.9.72 within our in-house build at MathWorks.

@ihnorton

We also see this on Windows now (while updating).

https://github.com/awslabs/aws-c-io/blob/b5cad3d21018e84a5084d6e191661fa604b49f0c/source/event_loop.c#L73-L75

  • aws_thread_launch uses the Win32 CreateThread API:

https://github.com/awslabs/aws-c-common/blob/cba230815132f53206c501874e03a286765fb225/source/windows/thread.c#L258-L259

  • it is documented here that CreateThread is not valid when the process is exiting:

The ExitProcess, ExitThread, CreateThread, CreateRemoteThread functions, and a process that is starting (as the result of a call by CreateProcess) are serialized between each other within a process. Only one of these events can happen in an address space at a time. This means that the following restrictions hold:

  • you can see a backtrace here where this error message happens after ExitProcess has been called

zagto pushed a commit to zagto/arrow that referenced this issue Oct 7, 2022
Because the latest AWS SDK C++ has a problem:
aws/aws-sdk-cpp#1809

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kkraus14

Was this resolved by awslabs/aws-c-io#515?

fatemehp pushed a commit to fatemehp/arrow that referenced this issue Oct 17, 2022
Because the latest AWS SDK C++ has a problem:
aws/aws-sdk-cpp#1809

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@h-vetinari

h-vetinari commented Oct 25, 2022

> Was this resolved by awslabs/aws-c-io#515?

I don't know this repo, but from looking around, it seems that:

  • the referenced PR is part of aws-c-io 0.13.5
  • latest aws-sdk-cpp release seems to use aws-c-io 0.10.20

The last update of the aws-c-io version used in this repo was in June (NB: at which point 5 newer releases of aws-c-io would have been already available). Perhaps @sdavtaker (author of last update) can illuminate the process of what's necessary to update the respective dependencies.
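
The version gap described above can be checked mechanically. The sketch below is a hedged illustration that compares dotted version strings as integer tuples; `parse_version`, `contains_fix`, and `FIX_RELEASE` are hypothetical names, and the assumption that aws-c-io 0.13.5 is the first release containing awslabs/aws-c-io#515 is taken from this comment rather than verified independently.

```python
def parse_version(text: str) -> tuple:
    """Turn '0.13.5' into (0, 13, 5) so comparison is numeric, not lexical."""
    return tuple(int(part) for part in text.split("."))


# Assumption from this thread: the referenced PR is part of aws-c-io 0.13.5.
FIX_RELEASE = "0.13.5"


def contains_fix(bundled: str, fix: str = FIX_RELEASE) -> bool:
    """Return True if the bundled aws-c-io version is at least the fix release."""
    return parse_version(bundled) >= parse_version(fix)


print(contains_fix("0.10.20"))  # False: the version the SDK is said to use
print(contains_fix("0.13.5"))   # True
```

Comparing tuples avoids the pitfall of plain string comparison, where for example "0.9.0" would sort after "0.13.5".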

This issue is biting us quite hard in conda-forge, made worse by the fact that aws-sdk-cpp 1.8 does not seem compatible anymore with current versions of the rest of the aws-c-* stack (which we need to unbundle for several reasons).

I also noticed that this problem is still regularly discussed in other repos, e.g. huggingface/datasets#3310

Furthermore: This bug also happens outside pyarrow. I incorporate AWS in a standalone Windows C program, and that crashes during exit.

So it would be really good if we could upgrade aws-c-io here and then determine if that actually fixes things...

@h-vetinari

h-vetinari commented Dec 4, 2022

pyarrow 10.0.1 was just released in conda-forge, which is the first release where we're building against aws-sdk-cpp 1.9.* again after more than a year. Since we cannot test the failure reported here on our infra, I'd be very grateful if someone could verify that the problem does or doesn't reappear. 🙃

conda install -c conda-forge pyarrow=10

Edit: if things are fine, I'm happy to backport this to arrow 6.x-9.x.

@jdblischak

Confirmed. Thanks @h-vetinari! See reproducible example at conda-forge/arrow-cpp-feedstock#567 (comment)

@jmklix jmklix removed their assignment Apr 17, 2023
@jmklix jmklix added p2 This is a standard priority issue and removed p1 This is a high priority issue labels May 1, 2023
@cardinotGV

In case someone else is still facing it...

I had the same issue; in my case, the cause was that Aws::ShutdownAPI was not being called correctly.

https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/basic-use.html

@jmklix
Member

jmklix commented Sep 18, 2023

As @cardinotGV said, please make sure that you are calling InitAPI and ShutdownAPI correctly:

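#include <aws/core/Aws.h>

int main(int argc, char** argv)
{
   Aws::SDKOptions options;
   Aws::InitAPI(options);  // must run before any other SDK call
   {
      // Make your SDK calls here. The extra scope ensures that all SDK
      // objects are destroyed before ShutdownAPI runs.
   }
   Aws::ShutdownAPI(options);  // call only after all SDK objects are gone
   return 0;
}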

If you are still running into any crashes at process exit, please let me know.

@jmklix jmklix closed this as completed Sep 18, 2023
