Fatal error condition occurred in aws-c-io #3310

Closed
Crabzmatic opened this issue Nov 22, 2021 · 28 comments
Labels
bug Something isn't working

Comments

@Crabzmatic

Describe the bug

Fatal error when using the library

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset('wikiann', 'en')

Expected results

No fatal errors

Actual results

Fatal error condition occurred in D:\bld\aws-c-io_1633633258269\work\source\event_loop.c:74: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application

Environment info

  • datasets version: 1.15.2.dev0
  • Platform: Windows-10-10.0.22504-SP0
  • Python version: 3.8.12
  • PyArrow version: 6.0.0
@lhoestq
Member

lhoestq commented Nov 24, 2021

Hi! Are you having this issue only with this specific dataset, or does it also happen with other ones like squad?

@Crabzmatic
Author

@lhoestq It also happens on squad. It successfully downloads the whole dataset and then crashes with:

Fatal error condition occurred in D:\bld\aws-c-io_1633633258269\work\source\event_loop.c:74: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application

I tested it on Ubuntu and it's working OK. I didn't test on a non-preview version of Windows 11; Windows-10-10.0.22504-SP0 is a preview build, so I'm not sure whether that is causing it.

@leehaust

I see the same error in Windows-10.0.19042 as of a few days ago:

Fatal error condition occurred in D:\bld\aws-c-io_1633633258269\work\source\event_loop.c:74: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS

python 3.8.12 h7840368_2_cpython conda-forge
boto3 1.20.11 pyhd8ed1ab_0 conda-forge
botocore 1.23.11 pyhd8ed1ab_0 conda-forge

...but I am not using datasets (although I might take a look now that I know about it!)

The error has occurred a few times over the last two days, but not consistently enough for me to catch it with DEBUG. If there is any interest I can report back here, but it does not seem to be unique to datasets.

@lhoestq
Member

lhoestq commented Nov 29, 2021

I'm not sure what datasets has to do with a crash that seems related to aws-c-io; could it be an issue with your environment?

@leehaust

I'm not sure what datasets has to do with a crash that seems related to aws-c-io; could it be an issue with your environment?

Agreed, this issue is not likely a bug in datasets, since I get the identical error without datasets installed.

@Crabzmatic
Author

Crabzmatic commented Nov 29, 2021

I will close this issue; a bug in aws-c-io shouldn't be tracked in the datasets repo. Nevertheless, it can be useful to know that it happens. Thanks @leehaust @lhoestq

@vermouthmjl

I have also had this issue for a few days, in particular when running scripts from PyCharm, but it does not seem to prevent the script from running; the error is only reported at the end of the run.

@CallumMcMahon

CallumMcMahon commented Dec 6, 2021

I also get this issue; it appears after my script has finished running. I get the following error message:

Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1637179816120/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/user_name/conda_envs/env_name/lib/python3.7/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x2aabe0479579]
/home/user_name/conda_envs/env_name/lib/python3.7/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x2aabe04696c8]
/home/user_name/conda_envs/env_name/lib/python3.7/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x13ad3) [0x2aabe0624ad3]
/home/user_name/conda_envs/env_name/lib/python3.7/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x2aabe047b60d]
/home/user_name/conda_envs/env_name/lib/python3.7/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x113ca) [0x2aabe06223ca]
/home/user_name/conda_envs/env_name/lib/python3.7/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x2aabe047b60d]
/home/user_name/conda_envs/env_name/lib/python3.7/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x2aabe041cf5a]
/home/user_name/conda_envs/env_name/lib/python3.7/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x2aabe00eb570]
/lib64/libc.so.6(+0x39ce9) [0x2aaaab835ce9]
/lib64/libc.so.6(+0x39d37) [0x2aaaab835d37]
/lib64/libc.so.6(__libc_start_main+0xfc) [0x2aaaab81e55c]
python(+0x1c721d) [0x55555571b21d]
Aborted

I don't get this issue when running my code in a container, and it seems more relevant to PyArrow, but I thought a more complete stack trace might be helpful to someone.

@Hoeze

Hoeze commented Dec 16, 2021

I created an issue on JIRA:
https://issues.apache.org/jira/browse/ARROW-15141

@xhochy

xhochy commented Dec 17, 2021

@CallumMcMahon Do you have a small reproducer for this problem on Linux? I can reproduce this on Windows but sadly not on Linux.

@Marzipan78

Any updates on this issue? I started receiving the same error a few days ago on the Amazon reviews dataset.

@RuurdBeerstra

Hi,

I also ran into this issue, on Windows only. It caused our massive binary to minidump left and right, which was very annoying.
When the program is exiting, the destructors in the exit handlers want to do cleanup, leading to this code in event_loop.c, around line 73:

AWS_FATAL_ASSERT(
    aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) ==
    AWS_OP_SUCCESS);

The fatal assert ends in an abort/minidump.

Digging through the code, I found that aws_thread_launch in the Windows version (aws-c-common/source/windows/thread.c) has only ONE reason to return anything other than AWS_OP_SUCCESS:

return aws_raise_error(AWS_ERROR_THREAD_INSUFFICIENT_RESOURCE);

on line 263, when CreateThread fails. Our conclusion was that, apparently, Windows dislikes launching a new thread while already handling the exit handlers. And while I appreciate that the fatal assert is there in case of problems, the cure here is worse than the problem.

I "fixed" this in our (Windows) environment by (bluntly) removing the AWS_FATAL_ASSERT. If Windows cannot start a thread, the program is in deep trouble anyway and the chances of that actually happening are acceptable (to us).
The exit is going to clean up all resources anyway.

A neater fix would probably be to detect somehow that the program is actually in the process of exiting and then not bother (on Windows, anyway) to start a cleanup thread, as sketched below. Alternatively, try to start the thread but not fatal-assert when it fails during exit. Or perhaps Windows can be convinced somehow to start the thread under these circumstances?
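
To illustrate that first alternative, here is a minimal sketch (not the actual aws-c-io code; aws_process_is_exiting() is a hypothetical helper that does not exist in the library):

/* Sketch only: tolerate a failed cleanup-thread launch while the process is
 * exiting, instead of fatal-asserting. aws_process_is_exiting() is an assumed
 * helper, not an existing aws-c-io API. */
int launch_result = aws_thread_launch(
    &cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options);
if (launch_result != AWS_OP_SUCCESS) {
    /* Keep the original fatal behaviour during normal operation... */
    AWS_FATAL_ASSERT(aws_process_is_exiting());
    /* ...but if the process is already exiting, skip the async cleanup;
     * process teardown will reclaim the resources anyway. */
}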

@xhochy: The problem is Windows-only; aws_thread_launch has two implementations (POSIX and Windows), and the problem is in the Windows CreateThread call, which fails.

@leileilin

leileilin commented Aug 25, 2022

I also encountered the same problem, but I get the error in a multi-GPU training environment on Linux, while a single-GPU training environment does not produce the error.
I use the accelerate package for multi-GPU training.

@gitbooo

gitbooo commented Aug 26, 2022

I also get this issue; it appears after my script has finished running.

Any updates on your issue? I'm getting the same one.

@BramVanroy
Contributor

BramVanroy commented Sep 6, 2022

Potentially related AWS issue: aws/aws-sdk-cpp#1809

Ran into this issue today while training a BPE tokenizer on a dataset.

Train code:

"""Train a ByteLevelBPETokenizer based on a given dataset. The dataset must be on the HF Hub.
This script is adapted from the Transformers example in https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling
"""
from os import PathLike
from pathlib import Path
from typing import Sequence, Union

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer


def train_tokenizer(dataset_name: str = "oscar", dataset_config_name: str = "unshuffled_deduplicated_nl",
                    dataset_split: str = "train", dataset_textcol: str = "text",
                    vocab_size: int = 50265,  min_frequency: int = 2,
                    special_tokens: Sequence[str] = ("<s>", "<pad>", "</s>", "<unk>", "<mask>"),
                    dout: Union[str, PathLike] = "."):
    # load dataset
    dataset = load_dataset(dataset_name, dataset_config_name, split=dataset_split)
    # Instantiate tokenizer
    tokenizer = ByteLevelBPETokenizer()

    def batch_iterator(batch_size=1024):
        for i in range(0, len(dataset), batch_size):
            yield dataset[i: i + batch_size][dataset_textcol]

    # Customized training
    tokenizer.train_from_iterator(batch_iterator(), vocab_size=vocab_size, min_frequency=min_frequency,
                                  special_tokens=special_tokens)

    # Save to disk
    pdout = Path(dout).resolve()
    pdout.mkdir(exist_ok=True, parents=True)
    tokenizer.save_model(str(pdout))


def main():
    import argparse
    cparser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    cparser.add_argument("dataset_name", help="Name of dataset to use for tokenizer training")
    cparser.add_argument("--dataset_config_name", default=None,
                         help="Name of the config to use for tokenizer training")
    cparser.add_argument("--dataset_split", default=None,
                         help="Name of the split to use for tokenizer training (typically 'train')")
    cparser.add_argument("--dataset_textcol", default="text",
                         help="Name of the text column to use for tokenizer training")
    cparser.add_argument("--vocab_size", type=int, default=50265, help="Vocabulary size")
    cparser.add_argument("--min_frequency", type=int, default=2, help="Minimal frequency of tokens")
    cparser.add_argument("--special_tokens", nargs="+", default=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
                         help="Special tokens to add. Useful for specific training objectives. Note that if you wish"
                              " to use this tokenizer with a default transformers.BartConfig, then make sure that the"
                              " order of at least these special tokens are correct: BOS (0), padding (1), EOS (2)")
    cparser.add_argument("--dout", default=".", help="Path to directory to save tokenizer.json file")

    train_tokenizer(**vars(cparser.parse_args()))


if __name__ == "__main__":
    main()

Command:

WDIR="your_tokenizer"
python prepare_tokenizer.py dbrd --dataset_config_name plain_text --dataset_split unsupervised --dout $WDIR

Output:

Reusing dataset dbrd (cache/datasets/dbrd/plain_text/3.0.0/2b12e31348489dfe586c2d0f40694e5d9f9454c9468457ac9f1b51abf686eeb3)
[00:00:30] Pre-processing sequences                 ████████ 0        /        0
[00:00:00] Tokenize words                           ████████ 333319   /   333319
[00:01:06] Count pairs                              ████████ 333319   /   333319
[00:00:03] Compute merges                           ████████ 50004    /    50004

Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
venv/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x155106589f06]
venv/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x1551065818e5]
venv/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x1551064a6e09]
venv/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x15510658aa3d]
venv/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x1551064a4948]
venv/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x15510658aa3d]
venv/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x15510645fb46]
venv/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x155105ec446a]
/lib64/libc.so.6(+0x39b0c) [0x1551075b8b0c]
/lib64/libc.so.6(on_exit+0) [0x1551075b8c40]
/lib64/libc.so.6(__libc_start_main+0xfa) [0x1551075a249a]
python(_start+0x2e) [0x4006ce]
Aborted (core dumped)

Running on datasets==2.4.0 and pyarrow==9.0.0 on RHEL 8.

@lhoestq
Member

lhoestq commented Sep 6, 2022

There is also a discussion at https://issues.apache.org/jira/browse/ARROW-15141, where it is suggested that conda users use an older version of aws-sdk-cpp: aws-sdk-cpp=1.8.186
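
For conda users, applying that pin would presumably look something like this (assuming the conda-forge channel):

conda install -c conda-forge aws-sdk-cpp=1.8.186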

@boru-roylu

boru-roylu commented Sep 14, 2022

Downgrading pyarrow to 6.0.1 solves the issue for me.

pip install pyarrow==6.0.1

@RuurdBeerstra

RuurdBeerstra commented Sep 15, 2022 via email

@h-vetinari

First of all, I’d never call a downgrade a solution, at most a (very) temporary workaround.

Very much so! It looks like an apparent fix for the underlying problem might have landed, but it sounds like it might still be a bit of a lift to get it into aws-sdk-cpp.

Downgrading pyarrow to 6.0.1 solves the issue for me.

Sidenote: On the conda-forge side, all recent pyarrow releases (all the way up to v9 and soon v10) have carried the respective pin and will not run into this issue.

conda install -c conda-forge pyarrow

@kngwyu

kngwyu commented Oct 26, 2022

For pip users, I confirmed that installing the nightly version of pyarrow also solves this: pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ --prefer-binary --pre pyarrow --upgrade. (See https://arrow.apache.org/docs/python/install.html#installing-nightly-packages)
Any version after apache/arrow#14157 would work fine.

@h-vetinari

Furthermore: This bug also happens outside pyarrow, I incorporate AWS in a standalone Windows C-program and that crashes during exit.

Do you have a reproducer you could share? I'd like to test if the new versions that supposedly solve this actually do, but we don't have a way to test it...

@RuurdBeerstra

RuurdBeerstra commented Oct 28, 2022 via email

@h-vetinari

No – sorry. It is part of a massive eco-system which cannot easily be shared.

OK, was worth a try...

The fix I applied simply removes that fatal assert, that solves the problem for me.

This seems to be what awslabs/aws-c-io#515 did upstream.

I’ll try and upgrade to the latest AWS version and report my findings, but that will be after I return from a month of vacationing…

caution: aws-sdk-cpp hasn't yet upgraded its bundled(?) aws-c-io and hence doesn't contain the fix AFAICT

@Charon-HN

Charon-HN commented Oct 29, 2022

Hi, I also encountered the same problem, but I got the error on Ubuntu, without using datasets in the way @Crabzmatic wrote.

At that time, I found that my version of pyarrow was 9.0.0, which differs from the version suggested in the comment quoted below:

#3310 (comment)
Downgrading pyarrow to 6.0.1 solves the issue for me.

pip install pyarrow==6.0.1

As it happens, I found this error message when I imported the Trainer from HuggingFace transformers.

For example, I wrote the following code:

from transformers import Trainer
print('Hugging Face')

I get the following error message:

Hugging Face
Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7fa9add1df06]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7fa9add158e5]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7fa9adc3ae09]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fa9add1ea3d]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7fa9adc38948]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fa9add1ea3d]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7fa9adbf3b46]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7fa9ad65846a]
/lib/x86_64-linux-gnu/libc.so.6(+0x468d7) [0x7faa2fcfe8d7]
/lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7faa2fcfea90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7faa2fcdc0ba]
/home/ubuntu/anaconda3/envs/pytorch38/bin/python(+0x1f9ad7) [0x5654571d1ad7]

But when I remove the Trainer import from transformers, everything is OK.

So why?

Environment info

  • Platform: Ubuntu 18
  • Python version: 3.8
  • PyArrow version: 9.0.0
  • transformers: 4.22.1
  • simpletransformers: 0.63.9

@h-vetinari

I get the following error message:

Not sure what's going on, but that shouldn't happen, especially as we're pinning to a version that should avoid this.

Can you please open an issue at https://github.com/conda-forge/arrow-cpp-feedstock, including the requested output of conda list and conda info?

@h-vetinari

pyarrow 10.0.1 was just released in conda-forge, which is the first release where we're building against aws-sdk-cpp 1.9.* again after more than a year. Since we cannot test the failure reported here on our infra, I'd be very grateful if someone could verify that the problem does or doesn't reappear. 🙃

conda install -c conda-forge pyarrow=10

@liuchaoqun

pyarrow 10.0.1 was just released in conda-forge, which is the first release where we're building against aws-sdk-cpp 1.9.* again after more than a year. Since we cannot test the failure reported here on our infra, I'd be very grateful if someone could verify that the problem does or doesn't reappear. 🙃

conda install -c conda-forge pyarrow=10

The problem is gone after I installed the new version. Thanks!
pip install pyarrow==10

@h-vetinari

@liuchaoqun, with pip install pyarrow you don't get the AWS bindings; they're too complicated to package into wheels, as far as I know. And even if they were packaged, at the time of the pyarrow 10 release they would still have been pinned to aws-sdk-cpp 1.8 for the same reasons as in this issue.
