Fatal error condition occurred in aws-c-io #3310
Hi! Are you having this issue only with this specific dataset, or does it also happen with other ones?
@lhoestq It happens also on
I tested it on Ubuntu and it's working OK. Didn't test on the non-preview version of Windows 11.
I see the same error in Windows-10.0.19042 as of a few days ago:
python 3.8.12 h7840368_2_cpython conda-forge
...but I am not using datasets. The error has occurred a few times over the last two days, but not consistently enough for me to get it with DEBUG. If there is any interest I can report back here, but it seems not unique to datasets.
I'm not sure what
Agreed, this issue is not likely a bug in datasets, since I get the identical error without datasets installed.
I have also had this issue for a few days, when running scripts using PyCharm in particular. It does not seem to stop the script from running; it only reports this error at the end of the run.
I also get this issue. It appears after my script has finished running, and I get the following error message:
I don't get this issue when running my code in a container. It seems more relevant to PyArrow, but I thought a more complete stack trace might be helpful to someone.
I created an issue on JIRA:
@CallumMcMahon Do you have a small reproducer for this problem on Linux? I can reproduce this on Windows, but sadly not with Linux.
Any updates on this issue? I started receiving the same error a few days ago on the amazon reviews
Hi, I also ran into this issue, Windows only. It caused our massive binary to minidump left and right, very annoying. The fatal_assert ends in an abort/minidump:

AWS_FATAL_ASSERT(

Digging through the code, I found that aws_thread_launch in the Windows version (aws-c-common/source/windows/thread.c) has only ONE reason to return anything other than AWS_OP_SUCCESS: `return aws_raise_error(AWS_ERROR_THREAD_INSUFFICIENT_RESOURCE);` on line 263, when CreateThread fails.

Our conclusion was that, apparently, Windows dislikes launching a new thread while already handling the exit handlers. And while I appreciate that the fatal_assert is there in case of problems, the cure here is worse than the problem. I "fixed" this in our (Windows) environment by (bluntly) removing the AWS_FATAL_ASSERT. If Windows cannot start a thread, the program is in deep trouble anyway, and the chances of that actually happening are acceptable (to us).

A neater fix would probably be to detect somehow that the program is actually in the process of exiting, and then not bother (on Windows, anyway) to start a cleanup thread. Alternatively, try to start the thread but not fatal-assert when it fails during exit. Or perhaps Windows can be convinced somehow to start the thread under these circumstances?

@xhochy: The problem is Windows-only; aws_thread_launch has two implementations (posix and windows). The problem is in the Windows CreateThread, which fails.
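[Editor's note] The "neater fix" suggested in this comment can be sketched in Python as an analogue only; the real code is C inside aws-c-common, and the helper name `launch_cleanup` is hypothetical. The idea: detect that the process is already shutting down and run the cleanup inline, instead of fatally asserting when a thread launch fails during exit.

```python
import sys
import threading

def launch_cleanup(task):
    """Run a cleanup task, avoiding thread creation during interpreter exit.

    Python analogue of the suggested fix: during teardown, thread creation
    is unreliable, so execute the cleanup inline instead of asserting.
    """
    if sys.is_finalizing():
        task()  # run inline; no new thread during shutdown
        return None
    t = threading.Thread(target=task)
    t.start()
    return t

done = []
worker = launch_cleanup(lambda: done.append(True))
if worker is not None:
    worker.join()
# done == [True] whether or not a thread was used
```

At normal runtime `sys.is_finalizing()` is False, so the task runs on a fresh thread; the inline branch is only taken while the interpreter is exiting.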
I also encountered the same problem, but in my case the error occurs in a multi-GPU training environment on Linux; a single-GPU training environment does not produce the error.
Any updates for your issue because I'm getting the same one |
Potentially related AWS issue: aws/aws-sdk-cpp#1809

Ran into this issue today while training a BPE tokenizer on a dataset. Train code:

```python
"""Train a ByteLevelBPETokenizer based on a given dataset. The dataset must be on the HF Hub.

This script is adapted from the Transformers example in
https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling
"""
from os import PathLike
from pathlib import Path
from typing import Sequence, Union

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer


def train_tokenizer(dataset_name: str = "oscar", dataset_config_name: str = "unshuffled_deduplicated_nl",
                    dataset_split: str = "train", dataset_textcol: str = "text",
                    vocab_size: int = 50265, min_frequency: int = 2,
                    special_tokens: Sequence[str] = ("<s>", "<pad>", "</s>", "<unk>", "<mask>"),
                    dout: Union[str, PathLike] = "."):
    # Load dataset
    dataset = load_dataset(dataset_name, dataset_config_name, split=dataset_split)

    # Instantiate tokenizer
    tokenizer = ByteLevelBPETokenizer()

    def batch_iterator(batch_size=1024):
        for i in range(0, len(dataset), batch_size):
            yield dataset[i: i + batch_size][dataset_textcol]

    # Customized training
    tokenizer.train_from_iterator(batch_iterator(), vocab_size=vocab_size, min_frequency=min_frequency,
                                  special_tokens=special_tokens)

    # Save to disk
    pdout = Path(dout).resolve()
    pdout.mkdir(exist_ok=True, parents=True)
    tokenizer.save_model(str(pdout))


def main():
    import argparse

    cparser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    cparser.add_argument("dataset_name", help="Name of dataset to use for tokenizer training")
    cparser.add_argument("--dataset_config_name", default=None,
                         help="Name of the config to use for tokenizer training")
    cparser.add_argument("--dataset_split", default=None,
                         help="Name of the split to use for tokenizer training (typically 'train')")
    cparser.add_argument("--dataset_textcol", default="text",
                         help="Name of the text column to use for tokenizer training")
    cparser.add_argument("--vocab_size", type=int, default=50265, help="Vocabulary size")
    cparser.add_argument("--min_frequency", type=int, default=2, help="Minimal frequency of tokens")
    cparser.add_argument("--special_tokens", nargs="+", default=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
                         help="Special tokens to add. Useful for specific training objectives. Note that if you wish"
                              " to use this tokenizer with a default transformers.BartConfig, then make sure that the"
                              " order of at least these special tokens is correct: BOS (0), padding (1), EOS (2)")
    cparser.add_argument("--dout", default=".", help="Path to directory to save tokenizer.json file")

    train_tokenizer(**vars(cparser.parse_args()))


if __name__ == "__main__":
    main()
```

Command:

```shell
WDIR="your_tokenizer"
python prepare_tokenizer.py dbrd --dataset_config_name plain_text --dataset_split unsupervised --dout $WDIR
```

Output:
Running on datasets==2.4.0 and pyarrow==9.0.0 on RHEL 8.
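[Editor's note] The `batch_iterator` closure in the script above walks the dataset in fixed-size slices. A minimal standalone sketch of that pattern, with a plain list standing in for the HF dataset (an assumption for illustration):

```python
def batch_iterator(data, batch_size=3):
    # Yield consecutive slices of at most batch_size items;
    # the final batch may be shorter than batch_size.
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

batches = list(batch_iterator(list(range(7)), batch_size=3))
# batches -> [[0, 1, 2], [3, 4, 5], [6]]
```

This is why `train_from_iterator` can consume arbitrarily large datasets without materializing all texts at once: each step only pulls one slice.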
There is also a discussion here, https://issues.apache.org/jira/browse/ARROW-15141, where it is suggested that conda users pin an older version of aws-sdk-cpp:
Downgrading pyarrow to 6.0.1 solves the issue for me.
First of all, I’d never call a downgrade a solution, at most a (very) temporary workaround.
Furthermore, this bug also happens outside pyarrow: I incorporate AWS in a standalone Windows C program, and that crashes during exit.
Very much so! It looks like an apparent fix for the underlying problem might have landed, but it sounds like it might still be a bit of a lift to get it into aws-sdk-cpp.
Sidenote: On conda-forge side, all recent pyarrow releases (all the way up to v9 and soon v10) have carried the respective pin and will not run into this issue.
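[Editor's note] For anyone checking whether their install predates the release this thread reports as fixed, a small hedged helper; the 10.0.1 cutoff is taken from the discussion here, not from an authoritative changelog, and `predates_fix` is a hypothetical name:

```python
def predates_fix(version: str, fixed=(10, 0, 1)) -> bool:
    # Compare the leading numeric components of a dotted version string
    # against the conda-forge pyarrow release reported fixed in this thread.
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts < fixed

print(predates_fix("9.0.0"))   # True
print(predates_fix("10.0.1"))  # False
```

In practice one would pass in `pyarrow.__version__` (assuming pyarrow is importable and its version string is purely numeric).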
For pip people, I confirmed that installing the nightly version of pyarrow also solves this by:
Do you have a reproducer you could share? I'd like to test if the new versions that supposedly solve this actually do, but we don't have a way to test it...
Hi,
No – sorry. It is part of a massive eco-system which cannot easily be shared.
But I think the problem was summarized quite clearly: Windows does not allow a CreateThread while doing ExitProcess.
The cleanup that gets called as part of the exit handler code tries to start a thread, the fatal-assert on that causes the crash, and in windows we get a very big dump file.
The fix I applied simply removes that fatal assert, and that solves the problem for me.
I did not delve into what the thread was trying to achieve, or whether skipping it during process exit might cause issues. We did not notice anything of the kind.
However, we *did* notice the many, many gigabytes of accumulated dumps of hundreds of processes 😊
I’ll try and upgrade to the latest AWS version and report my findings, but that will be after I return from a month of vacationing…
Regards, Ruurd Beerstra
OK, was worth a try...
This seems to be what awslabs/aws-c-io#515 did upstream.
Caution: aws-sdk-cpp hasn't yet upgraded its bundled(?) aws-c-io, and hence doesn't contain the fix, AFAICT.
Hi, I also encountered the same problem, but I hit the error on Ubuntu. At that time, I found my version of pyarrow was 9.0.0.

As it happens, I found that this error message appears whenever I import from transformers. For example, with the following code:

```python
from transformers import Trainer

print('Hugging Face')
```

I get the following error message:

```
Hugging Face
Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7fa9add1df06]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7fa9add158e5]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7fa9adc3ae09]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fa9add1ea3d]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7fa9adc38948]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fa9add1ea3d]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7fa9adbf3b46]
/home/ubuntu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7fa9ad65846a]
/lib/x86_64-linux-gnu/libc.so.6(+0x468d7) [0x7faa2fcfe8d7]
/lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7faa2fcfea90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7faa2fcdc0ba]
/home/ubuntu/anaconda3/envs/pytorch38/bin/python(+0x1f9ad7) [0x5654571d1ad7]
```

But when I remove the import, the error disappears. So why?

Environment info
Not sure what's going on, but that shouldn't happen, especially as we're pinning to a version that should avoid this. Can you please open an issue at https://github.com/conda-forge/arrow-cpp-feedstock, including the requested output of
pyarrow 10.0.1 was just released in conda-forge, which is the first release where we're building against aws-sdk-cpp 1.9.* again after more than a year. Since we cannot test the failure reported here on our infra, I'd be very grateful if someone could verify that the problem does or doesn't reappear. 🙃
The problem is gone after installing the new version. Thanks!
@liuchaoqun, with
Describe the bug
Fatal error when using the library.
Steps to reproduce the bug
Expected results
No fatal errors.
Actual results
Environment info
datasets version: 1.15.2.dev0