
[CI][Python] wheel-manylinux2014-* sometimes crashed on pytest exit #15054

Closed
kou opened this issue Dec 20, 2022 · 32 comments · Fixed by #33858

Comments

@kou
Member

kou commented Dec 20, 2022

Describe the bug, including details regarding any error messages, version, and platform.

wheel-manylinux2014-cp38-amd64 2022-12-20 nightly:

https://github.com/ursacomputing/crossbow/actions/runs/3738813751/jobs/6345300403#step:9:775

= 4294 passed, 382 skipped, 18 xfailed, 2 xpassed, 4 warnings in 522.05s (0:08:42) =
/arrow/ci/scripts/python_wheel_unix_test.sh: line 95:    79 Segmentation fault      (core dumped) python -m pytest -r s --pyargs pyarrow

wheel-manylinux2014-cp39-arm64 2022-12-20 nightly:

https://app.travis-ci.com/github/ursacomputing/crossbow/builds/259026563#L5708

= 4324 passed, 352 skipped, 18 xfailed, 2 xpassed, 1 warning in 141.20s (0:02:21) =
/arrow/ci/scripts/python_wheel_unix_test.sh: line 95:    15 Segmentation fault      (core dumped) python -m pytest -r s --pyargs pyarrow

Component(s)

Continuous Integration, Packaging, Python

@raulcd
Member

raulcd commented Jan 9, 2023

This is still happening; there were core dumps on these wheel jobs yesterday:
wheel-manylinux2014-cp310-arm64
wheel-manylinux2014-cp311-amd64
wheel-manylinux2014-cp37-amd64
wheel-manylinux2014-cp39-amd64
wheel-manylinux2014-cp39-arm64
I am marking this as a release blocker.

@lidavidm
Member

lidavidm commented Jan 13, 2023

I can sort of reproduce; you have to go through the release verification process. I'm still trying to get GDB inside this process so I can try to get a backtrace (the test that SIGINTs itself is giving me a bit of trouble). I can't yet reproduce with a 22.04 container directly; I'm trying to make sure the setup is exactly the same.

…and it doesn't crash under GDB, hmm.

@lidavidm
Member

And now I can no longer get the crash to reproduce either, so I'm not sure I can make much headway on this.

@assignUser
Copy link
Member

This is now the last non-archery blocker. Given how difficult it is to reproduce, should it still stay a blocker?

@raulcd
Copy link
Member

raulcd commented Jan 17, 2023

It does seem to have happened on the latest test-fedora-35-python-3 on Azure too: https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=42753&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=5623

===== 6762 passed, 356 skipped, 18 xfailed, 2 xpassed in 295.11s (0:04:55) =====
/arrow/ci/scripts/python_test.sh: line 60: 22828 Segmentation fault      (core dumped) pytest -r s ${PYTEST_ARGS} --pyargs pyarrow
139

I am not entirely sure this is a blocker for the release at this point. It's quite difficult to reproduce, it's flaky (it doesn't always happen), and it only seems to happen on test exit (all the tests themselves pass). I would probably remove it as a release blocker and keep investigating it. @kou @lidavidm what are your thoughts about moving it out of the release?

@lidavidm
Member

lidavidm commented Jan 17, 2023

I am OK with this, though we will have to note it at verification time. (And actually, I think it will make binary verification quite painful...)

Can we take a more systematic approach to solving this? Would it be possible, for instance, to automatically backtrace failing tests, or to upload a core dump? In #33720 I'm going to try enabling both PYTHONFAULTHANDLER and catchsegv to see if we can get any more info on what's happening.

@lidavidm
Member

I think it's an AWS SDK issue again!

I enabled catchsegv here: https://github.com/ursacomputing/crossbow/actions/runs/3939847223/jobs/6740188523

The crashing instruction is at 0x7fed49842bb7:

Backtrace:
/usr/local/lib/python3.9/site-packages/pyarrow/libarrow.so.1100(+0x17b5bb7)[0x7fed49842bb7]

libarrow.so is mapped to 0x7fed4808d000:

7fed4808d000-7fed484d7000 r--p 00000000 00:2c 2138445 /usr/local/lib/python3.9/site-packages/pyarrow/libarrow.so.1100

Subtracting the two (0x7fed49842bb7 - 0x7fed4808d000 = 0x17b5bb7), we're looking for offset 0x17b5bb7 in libarrow.so:

$ objdump --demangle -d -T libarrow.so.1100
...
00000000017b5b90 <Aws::Http::CurlHandleContainer::~CurlHandleContainer()>:
 17b5b90:	41 57                	push   %r15
 17b5b92:	49 89 ff             	mov    %rdi,%r15
 17b5b95:	41 56                	push   %r14
 17b5b97:	41 55                	push   %r13
 17b5b99:	41 54                	push   %r12
 17b5b9b:	55                   	push   %rbp
 17b5b9c:	53                   	push   %rbx
 17b5b9d:	48 81 ec 08 02 00 00 	sub    $0x208,%rsp
 17b5ba4:	e8 37 b3 f9 ff       	callq  1750ee0 <Aws::Utils::Logging::GetLogSystem()>
 17b5ba9:	48 85 c0             	test   %rax,%rax
 17b5bac:	74 15                	je     17b5bc3 <Aws::Http::CurlHandleContainer::~CurlHandleContainer()+0x33>
 17b5bae:	48 89 c5             	mov    %rax,%rbp
 17b5bb1:	48 8b 00             	mov    (%rax),%rax
 17b5bb4:	48 89 ef             	mov    %rbp,%rdi
 17b5bb7:	ff 50 10             	callq  *0x10(%rax)
...

@lidavidm
Member

Reconstructed backtrace:

Backtrace:
libarrow.so.1100(+0x17b5bb7)[0x7fed49842bb7] Aws::Http::CurlHandleContainer::~CurlHandleContainer()
libarrow.so.1100(+0x17a1d83)[0x7fed4982ed83] Aws::Http::CurlHttpClient::~CurlHttpClient()
libarrow.so.1100(+0x591c6a)[0x7fed4861ec6a]  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
libarrow.so.1100(+0x16d882f)[0x7fed4976582f] Aws::S3::S3Client::~S3Client()
libarrow.so.1100(+0x591c6a)[0x7fed4861ec6a]  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
libarrow.so.1100(+0x12d5a94)[0x7fed49362a94] std::_Sp_counted_ptr<arrow::fs::(anonymous namespace)::RegionResolver*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
libarrow.so.1100(+0x591c6a)[0x7fed4861ec6a]  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()

…maybe we could just leak the S3 client as a workaround?
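For illustration, here is a minimal sketch of what "leaking" could look like, using a hypothetical helper rather than Arrow's actual code: copy the client's shared_ptr onto the heap and intentionally never delete it, so the reference count can never reach zero and the destructor never runs during static teardown.

// Illustrative sketch only; LeakClient is a made-up helper, not an Arrow API.
#include <memory>

template <typename Client>
Client* LeakClient(std::shared_ptr<Client> client) {
  // Copy the shared_ptr onto the heap and intentionally never delete it.
  // The reference count can then never reach zero, so ~Client() never runs,
  // even during static destruction at process exit.
  auto* keeper = new std::shared_ptr<Client>(std::move(client));
  return keeper->get();
}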

@kou
Member Author

kou commented Jan 17, 2023

aws/aws-sdk-cpp#1833 ?

@kou
Member Author

kou commented Jan 17, 2023

I'm OK with removing the blocker label from this.

@raulcd
Member

raulcd commented Jan 17, 2023

I am not too familiar with this, but as far as I understand, this is not something we introduced in this release, right? I can't see any change in how we deal with the S3Client, and we have not changed the version of aws-sdk-cpp yet: #20272

@kou
Member Author

kou commented Jan 17, 2023

Right.

@raulcd raulcd added the Priority: Critical label and removed the Priority: Blocker label Jan 18, 2023
@raulcd raulcd modified the milestones: 11.0.0, 12.0.0 Jan 18, 2023
@raulcd
Member

raulcd commented Jan 19, 2023

As @lidavidm suggested, this is making release verification (#33751) a bit painful.
I had to retry wheel-manylinux2014-cp39-amd64 11 times for it to succeed (one of the early failures was due to a failure uploading the artifact): https://github.com/ursacomputing/crossbow/actions/runs/3949372445/jobs/6760493478
And verify-rc-binaries-wheels-linux-conda-latest-amd64 is currently still failing with the crash on pytest exit after several retries: https://github.com/ursacomputing/crossbow/actions/runs/3954649758/jobs/6772203240
Those are the only jobs that have hit the issue as far as I am aware, but it has been quite consistent on them. I wanted to raise this so it is taken into account in the release vote.

@pitrou
Member

pitrou commented Jan 19, 2023

…maybe we could just leak the S3 client as a workaround?

We can probably do that indeed. We just have to make sure to disable S3 on ASAN and Valgrind CI jobs.

@paleolimbot
Member

FWIW, this looks very similar to (and may be the same as) #15189, which the R package fixed by skipping tests on platforms with a very old SSL runtime (macOS 10.13 was the culprit for us).

@pitrou
Copy link
Member

pitrou commented Jan 19, 2023

Judging by the reconstructed stack trace (thanks @lidavidm!), this has nothing to do with OpenSSL but rather with calling a logging method. The faulting instruction is the indirect call through the log system's vtable just after GetLogSystem() returned non-null, which suggests the log system has already been destroyed at that point. The destructor is:

CurlHandleContainer::~CurlHandleContainer()
{
    AWS_LOGSTREAM_INFO(CURL_HANDLE_CONTAINER_TAG, "Cleaning up CurlHandleContainer.");
    for (CURL* handle : m_handleContainer.ShutdownAndWait(m_poolSize))
    {
        AWS_LOGSTREAM_DEBUG(CURL_HANDLE_CONTAINER_TAG, "Cleaning up " << handle);
        curl_easy_cleanup(handle);
    }
}

The definition of AWS_LOGSTREAM_INFO is:

    #define AWS_LOGSTREAM_INFO(tag, streamExpression) \
        { \
            Aws::Utils::Logging::LogSystemInterface* logSystem = Aws::Utils::Logging::GetLogSystem(); \
            if ( logSystem && logSystem->GetLogLevel() >= Aws::Utils::Logging::LogLevel::Info ) \
            { \
                Aws::OStringStream logStream; \
                logStream << streamExpression; \
                logSystem->LogStream( Aws::Utils::Logging::LogLevel::Info, tag, logStream ); \
            } \
        }

@westonpace
Member

Looks like there is a global static RegionResolver that has an S3Client, and that S3Client is being destroyed after S3 has already been destroyed?
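For context, a minimal sketch of the suspected hazard, with made-up names standing in for the real RegionResolver and SDK types: a static object that owns an SDK client is destroyed during static teardown, possibly after the SDK itself has already been shut down, so the client's destructor touches freed SDK state.

// Illustrative sketch of the suspected static-destruction-order hazard.
// These are made-up stand-ins, not Arrow's or the AWS SDK's actual types.
#include <memory>

struct SdkClient {
  // In the real SDK, this destructor calls back into SDK internals
  // (e.g. the curl handle container and the logging system).
  ~SdkClient() {}
};

// Analogous to the anonymous-namespace RegionResolver: a process-lifetime
// cache that owns an SDK client.
struct RegionResolverLike {
  std::shared_ptr<SdkClient> client = std::make_shared<SdkClient>();
};

// Destroyed during static teardown, in an order that is unspecified relative
// to statics in other translation units; if the SDK has already been shut
// down by then, ~SdkClient() operates on freed state and can crash.
static RegionResolverLike g_resolver;

int main() { return 0; }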

@pitrou
Member

pitrou commented Jan 19, 2023

Yes, it's probably that. Though "S3 has already been destroyed" is a bit vague (is it after DLL unload?).

@westonpace
Member

Ah! No, a C atexit hook would probably not work. I'm talking about a Python atexit hook. However, there is the opposite issue that it might be called too early (potentially letting Arrow threads make calls into a finalized AWS SDK).

Perhaps it is time then for an arrow::Shutdown method?

@pitrou
Member

pitrou commented Jan 19, 2023

Perhaps it is time then for an arrow::Shutdown method?

That wouldn't really solve the issue I mentioned, would it? :-)

@westonpace
Member

That wouldn't really solve the issue I mentioned, would it? :-)

A python atexit hook could call arrow::EnsureShutdown. In arrow::Shutdown we would first shut down the thread pools. Then we would shut down S3.
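A rough sketch of that ordering, with placeholder functions rather than real Arrow APIs: join the thread pools first so no Arrow thread can still reach S3, and only then tear down the SDK.

// Rough sketch of the proposed ordering; both helpers are placeholders,
// not existing Arrow APIs.
namespace arrow {

void WaitForThreadPoolsToFinish() { /* join the CPU and I/O thread pools */ }
void FinalizeS3Sdk() { /* call Aws::ShutdownAPI() exactly once */ }

// Intended to be reached from a Python atexit hook via a pyarrow binding,
// after user code has finished.
void EnsureShutdown() {
  WaitForThreadPoolsToFinish();  // 1. no Arrow thread can still touch S3
  FinalizeS3Sdk();               // 2. only now is the AWS SDK torn down
}

}  // namespace arrow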

@westonpace
Member

@pitrou @lidavidm

I prototyped a shutdown mechanism this morning (9d62b2c) and I think it could work, but I realized there might be another way to solve this problem, as we ran into a similar issue with OpenTelemetry.

We added the concept of Executor::KeepAlive, a method that registers a resource with our thread pool instance and keeps that resource alive until all threads have exited. Our OpenTelemetry code then uses this to ensure that OpenTelemetry is not shut down until all threads have exited.

We can extend this mechanism to S3. The main drawback is that a call to FinalizeS3 will not actually call Aws::ShutdownAPI until process exit (after the thread pool has shut down). Would this be an acceptable workaround? It would allow us to safely use a Python atexit hook to call FinalizeS3.
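To make the idea concrete, here is a simplified sketch of a KeepAlive-style mechanism (not the real Executor interface, just the shape of it): the pool holds opaque shared_ptr resources and only drops them after all worker threads have exited, so anything registered this way outlives every task.

// Simplified sketch of a KeepAlive-style mechanism; not the real Executor API.
#include <memory>
#include <mutex>
#include <vector>

class ExecutorLike {
 public:
  // Register an opaque resource that must outlive every worker thread.
  void KeepAlive(std::shared_ptr<void> resource) {
    std::lock_guard<std::mutex> lock(mutex_);
    kept_alive_.push_back(std::move(resource));
  }

  // Called after all worker threads have been joined; clearing the vector
  // releases the kept-alive resources, so their destructors run only here.
  void Shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    kept_alive_.clear();
  }

 private:
  std::mutex mutex_;
  std::vector<std::shared_ptr<void>> kept_alive_;
};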

@lidavidm
Member

How would that work in this case? Something which keeps the AWS SDK alive until after all S3 clients have been destroyed?

@westonpace
Member

How would that work in this case? Something which keeps the AWS SDK alive until after all S3 clients have been destroyed?

  • S3's finalization would be guarded by a shared_ptr (or some kind of reference count). One copy would live with the thread pools (as a resource), that is, one copy with each thread pool, and one copy would live with S3Finalize.
  • Python's atexit would call S3Finalize, which would decrement the reference count by 1.
  • As the process exits, the thread pool is shut down as part of the normal destruction of static state. Once it finishes, it clears its "resources", which would include the S3 finalizer.
  • The reference count would now drop to 0 and it should be safe to finalize S3 (presumably the user is done because we have hit Python's atexit, and presumably Arrow is done because the thread pools are shut down).
  • We would then call Aws::ShutdownAPI in the destructor run by the shared_ptr (a sketch of this follows below).
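Putting the bullet points above together, a simplified, self-contained sketch (names and integration points are assumptions, not the actual #33858 patch):

// Simplified sketch of the reference-counted finalizer; names are assumptions.
#include <cstdio>
#include <memory>
#include <vector>

struct S3FinalizerToken {
  // Runs only when the LAST copy of the shared_ptr is released; in the real
  // fix this is where Aws::ShutdownAPI() would be called.
  ~S3FinalizerToken() { std::printf("AWS SDK finalized\n"); }
};

// One copy is held on behalf of FinalizeS3 (released from Python's atexit) ...
std::shared_ptr<S3FinalizerToken> g_finalize_handle =
    std::make_shared<S3FinalizerToken>();

// ... and one copy per thread pool, registered as a keep-alive resource that
// the pool drops only after all of its worker threads have exited.
std::vector<std::shared_ptr<void>> g_thread_pool_resources{g_finalize_handle};

void FinalizeS3() { g_finalize_handle.reset(); }  // refcount 2 -> 1

int main() {
  FinalizeS3();                     // Python atexit: the user is done with S3
  g_thread_pool_resources.clear();  // thread pools shut down: refcount 1 -> 0
  // Only now does ~S3FinalizerToken() run, after both conditions hold.
  return 0;
}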

@pitrou
Member

pitrou commented Jan 24, 2023

It sounds like we can give this a try, @westonpace.

@westonpace westonpace added the Priority: Blocker label Apr 11, 2023
westonpace added a commit that referenced this issue Apr 15, 2023
…finished, add pyarrow exit hook (#33858)

CRITICAL FIX: When statically linking error with AWS it was possible to have a crash on shutdown/exit.  Now that should no longer be possible.

BREAKING CHANGE: S3 can only be initialized and finalized once.

BREAKING CHANGE: S3 (the AWS SDK) will not be finalized until after all CPU & I/O threads are finished.
* Closes: #15054

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace modified the milestones: 12.0.0, 13.0.0 Apr 15, 2023
raulcd pushed a commit that referenced this issue Apr 17, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
mattw-nws added a commit to mattw-nws/ngen that referenced this issue May 17, 2023
mattw-nws added a commit to NOAA-OWP/ngen that referenced this issue May 17, 2023