[CI][Python] wheel-manylinux2014-* sometimes crashed on pytest exit #15054
This is still happening; there was a core dump on those wheel jobs yesterday:
I can sort of reproduce it; you have to go through the release verification process. Still trying to get GDB attached to this process so I can get a backtrace (the test that SIGINTs itself is giving me a bit of trouble). I can't yet reproduce with a 22.04 container directly; trying to make sure the setup is exactly the same. ...and it doesn't crash under gdb, hmm.
And now I can no longer get the crash to reproduce either, so I'm not sure I can make much headway on this.
This is now the last non-archery blocker. Given the difficulty of reproducing it, should it still stay a blocker?
It does seem to have happened on the latest test-fedora-35-python-3 on Azure too: https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=42753&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=5623
I am not entirely sure this is a blocker for the release at this point. It's quite difficult to reproduce and flaky (it doesn't always happen), and it only seems to happen on test exit (all tests are successful). I would probably remove it as a release blocker and keep investigating. @kou @lidavidm what are your thoughts about moving it out of the release?
I am OK with this, though we will have to note it at verification time. (And actually, I think it will make binary verification quite painful...) Can we get a more systematic approach to solving this? Would it be possible, for instance, to backtrace failing tests automatically, or upload a core dump? In #33720 I'm going to try enabling both PYTHONFAULTHANDLER and catchsegv to see if we can get any more info on what's happening.
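For context, PYTHONFAULTHANDLER=1 enables the standard faulthandler module, which dumps the Python-level traceback when the process receives a fatal signal; catchsegv complements it with the native register and memory-map dump. A minimal sketch of enabling the same thing from code:

```python
# Equivalent of running the test suite with PYTHONFAULTHANDLER=1: dump Python
# tracebacks on SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL, which at least
# tells us which test was executing when the crash happened.
import sys
import faulthandler

faulthandler.enable(file=sys.stderr, all_threads=True)
```

Note that faulthandler only shows Python frames; the native side still needs catchsegv or a core dump.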
I think it's an AWS SDK issue again! I enabled catchsegv here: https://github.com/ursacomputing/crossbow/actions/runs/3939847223/jobs/6740188523

The crashing instruction is at 0x7fed49842bb7:

libarrow.so is mapped to 0x7fed4808d000:

Subtracting the two, we're looking for 0x17b5bb7 in libarrow.so:
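For reference, the arithmetic above is just the faulting address minus the library's load address; the resulting offset can then be symbolized against an unstripped libarrow.so (for example with addr2line). A quick sketch using the addresses from the catchsegv output:

```python
# Offset of the faulting instruction inside libarrow.so.
crash_address = 0x7FED49842BB7  # faulting instruction (from catchsegv)
load_address = 0x7FED4808D000   # base address where libarrow.so is mapped
offset = crash_address - load_address
print(hex(offset))  # 0x17b5bb7

# The offset can then be resolved to a symbol, e.g.:
#   addr2line -e libarrow.so -f -C 0x17b5bb7
```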
Reconstructed backtrace:
…maybe we could just leak the S3 client as a workaround?
I'm OK with removing the blocker label from this.
I am not too familiar with this, but as far as I understand this is not something we have introduced in this release, right? I can't see any change in how we deal with it in this release.
Right.
As suggested by @lidavidm, this is making release verification (#33751) a bit painful.
We can probably do that indeed. We just have to make sure to disable S3 on the ASAN and Valgrind CI jobs.
FWIW this looks very similar to (and may be the same as) #15189, which the R package fixed by skipping on platforms with a very old SSL runtime (macOS 10.13 was the culprit for us).
Judging by the reconstructed stack trace (thanks @lidavidm!), this has nothing to do with OpenSSL but with calling a logging method:

```cpp
CurlHandleContainer::~CurlHandleContainer()
{
    AWS_LOGSTREAM_INFO(CURL_HANDLE_CONTAINER_TAG, "Cleaning up CurlHandleContainer.");
    for (CURL* handle : m_handleContainer.ShutdownAndWait(m_poolSize))
    {
        AWS_LOGSTREAM_DEBUG(CURL_HANDLE_CONTAINER_TAG, "Cleaning up " << handle);
        curl_easy_cleanup(handle);
    }
}
```

The definition of `AWS_LOGSTREAM_INFO` is:

```cpp
#define AWS_LOGSTREAM_INFO(tag, streamExpression) \
    { \
        Aws::Utils::Logging::LogSystemInterface* logSystem = Aws::Utils::Logging::GetLogSystem(); \
        if ( logSystem && logSystem->GetLogLevel() >= Aws::Utils::Logging::LogLevel::Info ) \
        { \
            Aws::OStringStream logStream; \
            logStream << streamExpression; \
            logSystem->LogStream( Aws::Utils::Logging::LogLevel::Info, tag, logStream ); \
        } \
    }
```
Looks like there is a global static log system involved, so presumably S3 has already been destroyed by the time this logging call runs.
Yes, it's probably that. Though "S3 has already been destroyed" is a bit vague (is it after DLL unload?).
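For intuition only, the same class of problem exists in Python at interpreter shutdown: a finalizer that reaches for a global which may already have been torn down. A toy analogue (names made up, not Arrow code):

```python
# Toy analogue of the destruction-order hazard above (not Arrow code): a
# finalizer that uses a module-level "log system" which may already have
# been torn down when the finalizer runs at interpreter exit.
_log_system = print  # stands in for the AWS SDK's global log system

class HandleContainer:
    def __del__(self):
        # At interpreter shutdown, module globals may already be cleared;
        # guard before using them, just as the C++ macro checks the pointer.
        if globals().get("_log_system") is not None:
            _log_system("Cleaning up HandleContainer.")

container = HandleContainer()  # finalized in some unspecified order at exit
```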
Perhaps it is time, then, for an explicit S3 finalization API?
That wouldn't really solve the issue I mentioned, would it? :-)
A Python atexit hook could call the finalization function at interpreter exit.
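A minimal sketch of that idea, assuming a finalization entry point is exposed to Python (recent pyarrow exposes `pyarrow.fs.finalize_s3()`; treat the exact name as an assumption if you are on an older release):

```python
# Sketch: finalize the AWS SDK before the interpreter tears down globals, so
# no S3 client destructor runs against an already-destroyed log system.
import atexit

import pyarrow.fs

def _shutdown_s3():
    # Assumption: finalize_s3() exists and is safe to call even if S3 was
    # never initialized; guard defensively just in case.
    try:
        pyarrow.fs.finalize_s3()
    except Exception:
        pass

atexit.register(_shutdown_s3)
```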
I prototyped a shutdown mechanism this morning (9d62b2c) and I think it could work, but I realized there might be another way we can solve this problem, as we ran into a similar problem with OpenTelemetry. We added a mechanism for that, and we can extend it to S3. The main drawback will be that the call to finalize S3 will have to wait until all CPU & I/O threads are finished.
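A rough, purely illustrative sketch of that shape of solution (the names are made up and this is not Arrow's actual implementation): register cleanup callbacks and only run them once the worker pools have drained.

```python
# Illustrative sketch only (made-up names, not Arrow's implementation):
# defer registered cleanup hooks until the worker pools have finished, so no
# task can touch the AWS SDK after it has been finalized.
from concurrent.futures import ThreadPoolExecutor

_cleanup_hooks = []

def register_cleanup(hook):
    """Register a callback to run at shutdown, after all workers finish."""
    _cleanup_hooks.append(hook)

def shutdown(pools):
    # 1. Drain the CPU and I/O pools; wait=True blocks until queued tasks finish.
    for pool in pools:
        pool.shutdown(wait=True)
    # 2. Only now run the deferred finalizers (e.g. S3 / AWS SDK shutdown).
    for hook in reversed(_cleanup_hooks):
        hook()

# Example wiring:
cpu_pool = ThreadPoolExecutor(max_workers=4)
io_pool = ThreadPoolExecutor(max_workers=8)
register_cleanup(lambda: print("finalize S3 here"))
shutdown([cpu_pool, io_pool])
```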
How would that work in this case? Something which keeps the AWS SDK alive until after all S3 clients have been destroyed? |
It sounds like we can give this a try, @westonpace.
…finished, add pyarrow exit hook (#33858)

CRITICAL FIX: When statically linking Arrow with the AWS SDK it was possible to have a crash on shutdown/exit. Now that should no longer be possible.
BREAKING CHANGE: S3 can only be initialized and finalized once.
BREAKING CHANGE: S3 (the AWS SDK) will not be finalized until after all CPU & I/O threads are finished.
* Closes: #15054
Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
… hopefully will be fixed in v13?
Describe the bug, including details regarding any error messages, version, and platform.
wheel-manylinux2014-cp38-amd64 2022-12-20 nightly:
https://github.com/ursacomputing/crossbow/actions/runs/3738813751/jobs/6345300403#step:9:775
wheel-manylinux2014-cp39-arm64 2022-12-20 nightly:
https://app.travis-ci.com/github/ursacomputing/crossbow/builds/259026563#L5708
Component(s)
Continuous Integration, Packaging, Python