Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RCORE-2234 Crash in dart due to debug output when app is being torn down #7985

Merged
merged 16 commits into from
Aug 22, 2024

Conversation

michael-wb
Copy link
Contributor

@michael-wb michael-wb commented Aug 14, 2024

What, How & Why?

The static mutex used to add thread safety to the StderrLogger was sometimes causing a crash on app exit when the mutex was destroyed before the stderr logger shared pointer instance was deallocated.

These changes remove the mutex altogether and use the inherent thread safety in std::cerr to prevent commingling the debug messages.

Fixes #7969

☑️ ToDos

  • 📝 Changelog update
  • [ ] 🚦 Tests (or not relevant)
  • [ ] C-API, if public C++ API changed
  • [ ] bindgen/spec.yml, if public C++ API changed

@michael-wb michael-wb self-assigned this Aug 14, 2024
@cla-bot cla-bot bot added the cla: yes label Aug 14, 2024
Copy link

coveralls-official bot commented Aug 14, 2024

Pull Request Test Coverage Report for Build michael.wilkersonbarker_1354

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on mwb/logger-crash-during-teardown at 91.127%

Totals Coverage Status
Change from base Build 2575: 91.1%
Covered Lines: 217450
Relevant Lines: 238624

💛 - Coveralls

@michael-wb michael-wb marked this pull request as ready for review August 16, 2024 00:01
Comment on lines 190 to 197
void StderrLogger::do_write(std::string&& output)
{
REALM_ASSERT(m_log_mutex);
// Lock the mutex to avoid comingling the logger output messages
std::lock_guard l(*m_log_mutex.get());
// std::cerr is unbuffered, so no need to flush
std::cerr << cat.get_name() << " - " << get_level_prefix(level) << message << '\n'; // Throws
std::cerr << output << std::endl; // Throws
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added so the StderrLogger and the TimestampStderrLogger use the same static mutex when writing to stderr.

void StderrLogger::do_write(std::string&& output)
{
REALM_ASSERT(m_log_mutex);
// Lock the mutex to avoid comingling the logger output messages
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something i've never understood is why we need this mutex at all. std::cerr is threadsafe on its own. if the C++ standard library is working correctly, you should never see comingling of logger output messages to stderr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the mutex - a single << call to std::cerr is thread safe, so output message is created first and then pushed to stderr as a single string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be clear, i meant this as a question rather than a statement. is the reason we added the mutex here because we were seeing interleaving or something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some interleaving in the past during the tests. For std::cerr, each individual streaming operator call is thread safe, but chained operators are not, allowing context changes to happen before the entire message is printed.
e.g., std::cerr << "string a" << "string b"; ==> std::cerr << "string a"; std::cerr << "string b";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But unfortunately, MacOS TSAN is not happy with writing to std::cerr from two threads, but it is fine with fprintf() and supposedly it is faster then std::cerr. I updated StderrLogger to use fprintf() instead.

// std::cerr is unbuffered, so no need to flush
std::cerr << cat.get_name() << " - " << get_level_prefix(level) << message << '\n'; // Throws
std::cerr << output << std::endl; // Throws
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment right above says that you don't need to do std::endl because cerr is unbuffered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, bad habit - reverted to '\n'

@@ -52,13 +52,6 @@ using namespace std::string_literals;
using Catch::Matchers::StartsWith;
using nlohmann::json;

static auto audit_logger =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this need to change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to remove the instances of static loggers, but it doesn't matter here - reverted the change

@@ -9,15 +9,15 @@
namespace realm {
namespace util {

class TimestampStderrLogger : public Logger {
class TimestampStderrLogger : public StderrLogger {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the TimestampStderrLogger is completely unused outside of tests. Can we just remove it?


void StderrLogger::do_write(std::string&& output)
{
REALM_ASSERT(m_log_mutex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When could this assert ever be false?

@jbreams
Copy link
Contributor

jbreams commented Aug 16, 2024

as a more general high level question - did we get to the bottom of why this was crashing in dart? destruction of statics is undefined behavior, but generally i think we've seen that happen while the process is exiting, which implies dart maybe isn't tearing realm down cleanly since there are threads out there still running while the process exits? do we know what's actually going on here?

@michael-wb
Copy link
Contributor Author

as a more general high level question - did we get to the bottom of why this was crashing in dart? destruction of statics is undefined behavior, but generally i think we've seen that happen while the process is exiting, which implies dart maybe isn't tearing realm down cleanly since there are threads out there still running while the process exits? do we know what's actually going on here?

I tried reproducing the issue with several versions of realm-dart, but I was never able to reproduce it in a macos app or in the ios simulator. From the stack trace, however, I believe it is clear that the crash is occurring because the static mutex is no longer valid when DB::close_internal() tries to print a log message at the detail level when a realm object is being closed.

@jbreams
Copy link
Contributor

jbreams commented Aug 19, 2024

Tldr - bug fixes should be scoped narrowly, especially if we haven't been able to reproduce the crash. Going out and refactoring things as part of a speculative bugfix is how you make more bugs.

It's pretty disappointing that we never reproduced this despite having a sample project that demonstrates the problem. I think it's a good guess that this static mutex is being destroyed before all the databases have been torn down and that's causing problems, but it would be great to know whether that behavior in itself is a bug or if this is just weird behavior we'll have to work around.

All that said, I think the first approach of using a std::shared_ptr was valid. A simpler solution could have been to just treat this mutex the way we treat the default logger mutex and other static mutexes we really want to live forever by doing auto& stderr_logger_mutex = *new std::mutex;. I don't think we should refactor the StderrLogger with the do_write(std::string&&) method because that introduces a new allocation/copy in a potentially pretty hot path. I don't think we need to touch tests that are passing. I think the timestamp logger can just be left alone.

I also think it's fine to keep using std::cerr and have a mutex to synchronize access to it, I just wanted to understand why we needed it. The answer that we were seeing interpolation is totally valid. It looks like it was added during a general CI improvements PR, so I wasn't sure the context for it being added.

@michael-wb
Copy link
Contributor Author

Understood @jbreams - I looked further into how the realm shared lib was being loaded/unloaded. The realm library is loaded the first time the realmLib global variable is used (e.g., for calling a CAPI function), but it is never explicitly unloaded, so it is up to the Dart to decide when the realmLib object is destroyed and the library is unloaded. Being able to reproduce this may be dependent on the version of the dart compiler used which could change the order in which the library is unloaded.

I see that JS also has had some logger crashes (which may be due to the all issue) and I reached out to Kræn to see if it is still happening with the fix for PR #7934. If it doesn't fix the issue, then perhaps the JS SDK is also being affected by the mutex error and I asked him to see if these changes resolve the issue.

@michael-wb
Copy link
Contributor Author

I was able to reproduce the logger crash yesterday on macos using the Dart SDK 3.4.1, but I have been unable to create a local build of the Dart SDK that is able to compile a macos flutter app to verify the changes in this PR. I can build my flutter app for iOS with my local Dart SDK build, but I haven't been able to reproduce the logger crash on that platform.

I've reached out to the SDK team to see if they can help with my flutter app compile issue.

@michael-wb
Copy link
Contributor Author

Verified these changes address the crash in Dart - I've also reached out to the JS SDK team to see if this also helps with the logger crash they are seeing as well.

@jbreams
Copy link
Contributor

jbreams commented Aug 20, 2024

So what's going on? Why is the mutex getting destroyed when it is?

void StderrLogger::do_log(const LogCategory& cat, Level level, const std::string& message)
{
static Mutex mutex;
LockGuard l(mutex);
std::lock_guard l(m_mutex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we need to capture the global mutex as a member variable reference? Could the whole change for this ticket just be changing this line from static Mutex mutex; to static auto& mutex = *new Mutex; ? Or changing the mutex from a Mutex to a std::mutex as you do above and locking it here without changing any of the headers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure if there was a risk that the static auto& mutex variable could become invalid so I was saving a local reference to the mutex. Fortunately, this should be defined in the .data region of the application, so actually there shouldn't be any risk of it going away. Just verified with a simple test, so I'll remove the extra reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also verified that the minimal set of logger changes are work fine with the Dart SDK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand a bit on how the variable could become invalid and how taking a reference to it would prevent it from becoming invalid? Do you mean there was a risk the heap-allocated mutex could go out-of-scope and be destroyed without anything calling delete on it? How would taking a reference to it prevent that and not just lead to a dangling reference? How does being in the .data section of the image change things here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the global static reference (s_stderr_logger_mutex) variable being destroyed/going away during the app exit. But, since globals/statics like this are symbols defined in the program memory (.bss or .data sections), it will not go away during the lifetime of the app.

@michael-wb michael-wb merged commit 2424fde into master Aug 22, 2024
42 of 43 checks passed
@michael-wb michael-wb deleted the mwb/logger-crash-during-teardown branch August 22, 2024 17:43
@github-actions github-actions bot mentioned this pull request Aug 30, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash in dart due to debug output when app is being torn down
3 participants