Fix intermittent startup issues on slow systems #3803
Merged
With a suitably slow system (say, if you're waiting on a Windows Terminal to output text inside a virtual machine) the expected order of things can be thrown off. One specific way this can happen is that the worker attempts to dial and the dial fails because the slowness causes the dial context to time out. If, because of the same slowness, we also haven't released the log gate yet, then the event carrying the failure information is queued, along with the context that was used to write it.
Unfortunately, in some of these error cases, the context that was used was the dial context instead of the system base context, and by the time the queued event is replayed the dial context has already timed out. Many other places in the function use the system base context, so this is just a mismatch, probably from code written at different times.
Normally this wouldn't be a problem, as we'd fall back to the underlying logger. Releasing the log gate behaves differently, though: it's a synchronous function, and an error while replaying queued events causes us to abandon system startup entirely.
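To make the failure mode concrete, here is a minimal, self-contained sketch of the sequence described above. It is not the project's actual code: `gate`, `queuedEvent`, and `startup` are hypothetical names, and the assumption that a replayed event is rejected when its stored context is already done is a simplification for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// queuedEvent is a hypothetical stand-in for an event held behind the log
// gate, stored together with the context it was written with.
type queuedEvent struct {
	ctx context.Context
	msg string
}

// gate is a hypothetical, heavily simplified log gate: events are queued
// until release is called.
type gate struct {
	queue []queuedEvent
}

func (g *gate) write(ctx context.Context, msg string) {
	g.queue = append(g.queue, queuedEvent{ctx: ctx, msg: msg})
}

// release replays queued events synchronously. In this sketch an event whose
// context is already done cannot be replayed, and the first such error is
// returned to the caller (assumed behavior, for illustration only).
func (g *gate) release() error {
	for _, e := range g.queue {
		if err := e.ctx.Err(); err != nil {
			return fmt.Errorf("replaying %q: %w", e.msg, err)
		}
		fmt.Println("event:", e.msg)
	}
	g.queue = nil
	return nil
}

// startup simulates a dial that times out while the gate is still closed,
// then releases the gate. useDialCtx selects which context the failure event
// is written with.
func startup(useDialCtx bool) error {
	baseCtx := context.Background()
	dialCtx, cancel := context.WithTimeout(baseCtx, time.Millisecond)
	defer cancel()

	<-dialCtx.Done() // the "slow system": the dial context expires

	g := &gate{}
	evCtx := baseCtx
	if useDialCtx {
		evCtx = dialCtx // the mismatch: an already expired context is queued
	}
	g.write(evCtx, "dial failed: "+dialCtx.Err().Error())

	// An error here is what would abort startup in the scenario above.
	return g.release()
}

func main() {
	fmt.Println("with dial context:", startup(true))
	fmt.Println("with base context:", startup(false))
}
```

In this sketch, writing the event with the base context lets the replay succeed, while the dial context makes `release` return an error.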
This commit fixes the issue with using the incorrect context. It's an open question whether we should change the behavior around errors when replaying queued events, falling back to the underlying logger instead of erroring.
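For discussion of the open question, the following is a rough sketch of what falling back to an underlying logger during replay could look like. It is an illustration under assumptions, not a proposal for the real implementation: `send`, `replayWithFallback`, and `demo` are hypothetical names, and the standard library `log.Logger` stands in for whatever the underlying logger is.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
)

// queuedEvent is a hypothetical queued event with the context it was
// written with.
type queuedEvent struct {
	ctx context.Context
	msg string
}

// send is a hypothetical event sink that refuses events whose context is
// already done.
func send(ctx context.Context, msg string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	fmt.Println("event:", msg)
	return nil
}

// replayWithFallback replays every queued event. When an event cannot be
// sent, it is written to the fallback logger instead of aborting the replay.
// It returns how many events went to the fallback.
func replayWithFallback(queue []queuedEvent, fallback *log.Logger) int {
	fellBack := 0
	for _, e := range queue {
		if err := send(e.ctx, e.msg); err != nil {
			fallback.Printf("event replay failed (%v): %s", err, e.msg)
			fellBack++
		}
	}
	return fellBack
}

// demo builds a queue with one healthy event and one whose context has
// already been cancelled, then replays it.
func demo() int {
	expired, cancel := context.WithCancel(context.Background())
	cancel()

	queue := []queuedEvent{
		{ctx: context.Background(), msg: "worker starting"},
		{ctx: expired, msg: "dial failed"},
	}
	return replayWithFallback(queue, log.New(os.Stderr, "fallback: ", 0))
}

func main() {
	fmt.Println("events sent to fallback logger:", demo())
}
```

The trade-off in this design is that a failing event no longer blocks startup, at the cost of that event not reaching the regular event sinks.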