Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix intermittent startup issues on slow systems #3803

Merged
merged 1 commit into from
Oct 6, 2023

Conversation

jefferai
Copy link
Member

@jefferai jefferai commented Oct 5, 2023

With a suitably slow system -- say, if you're waiting on a Windows Terminal to output text inside a virtual machine -- the expected order of things can be thrown off. One specific way this can happen is that the worker can attempt to dial and have its dial fail because of the slowness causing the dial context to time out. If, because of the slowness, we also haven't released the log gate yet, then the event with the failure information will be queued, along with the context that was used.

Unfortunately, in some of these error cases, the context that was used was the dial context instead of the system base context. In many other places in the function it was the system base context so this is just a mismatch, probably from code written at different times.

Normally this wouldn't be a problem as we'd fall back to the underlying logger, but when we release the log gate, things happen differently: it's a synchronous function and on error it causes us to abandon system startup entirely.

This commit fixes the issue with using the incorrect context. It's an open question whether we should change the behavior around errors when replaying queued events, falling back to the underlying logger instead of erroring.

@jefferai jefferai added this to the 0.14.x milestone Oct 6, 2023
With a suitably slow system -- say, if you're waiting on a Windows
Terminal to output text inside a virtual machine -- the expected order
of things can be thrown off. One specific way this can happen is that
the worker can attempt to dial and have its dial fail because of the
slowness causing the dial context to time out. If, because of the
slowness, we also haven't released the log gate yet, then the event with
the failure information will be queued, along with the context that was
used.

Unfortunately, in some of these error cases, the context that was used
was the dial context instead of the system base context. In many other
places in the function it was the system base context so this is just a
mismatch, probably from code written at different times.

Normally this wouldn't be a problem as we'd fall back to the underlying
logger, but when we release the log gate, things happen differently:
it's a synchronous function and on error it causes us to abandon
system startup entirely.

This commit fixes the issue with using the incorrect context. It's an
open question whether we should change the behavior around errors when
replaying queued events, falling back to the underlying logger instead
of erroring.
@jefferai jefferai merged commit 602bbcc into main Oct 6, 2023
13 checks passed
@jefferai jefferai deleted the jefferai-fix-event-gating branch October 6, 2023 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants