Fix intermittent startup issues on slow systems (#3803)
With a suitably slow system -- say, if you're waiting on a Windows
Terminal to output text inside a virtual machine -- the expected order
of things can be thrown off. One specific way this can happen is that
the worker can attempt to dial and have its dial fail because of the
slowness causing the dial context to time out. If, because of the
slowness, we also haven't released the log gate yet, then the event with
the failure information will be queued, along with the context that was
used.

Unfortunately, in some of these error cases, the context passed along
with the event was the dial context instead of the system base context.
Many other places in the function already used the system base context,
so this is just a mismatch, probably from code written at different times.

Normally this wouldn't be a problem as we'd fall back to the underlying
logger, but releasing the log gate behaves differently: the release is
a synchronous function, and an error during it causes us to abandon
system startup entirely.

This commit fixes the issue with using the incorrect context. It's an
open question whether we should change the behavior around errors when
replaying queued events, falling back to the underlying logger instead
of erroring.
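
One possible shape for the fallback raised in the open question above is sketched here. This is purely hypothetical and is not something this commit implements: `replayWithFallback`, `queuedEvent`, and the `send` and `fallback` callbacks are names invented for the example.

```go
package main

import (
	"context"
	"fmt"
)

// queuedEvent is a hypothetical queued event plus the context it was written with.
type queuedEvent struct {
	ctx context.Context
	msg string
}

// replayWithFallback delivers each queued event through send. If send fails,
// the message goes to fallback and replay continues instead of returning the
// error to the caller.
func replayWithFallback(
	queue []queuedEvent,
	send func(context.Context, string) error,
	fallback func(string),
) (sent, fellBack int) {
	for _, e := range queue {
		if err := send(e.ctx, e.msg); err != nil {
			fallback(fmt.Sprintf("%s (event delivery failed: %v)", e.msg, err))
			fellBack++
			continue
		}
		sent++
	}
	return sent, fellBack
}

// demo replays one event with a live context and one with a canceled context.
func demo() (sent, fellBack int, logged []string) {
	expired, cancel := context.WithCancel(context.Background())
	cancel() // stands in for a dial context that is no longer usable

	queue := []queuedEvent{
		{ctx: context.Background(), msg: "worker starting"},
		{ctx: expired, msg: "dial failed"},
	}
	// send is a stand-in sink that refuses events whose context is done.
	send := func(ctx context.Context, _ string) error { return ctx.Err() }
	fallback := func(line string) { logged = append(logged, line) }

	sent, fellBack = replayWithFallback(queue, send, fallback)
	return sent, fellBack, logged
}

func main() {
	sent, fellBack, logged := demo()
	fmt.Println("sent:", sent, "fell back:", fellBack)
	for _, line := range logged {
		fmt.Println("fallback log:", line)
	}
}
```

The trade-off in a design like this is that startup would no longer abort on a replay error, at the cost of some events reaching only the fallback logger.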
jefferai authored Oct 6, 2023
1 parent fc0d434 commit 602bbcc
Showing 2 changed files with 7 additions and 5 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -56,6 +56,8 @@ Canonical reference for changes, improvements, and bugfixes for Boundary.
 * cli: Fix issue when using the `authenticate` command against a password auth
 method on Windows where the password would be swallowed when the login name is
 submitted ([PR](https://github.com/hashicorp/boundary/pull/3800))
+* worker: Fix an issue that could cause intermittent startup issues on slow
+systems ([PR](https://github.com/hashicorp/boundary/pull/3803))

 ## 0.13.1 (2023/07/10)

10 changes: 5 additions & 5 deletions internal/daemon/worker/controller_connection.go
@@ -162,12 +162,12 @@ func (w *Worker) upstreamDialerFunc(extraAlpnProtos ...string) func(context.Cont
 default:
 // In this case, event, so that the operator can understand that
 // it was rejected
-event.WriteError(ctx, op, fmt.Errorf("controller rejected activation token as invalid"))
+event.WriteError(w.baseContext, op, fmt.Errorf("controller rejected activation token as invalid"))
 return nil, errors.Wrap(w.baseContext, err, op)
 }

 default:
-event.WriteError(ctx, op, err)
+event.WriteError(w.baseContext, op, err)
 return nil, errors.Wrap(w.baseContext, err, op)
 }

@@ -176,7 +176,7 @@ func (w *Worker) upstreamDialerFunc(extraAlpnProtos ...string) func(context.Cont
 w.everAuthenticated.Store(authenticationStatusFirstAuthentication)
 }

-event.WriteSysEvent(ctx, op, "worker has successfully authenticated")
+event.WriteSysEvent(w.baseContext, op, "worker has successfully authenticated")
 }

 return conn, err
@@ -204,13 +204,13 @@ func (w *Worker) v1KmsAuthDialFn(ctx context.Context, addr string, extraAlpnProt
 written, err := tlsConn.Write([]byte(authInfo.ConnectionNonce))
 if err != nil {
 if err := nonTlsConn.Close(); err != nil {
-event.WriteError(ctx, op, err, event.WithInfoMsg("error closing connection after writing failure"))
+event.WriteError(w.baseContext, op, err, event.WithInfoMsg("error closing connection after writing failure"))
 }
 return nil, fmt.Errorf("unable to write connection nonce: %w", err)
 }
 if written != len(authInfo.ConnectionNonce) {
 if err := nonTlsConn.Close(); err != nil {
-event.WriteError(ctx, op, err, event.WithInfoMsg("error closing connection after writing failure"))
+event.WriteError(w.baseContext, op, err, event.WithInfoMsg("error closing connection after writing failure"))
 }
 return nil, fmt.Errorf("expected to write %d bytes of connection nonce, wrote %d", len(authInfo.ConnectionNonce), written)
 }