Skip to content

Reboot Discovery Listener#14487

Merged
nisdas merged 12 commits intodevelopfrom
rebootListener
Oct 10, 2024
Merged

Reboot Discovery Listener#14487
nisdas merged 12 commits intodevelopfrom
rebootListener

Conversation

@nisdas
Copy link
Copy Markdown
Contributor

@nisdas nisdas commented Sep 27, 2024

What type of PR is this?

Feature

What does this PR do? Why is it needed?

Allows users to enable the discovery rebooting feature in the event of connectivity drops. In specifically users running on windows in particular, constant peer drops have been reported. This PR adds in an automatic discovery rebooter which binds a new listener in that event. This is gated behind a feature flag so only affected users can run with it.

Which issues(s) does this PR fix?

Fixes #8144 #13936

Other notes for review

Acknowledgements

  • I have read CONTRIBUTING.md.
  • I have made an appropriate entry to CHANGELOG.md.
  • I have added a description to this PR with sufficient context for reviewers to understand this PR.

@nisdas nisdas marked this pull request as ready for review October 8, 2024 09:31
@nisdas nisdas requested a review from a team as a code owner October 8, 2024 09:31
@nisdas nisdas added Ready For Review windows Anything related only to Windows OS networking labels Oct 8, 2024
}

func (s *Service) isBelowThreshold() bool {
maxPeers := int(s.cfg.MaxPeers)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we cast uint to int.
Maybe it would be safer to do the opposite?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I did it this way is because some of the other methods that we call all return int instead of uint. We would have to cast those to uint too in this particular case. Since this is a user defined value I think it is fine to expect it to be a manageable value.

return l.listener.Lookup(id)
}

func (l *listenerWrapper) Resolve(node *enode.Node) *enode.Node {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not tested.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this particular method isn't used by us anywhere at all which is why it isn't tested. I have to implement these methods as they are needed to satisfy the Listener interface. I can add it in a test, but it will purely cosmetic as we have no need for this method. The same applies for the other cases raised

return l.listener.RandomNodes()
}

func (l *listenerWrapper) Ping(node *enode.Node) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not tested.

return l.listener.Ping(node)
}

func (l *listenerWrapper) RequestENR(node *enode.Node) (*enode.Node, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not tested.

Copy link
Copy Markdown
Contributor

@nalepae nalepae left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two issues in the design:

  1. If the outbound count is progressing, but if after 5 minutes the threshold is not reached, then the reboot will be triggered.

Example with outBoundThreshold = 10:

  • Minute 1: outBoundCount = 3
  • Minute 2: outBoundCount = 5
  • Minute 3: outBoundCount = 6
  • Minute 4: outBoundCount = 8
  • Minute 5: outBoundCount = 9

Then the reboot will occur, while the outBoundCount is always progressing in the good direction.

  1. The thresholdCount in never decreased.
    Example with outBoundThreshold = 10:
  • Minute 1: outBoundCount = 3, thresholdCount = 1
  • Minute 2: outBoundCount = 5, thresholdCount = 2
  • Minute 3: outBoundCount = 8, thresholdCount = 3
  • Minute 4: outBoundCount = 9, thresholdCount = 4
  • Minute 5: outBoundCount = 13, thresholdCount = 4 (no increase of thresholdCount)
  • Minute 6: outBoundCount = 15, thresholdCount = 4 (no increase of thresholdCount)
    ...
  • Minute 60: outBoundCount = 15, thresholdCount = 4 (no increase of thresholdCount)
  • Minute 61: outBoundCount = 9, thresholdCount = 5 ==> Immediate reboot, even if for 50 minutes the outBoundThreshold was high enough.

@nisdas
Copy link
Copy Markdown
Contributor Author

nisdas commented Oct 9, 2024

I see two issues in the design:

  1. If the outbound count is progressing, but if after 5 minutes the threshold is not reached, then the reboot will be triggered.

Example with outBoundThreshold = 10:

  • Minute 1: outBoundCount = 3
  • Minute 2: outBoundCount = 5
  • Minute 3: outBoundCount = 6
  • Minute 4: outBoundCount = 8
  • Minute 5: outBoundCount = 9

Then the reboot will occur, while the outBoundCount is always progressing in the good direction.

  1. The thresholdCount in never decreased.
    Example with outBoundThreshold = 10:
  • Minute 1: outBoundCount = 3, thresholdCount = 1
  • Minute 2: outBoundCount = 5, thresholdCount = 2
  • Minute 3: outBoundCount = 8, thresholdCount = 3
  • Minute 4: outBoundCount = 9, thresholdCount = 4
  • Minute 5: outBoundCount = 13, thresholdCount = 4 (no increase of thresholdCount)
  • Minute 6: outBoundCount = 15, thresholdCount = 4 (no increase of thresholdCount)
    ...
  • Minute 60: outBoundCount = 15, thresholdCount = 4 (no increase of thresholdCount)
  • Minute 61: outBoundCount = 9, thresholdCount = 5 ==> Immediate reboot, even if for 50 minutes the outBoundThreshold was high enough.

I fixed the 2nd point brought up by resetting it if we cross the threshold, for the first one I the reason I have done it so is so that we can quickly reboot a listener which is failing a connectivity check. The other option is that we track the previous count and reset it if it increased, an issue with doing it this way is that the count is still below our threshold. So you might simply be delaying the inevitable reboot by resetting it. Also if the count never reaches the threshold, doing it this way would be a suboptimal outcome.

// Reboot listener if connectivity drops
if thresholdCount > 5 {
log.Warnf("Rebooting discovery listener, reached threshold. The current outbound connection count is %d", len(s.peers.OutboundConnected()))
log.WithField("Outbound Connection Count", len(s.peers.OutboundConnected())).Warn("Rebooting discovery listener, reached threshold.")
Copy link
Copy Markdown
Contributor

@nalepae nalepae Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the codebase, our fields are camelCased: outboundConnectionCount.


if peerInfo == nil {
if !s.isBelowOutboundPeerThreshold() {
// Reset counter if we are below the threshold count
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are below ==> if we are beyond?

nalepae
nalepae previously approved these changes Oct 9, 2024
@nisdas nisdas added this pull request to the merge queue Oct 10, 2024
Merged via the queue into develop with commit 2c981d5 Oct 10, 2024
@nisdas nisdas deleted the rebootListener branch October 10, 2024 08:45
@honcho26
Copy link
Copy Markdown

Seems like a known issue with a known fix. Can I get the fix on a branch/beta release?

@nisdas
Copy link
Copy Markdown
Contributor Author

nisdas commented Dec 12, 2024

@honcho26 You can try this rc for now if you are up for it:
https://github.com/prysmaticlabs/prysm/releases/tag/v5.1.3-rc.0

@honcho26
Copy link
Copy Markdown

I'd be happy to give it a try. What do I need to add to my command line to set the outbound lower threshold?

@nisdas
Copy link
Copy Markdown
Contributor Author

nisdas commented Dec 13, 2024

You can try running with this flag:

--enable-discovery-reboot

@honcho26
Copy link
Copy Markdown

@nisdas, I tried out the release candidate with that flag, and it shows that it's enabled when Prysm starts up, but my outbound peers still go to zero and the same behavior occurs....missed attestations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

networking windows Anything related only to Windows OS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Discovery on Windows is Broken With Clock Drifts

4 participants