Reboot Discovery Listener by nisdas · Pull Request #14487 · OffchainLabs/prysm

nisdas · 2024-09-27T09:08:13Z

What type of PR is this?

Feature

What does this PR do? Why is it needed?

Allows users to enable the discovery rebooting feature in the event of connectivity drops. In specifically users running on windows in particular, constant peer drops have been reported. This PR adds in an automatic discovery rebooter which binds a new listener in that event. This is gated behind a feature flag so only affected users can run with it.

Which issues(s) does this PR fix?

Fixes #8144 #13936

Other notes for review

Acknowledgements

I have read CONTRIBUTING.md.
I have made an appropriate entry to CHANGELOG.md.
I have added a description to this PR with sufficient context for reviewers to understand this PR.

into rebootListener

beacon-chain/p2p/discovery.go

nalepae · 2024-10-08T21:15:10Z

beacon-chain/p2p/discovery.go

 }

+func (s *Service) isBelowThreshold() bool {
+	maxPeers := int(s.cfg.MaxPeers)


Here we cast uint to int.
Maybe it would be safer to do the opposite?

The main reason I did it this way is because some of the other methods that we call all return int instead of uint. We would have to cast those to uint too in this particular case. Since this is a user defined value I think it is fine to expect it to be a manageable value.

beacon-chain/p2p/discovery.go

nalepae · 2024-10-08T21:30:10Z

beacon-chain/p2p/discovery.go

+	return l.listener.Lookup(id)
+}
+
+func (l *listenerWrapper) Resolve(node *enode.Node) *enode.Node {


Not tested.

So this particular method isn't used by us anywhere at all which is why it isn't tested. I have to implement these methods as they are needed to satisfy the Listener interface. I can add it in a test, but it will purely cosmetic as we have no need for this method. The same applies for the other cases raised

nalepae · 2024-10-08T21:30:15Z

beacon-chain/p2p/discovery.go

+	return l.listener.RandomNodes()
+}
+
+func (l *listenerWrapper) Ping(node *enode.Node) error {


Not tested.

nalepae · 2024-10-08T21:30:20Z

beacon-chain/p2p/discovery.go

+	return l.listener.Ping(node)
+}
+
+func (l *listenerWrapper) RequestENR(node *enode.Node) (*enode.Node, error) {


Not tested.

nalepae

I see two issues in the design:

If the outbound count is progressing, but if after 5 minutes the threshold is not reached, then the reboot will be triggered.

Example with outBoundThreshold = 10:

Minute 1: outBoundCount = 3
Minute 2: outBoundCount = 5
Minute 3: outBoundCount = 6
Minute 4: outBoundCount = 8
Minute 5: outBoundCount = 9

Then the reboot will occur, while the outBoundCount is always progressing in the good direction.

The thresholdCount in never decreased.
Example with outBoundThreshold = 10:

Minute 1: outBoundCount = 3, thresholdCount = 1
Minute 2: outBoundCount = 5, thresholdCount = 2
Minute 3: outBoundCount = 8, thresholdCount = 3
Minute 4: outBoundCount = 9, thresholdCount = 4
Minute 5: outBoundCount = 13, thresholdCount = 4 (no increase of thresholdCount)
Minute 6: outBoundCount = 15, thresholdCount = 4 (no increase of thresholdCount)
...
Minute 60: outBoundCount = 15, thresholdCount = 4 (no increase of thresholdCount)
Minute 61: outBoundCount = 9, thresholdCount = 5 ==> Immediate reboot, even if for 50 minutes the outBoundThreshold was high enough.

beacon-chain/p2p/discovery.go

into rebootListener

nisdas · 2024-10-09T07:07:56Z

I see two issues in the design:

If the outbound count is progressing, but if after 5 minutes the threshold is not reached, then the reboot will be triggered.

Example with outBoundThreshold = 10:

Minute 1: outBoundCount = 3

Minute 2: outBoundCount = 5

Minute 3: outBoundCount = 6

Minute 4: outBoundCount = 8

Minute 5: outBoundCount = 9

Then the reboot will occur, while the outBoundCount is always progressing in the good direction.

The thresholdCount in never decreased.
Example with outBoundThreshold = 10:

Minute 1: outBoundCount = 3, thresholdCount = 1

Minute 2: outBoundCount = 5, thresholdCount = 2

Minute 3: outBoundCount = 8, thresholdCount = 3

Minute 4: outBoundCount = 9, thresholdCount = 4

Minute 5: outBoundCount = 13, thresholdCount = 4 (no increase of thresholdCount)

Minute 6: outBoundCount = 15, thresholdCount = 4 (no increase of thresholdCount)
...

Minute 60: outBoundCount = 15, thresholdCount = 4 (no increase of thresholdCount)

Minute 61: outBoundCount = 9, thresholdCount = 5 ==> Immediate reboot, even if for 50 minutes the outBoundThreshold was high enough.

I fixed the 2nd point brought up by resetting it if we cross the threshold, for the first one I the reason I have done it so is so that we can quickly reboot a listener which is failing a connectivity check. The other option is that we track the previous count and reset it if it increased, an issue with doing it this way is that the count is still below our threshold. So you might simply be delaying the inevitable reboot by resetting it. Also if the count never reaches the threshold, doing it this way would be a suboptimal outcome.

nalepae · 2024-10-09T07:48:51Z

beacon-chain/p2p/discovery.go

 			// Reboot listener if connectivity drops
 			if thresholdCount > 5 {
-				log.Warnf("Rebooting discovery listener, reached threshold. The current outbound connection count is %d", len(s.peers.OutboundConnected()))
+				log.WithField("Outbound Connection Count", len(s.peers.OutboundConnected())).Warn("Rebooting discovery listener, reached threshold.")


In the codebase, our fields are camelCased: outboundConnectionCount.

nalepae · 2024-10-09T07:55:49Z

beacon-chain/p2p/discovery.go

-
-			if peerInfo == nil {
+			if !s.isBelowOutboundPeerThreshold() {
+				// Reset counter if we are below the threshold count


if we are below ==> if we are beyond?

honcho26 · 2024-12-12T04:41:48Z

Seems like a known issue with a known fix. Can I get the fix on a branch/beta release?

nisdas · 2024-12-12T04:46:02Z

@honcho26 You can try this rc for now if you are up for it:
https://github.com/prysmaticlabs/prysm/releases/tag/v5.1.3-rc.0

honcho26 · 2024-12-12T19:06:41Z

I'd be happy to give it a try. What do I need to add to my command line to set the outbound lower threshold?

nisdas · 2024-12-13T03:38:36Z

You can try running with this flag:

--enable-discovery-reboot

honcho26 · 2024-12-15T00:47:37Z

@nisdas, I tried out the release candidate with that flag, and it shows that it's enabled when Prysm starts up, but my outbound peers still go to zero and the same behavior occurs....missed attestations.

nisdas added 6 commits July 2, 2024 20:15

Add Current Changes To Routine

b107fad

Merge branch 'develop' of https://github.com/prysmaticlabs/geth-sharding

b47b987

into rebootListener

Add In New Test

40894d7

Add Feature Flag

9a0f853

Merge branch 'develop' of https://github.com/prysmaticlabs/geth-sharding

90f9282

into rebootListener

Add Discovery Rebooter feature

1738eeb

nisdas marked this pull request as ready for review October 8, 2024 09:31

nisdas requested a review from a team as a code owner October 8, 2024 09:31

nisdas requested review from james-prysm, kasey and nalepae October 8, 2024 09:31

nisdas added Ready For Review windows Anything related only to Windows OS networking labels Oct 8, 2024

nalepae reviewed Oct 8, 2024

View reviewed changes