| } | ||
|
|
||
| func (s *Service) isBelowThreshold() bool { | ||
| maxPeers := int(s.cfg.MaxPeers) |
Here we cast uint to int.
Maybe it would be safer to do the opposite?
The main reason I did it this way is that some of the other methods we call return int instead of uint, so we would have to cast those to uint too in this particular case. Since this is a user-defined value, I think it is fine to expect it to be a manageable value.
| return l.listener.Lookup(id) | ||
| } | ||
|
|
||
| func (l *listenerWrapper) Resolve(node *enode.Node) *enode.Node { |
So this particular method isn't used by us anywhere at all, which is why it isn't tested. I have to implement these methods because they are needed to satisfy the Listener interface. I can add it to a test, but it will be purely cosmetic as we have no need for this method. The same applies to the other cases raised.
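A minimal sketch of the wrapper pattern being discussed, under stated assumptions: the `Listener` interface here is a trimmed-down, hypothetical stand-in (the real one has more methods and uses `*enode.Node`), and the `factory` field, `reboot` method, and mutex are illustrative choices, not necessarily what the PR does.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Listener is a hypothetical, reduced stand-in for the discovery interface.
type Listener interface {
	Ping(node string) error
	Close()
}

// fakeListener is a toy implementation used only for this sketch.
type fakeListener struct{ closed bool }

func (f *fakeListener) Ping(node string) error {
	if f.closed {
		return errors.New("listener closed")
	}
	return nil
}

func (f *fakeListener) Close() { f.closed = true }

// listenerWrapper delegates every call to the current listener and can
// swap it for a fresh one. The lock is an assumption of this sketch.
type listenerWrapper struct {
	mu       sync.RWMutex
	listener Listener
	factory  func() (Listener, error)
}

// Compile-time check that the wrapper satisfies the interface, which is
// why even unused methods must be implemented.
var _ Listener = (*listenerWrapper)(nil)

func (l *listenerWrapper) Ping(node string) error {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.listener.Ping(node)
}

func (l *listenerWrapper) Close() {
	l.mu.RLock()
	defer l.mu.RUnlock()
	l.listener.Close()
}

// reboot builds a new listener first, then closes and replaces the old one.
func (l *listenerWrapper) reboot() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	fresh, err := l.factory()
	if err != nil {
		return err
	}
	l.listener.Close()
	l.listener = fresh
	return nil
}

func main() {
	w := &listenerWrapper{
		listener: &fakeListener{},
		factory:  func() (Listener, error) { return &fakeListener{}, nil },
	}
	fmt.Println(w.Ping("node-a"), w.reboot(), w.Ping("node-a"))
}
```

Because each method is a one-line delegation, a test for an unused method such as `Resolve` would mostly exercise the delegation itself.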
| return l.listener.RandomNodes() | ||
| } | ||
|
|
||
| func (l *listenerWrapper) Ping(node *enode.Node) error { |
| return l.listener.Ping(node) | ||
| } | ||
|
|
||
| func (l *listenerWrapper) RequestENR(node *enode.Node) (*enode.Node, error) { |
I see two issues in the design:
- If the outbound count is progressing but the threshold is still not reached after 5 minutes, the reboot will be triggered.
  Example with outBoundThreshold = 10:
  - Minute 1: outBoundCount = 3
  - Minute 2: outBoundCount = 5
  - Minute 3: outBoundCount = 6
  - Minute 4: outBoundCount = 8
  - Minute 5: outBoundCount = 9
  Then the reboot will occur, even though outBoundCount is steadily moving in the right direction.
- thresholdCount is never decreased.
  Example with outBoundThreshold = 10:
  - Minute 1: outBoundCount = 3, thresholdCount = 1
  - Minute 2: outBoundCount = 5, thresholdCount = 2
  - Minute 3: outBoundCount = 8, thresholdCount = 3
  - Minute 4: outBoundCount = 9, thresholdCount = 4
  - Minute 5: outBoundCount = 13, thresholdCount = 4 (no increase of thresholdCount)
  - Minute 6: outBoundCount = 15, thresholdCount = 4 (no increase of thresholdCount)
  - ...
  - Minute 60: outBoundCount = 15, thresholdCount = 4 (no increase of thresholdCount)
  - Minute 61: outBoundCount = 9, thresholdCount = 5 ==> Immediate reboot, even though outBoundCount was high enough for more than 50 minutes.
I fixed the second point by resetting the counter once we cross the threshold. For the first one, the reason I did it this way is so that we can quickly reboot a listener that is failing a connectivity check. The other option is to track the previous count and reset the counter if it increased; an issue with doing it this way is that the |
beacon-chain/p2p/discovery.go
Outdated
| // Reboot listener if connectivity drops | ||
| if thresholdCount > 5 { | ||
| log.Warnf("Rebooting discovery listener, reached threshold. The current outbound connection count is %d", len(s.peers.OutboundConnected())) | ||
| log.WithField("Outbound Connection Count", len(s.peers.OutboundConnected())).Warn("Rebooting discovery listener, reached threshold.") |
In the codebase, our fields are camelCased: outboundConnectionCount.
beacon-chain/p2p/discovery.go
Outdated
|
|
||
| if peerInfo == nil { | ||
| if !s.isBelowOutboundPeerThreshold() { | ||
| // Reset counter if we are below the threshold count |
if we are below ==> if we are beyond?
|
Seems like a known issue with a known fix. Can I get the fix on a branch/beta release? |
|
@honcho26 You can try this rc for now if you are up for it: |
|
I'd be happy to give it a try. What do I need to add to my command line to set the outbound lower threshold? |
|
You can try running with this flag: |
|
@nisdas, I tried out the release candidate with that flag, and it shows as enabled when Prysm starts up, but my outbound peers still drop to zero and the same behavior occurs: missed attestations. |
What type of PR is this?
Feature
What does this PR do? Why is it needed?
Allows users to enable the discovery rebooting feature in the event of connectivity drops. In particular, users running on Windows have reported constant peer drops. This PR adds an automatic discovery rebooter which binds a new listener in that event. This is gated behind a feature flag so only affected users can run with it.
Which issue(s) does this PR fix?
Fixes #8144 #13936
Other notes for review
Acknowledgements