Versi high number of PeerDisconnect when scaling up number of validators and parachains #1263
Comments
Is it possible to get
Will do.
serban300 pushed 12 commits to serban300/polkadot-sdk that referenced this issue between Apr 8 and Apr 10, 2024.
Stale issue.
During my testing of #1191, I observed a high number of PeerDisconnects when running ~500 validators and ~100 parachains.
PeerDisconnects are very bad: when a high number of them happen, the network slows down and gets flooded with storms of unnecessary messages. For example, in approval-distribution, on reconnect all peers will try to send their known messages to the reconnecting peer: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/network/approval-distribution/src/lib.rs#L383 which amounts to thousands of messages all going to the same node from multiple sources.
Things I investigated/noticed.
But this is definitely not the root cause, because I disabled this reputation update and the disconnects kept reproducing.
Once the disconnects start happening, we enter a loop: on reconnect we send all the knowledge we have to the peer, the peer complains that we are sending it duplicates, and that creates more disconnects.
To exclude 2 & 3, I disabled the behaviour in 53f8556 and started logging errors when BANNED_THRESHOLD is passed in 5e004e1.
With 4) I ran 380 validators & 90 parachains and continuously saw a lot of disconnects here: https://grafana.teleport.parity.io/goto/G7di2ZzSR?orgId=1, with almost 0 of them caused by reputation changes.
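The reconnect loop suspected above (full knowledge replayed on reconnect, duplicates penalised, the penalty leading to another disconnect) can be illustrated with a toy model. This is only a sketch under stated assumptions: `DUPLICATE_PENALTY`, `BAN_THRESHOLD`, the reputation reset on reconnect and all function names are hypothetical and are not taken from polkadot-sdk. The measurements above suggest this loop is not the main source of the disconnects.

```rust
use std::collections::HashSet;

// Hypothetical values for illustration only; NOT the constants used by polkadot-sdk.
const DUPLICATE_PENALTY: i32 = -10;
const BAN_THRESHOLD: i32 = -100;

/// Result of one sender replaying everything it knows to a peer after a reconnect.
#[derive(Debug, PartialEq)]
struct ReplayOutcome {
    delivered_new: usize,
    duplicates: usize,
    disconnected: bool,
}

/// Replays `sender_known` to the receiver. Each message the receiver already
/// has costs the sender `DUPLICATE_PENALTY`; the receiver drops the sender once
/// the reputation reaches `BAN_THRESHOLD`.
fn replay_on_reconnect(
    sender_known: &[u64],
    receiver_known: &mut HashSet<u64>,
    reputation: &mut i32,
) -> ReplayOutcome {
    let mut outcome = ReplayOutcome { delivered_new: 0, duplicates: 0, disconnected: false };
    for msg in sender_known {
        if receiver_known.insert(*msg) {
            outcome.delivered_new += 1;
        } else {
            outcome.duplicates += 1;
            *reputation = reputation.saturating_add(DUPLICATE_PENALTY);
            if *reputation <= BAN_THRESHOLD {
                outcome.disconnected = true;
                break;
            }
        }
    }
    outcome
}

/// Counts how many of `reconnects` consecutive reconnects end in a disconnect.
/// Assumption: the sender's reputation starts from zero on every reconnect.
fn disconnects_over_reconnects(
    sender_known: &[u64],
    receiver_known: &mut HashSet<u64>,
    reconnects: usize,
) -> usize {
    let mut disconnects = 0;
    for _ in 0..reconnects {
        let mut reputation = 0;
        if replay_on_reconnect(sender_known, receiver_known, &mut reputation).disconnected {
            disconnects += 1;
        }
    }
    disconnects
}
```

In this model the first replay to an empty receiver is harmless, but every later replay of the same knowledge consists only of duplicates and ends in a disconnect, which is the self-sustaining behaviour described above.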
With 10 parachains, I increased the number of validators in steps (60, 160, 260 .. 380). The disconnects stayed within reasonable bounds until we got to 260 validators, but after that we seem to always have 4-5 disconnects per minute per node, see here https://grafana.teleport.parity.io/goto/iOrfTZkIg?orgId=1
Once I increased the number of parachains to 90, disconnects spiked even more, see here https://grafana.teleport.parity.io/goto/13V_oZzSg?orgId=1
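For readers who want to reproduce a "disconnects per minute" figure from raw counter values, a minimal sketch follows. It assumes the metric is a monotonically increasing counter sampled at two points in time. The `Sample` type, its field names and the function name are hypothetical, and the query the dashboards actually use is not shown in this issue.

```rust
/// One reading of a cumulative disconnect counter (hypothetical shape).
#[derive(Clone, Copy, Debug)]
struct Sample {
    timestamp_secs: u64,
    disconnects_total: u64,
}

/// Average disconnects per minute between two samples.
/// Returns `None` if no time elapsed, if the samples are out of order,
/// or if the counter went backwards (for example after a node restart).
fn disconnects_per_minute(first: Sample, last: Sample) -> Option<f64> {
    let elapsed = last.timestamp_secs.checked_sub(first.timestamp_secs)?;
    if elapsed == 0 {
        return None;
    }
    let delta = last.disconnects_total.checked_sub(first.disconnects_total)?;
    Some(delta as f64 * 60.0 / elapsed as f64)
}
```

As an illustration with made-up numbers (not measured data), a counter that grows by 45 over 600 seconds gives 4.5 disconnects per minute.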
Note! The dashboards measure only the disconnects for protocol="validation/2", which is the protocol the validators use to talk to each other.
Relevant timeline and dashboards: