Subsystem benchmarks: determine node CPU usage for 1000 validators and 200 fully occupied cores #5035

sandreim opened this issue Jul 16, 2024 · 11 comments

@sandreim
Contributor

We should repeat the testing done in #4126 (comment) to get some numbers with 1k validators and 200 cores. This would help to better inform the decision and outlook of raising HW specs: https://forum.polkadot.network/t/rfc-increasing-recommended-minimum-core-count-for-reference-hardware/8156

@sandreim sandreim added the T10-tests This PR/Issue is related to tests. label Jul 16, 2024
@alexggh alexggh self-assigned this Jul 16, 2024
@alexggh
Contributor

alexggh commented Jul 17, 2024

Benchmark specification at #5043 and run at https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6716316.

The specification uses 1000 validators with 200 cores fully occupied. We assumed the network is in a state where there are 3 no-shows per candidate; the per-subsystem usage in that situation is:

CPU usage, seconds           per block
approval-distribution        2.1300
approval-voting              3.0523
availability-distribution    0.0347
availability-store           0.1906
bitfield-distribution        0.0484
# For 10 candidates per block: 7 triggered assignments + 3 covering the no-shows.
availability-recovery        1.4412
statement-distribution       0.0826

Overall, the CPU usage per block for these subsystems is around 7s. We also need to take into account other unaccounted subsystems: we know the networking thread is close to 100% usage, which adds another 6s, and I would add another 5s for all the remaining unaccounted subsystems. That adds up to 18s of CPU usage per block just for the consensus-related work.

Assuming that, after https://forum.polkadot.network/t/rfc-increasing-recommended-minimum-core-count-for-reference-hardware/8156, every validator has already migrated to HW with 8 CPU cores, the total available execution time per 6s block is 8 * 6 = 48s.

That leaves us with 30s (48 - 18) available for PVF execution. Assuming each validator won't have to validate more than 10 candidates per block, that gives us room for an average PVF execution time of around 3s.
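
To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch (Rust, illustrative only; the 6s networking and 5s unaccounted figures are the estimates above, not measurements):

```rust
// Illustrative back-of-the-envelope check of the CPU budget above.
// Assumptions (from this comment, not additional measurements): 6s relay-chain
// block time, 8 physical cores, ~6s for the networking thread and ~5s for all
// unaccounted subsystems.
fn main() {
    // Measured per-subsystem CPU seconds per block (table above).
    let measured = [2.1300, 3.0523, 0.0347, 0.1906, 0.0484, 1.4412, 0.0826];
    let subsystems: f64 = measured.iter().sum(); // ~6.98s, i.e. "around 7s"

    let networking = 6.0; // networking thread close to 100% of one core
    let unaccounted = 5.0; // rough allowance for everything else
    let consensus = subsystems + networking + unaccounted; // ~18s

    let cores = 8.0;
    let block_time = 6.0;
    let total_budget = cores * block_time; // 48s of CPU time per 6s block
    let pvf_budget = total_budget - consensus; // ~30s left for PVF execution

    // 10 candidates per block: 7 triggered assignments + 3 covering no-shows.
    let candidates_per_block = 10.0;
    println!(
        "average PVF execution budget: {:.1}s per candidate",
        pvf_budget / candidates_per_block // ~3s
    );
}
```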

In conclusion, with the data we have, I think increasing the validator HW spec from 4 cores to 8 cores should allow us to support 1k validators and 200 cores.

@sandreim
Contributor Author

Thanks @alexggh! More than 60% of a node's CPU used for PVF execution sounds amazing.

You assumed 1 full CPU being used by networking; however, with libp2p its usage was 250% in Versi tests. We don't have any numbers for litep2p yet. I think we need to do some benchmarking of the Substrate networking stack with a load similar to what we've used in this test.

@burdges

burdges commented Jul 17, 2024

approval-distribution 2.1300

Is this just because it's lots of messages?

approval-voting 3.0523

Is this verifying assignment VRFs, verifying approval votes, and managing the approvals database?

availability-distribution 0.0347

Is this just sending out the chunks?

availability-store 0.1906

What is here?

bitfield-distribution 0.0484
availability-recovery 1.4412

All erasure code work lies here, no?

We're already using some flavor of systematic chunks, or at least direct fetch here, so this mostly counts only recomputing the availability Merkle root. If many nodes go offline then we'd be doing reconstructions here, which doubles or triples this, right?

statement-distribution 0.0826

This is all in backing, right?

@alexggh
Contributor

alexggh commented Jul 18, 2024

approval-distribution 2.1300

Is this just because it's lots of messages?

Yes; the bookkeeping and the gossip logic scale up with the number of messages and the number of validators in the network.

approval-voting 3.0523

Is this verifying assignment VRFs, verifying approval votes, and managing the approvals database?

Yes.

availability-distribution 0.0347

Is this just sending out the chunks?

Yes.

availability-store 0.1906

What is here?

Storing the chunks

bitfield-distribution 0.0484
availability-recovery 1.4412

All erasure code work lies here, no?

Yes

We're already using some flavor of systematic chunks, or at least direct fetch here, so this mostly counts only recomputing the availability Merkle root. If many nodes go offline then we'd be doing reconstructions here, which doubles or triples this, right?

I have to double-check, but I don't think systematic chunks are enabled in these benchmarks.

statement-distribution 0.0826

This is all in backing, right?

Yes.

@AndreiEres
Contributor

We're already using some flavor of systematic chunks, or at least direct fetch here, so this mostly counts only recomputing the availability Merkle root. If many nodes go offline then we'd be doing reconstructions here, which doubles or triples this, right?

I have to double-check, but I don't think systematic chunks are enabled in these benchmarks.

The test used the BackersFirstAlways recovery strategy. I assume there is little chance of chunk recovery because of the deterministic nature of the benchmarks.

@alexggh
Contributor

alexggh commented Sep 9, 2024

Related to this I've been running a small versi experiment.

Running polkadot in a cpu restrictive environment

In order to determine the stability, readiness and requirements of Polkadot for 1k validators and 200 cores, I ran a few experiments where the node hardware requirements were scaled down by factors of 16 and 8, and pushed the network to the limit.

Translating dimensions by the scale factor

Dimension                          Real       Scaled down by 16   Scaled down by 8
CPU allocation                     8 cores    0.5 cores           1 core
Number of validators               1000       64                  128
Number of full cores/parachains    200        13                  25
PoV execution time                 ~2s        ~2s                 ~2s
PoV size                           ~4 MiB     ~2 MiB              ~2 MiB

The goal here was that by scaling every dimension down we would gather useful data about the behaviour of the real network, because validators would be CPU starved/throttled, so everything would take longer to execute and be delayed.

!!Limitations!! This would probably not reflect very well the load created by things that scale with the number of validators and parachains, like assignments and approvals or bitfield processing.
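
For reference, a small sketch (Rust, purely illustrative) of how the scaled-down dimensions in the table above relate to the real targets; the actual test values were rounded to convenient numbers (64/128 validators, 13 parachains) rather than exact divisions:

```rust
// Purely illustrative: the scaled-down dimensions are roughly the real targets
// divided by the scale factor; the actual tests rounded to convenient values.
fn main() {
    let (real_cores, real_validators, real_parachains) = (8.0_f64, 1000.0, 200.0);
    for factor in [16.0, 8.0] {
        println!(
            "factor {factor}: {} cores, ~{} validators, ~{} parachains",
            real_cores / factor,      // 0.5 cores at 16, 1 core at 8
            real_validators / factor, // 62.5 (64 used) at 16, 125 (128 used) at 8
            real_parachains / factor, // 12.5 (13 used) at 16, 25 at 8
        );
    }
}
```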

Results

I ended up running three experiments:

  1. Everything scaled down by 16.
  2. Everything scaled down by 8
  3. Everything scaled down by 8, with litep2p

Findings / Confirming what we already assumed or knew

  • During the whole test the network was stable, block production and finality were stable.
  • The biggest CPU hogs, by an order of magnitude, are the libp2p-node and erasure-node tasks.
  • When scaled by a factor of 16 (0.5 CPU), only 8 parachains could produce blocks at around ~6s; adding more parachains after that slowed parachain block production because availability started getting slower, so with the target of 13 parachains, parachain blocks were produced at around a 9s rate.
    • Increasing the CPU allocation from 0.5 to 1 automatically brought parachains back to producing blocks at a 6s rate.
  • Adding more validators to the network automatically reduces the number of parachain blocks each validator has to check for approval.
  • When scaled by a factor of 8 (1 CPU) we were able to support 25 parachains producing blocks at a 6s rate.
    • Total time to recover a PoV was between around 1s and 5s. That means that in this case, with all parachains producing really big PoVs, we were spending more time on PoV recovery than on PoV execution. This raises the question of whether we need to back-pressure when nodes start to fall behind on PoV recovery, although I think this is just an artefact of this synthetic scenario.
    • Switching the same network to litep2p made parachains slow down from a 6s rate to around 11s; availability seems to slow down because of ChannelClogged errors. It remains to be investigated whether this is an artefact of the synthetic scenario or something else.

@sandreim
Contributor Author

sandreim commented Sep 9, 2024

Nice work Alex!

Based on these numbers we should be fine with 1kV and 200 cores, but that also depends on how we deal with the high amount of gossip messages and chunk requests. Subsystem benchmarking should provide a good estimate for this.

What are you planning to do next ?

  • Switching the same network to litep2p made parachains slow down from a 6s rate to around 11s; availability seems to slow down because of ChannelClogged errors. It remains to be investigated whether this is an artefact of the synthetic scenario or something else.

libp2p might be using some unbounded or higher-limit channels; litep2p has them bounded at 4096.

CC @paritytech/networking

@alexggh
Contributor

alexggh commented Sep 9, 2024

Based on these numbers we should be fine with 1kV and 200 cores, but that also depends on how we deal with the high amount of gossip messages and chunk requests. Subsystem benchmarking should provide a good estimate for this.
What are you planning to do next ?

That's already estimated here: #5035 (comment), and we concluded that with 8 cores we should have enough CPU for it.

So once all the optimisations we have in progress land on Kusama, we should be able to ramp up the number of validators doing parachain consensus work there gradually, so we can catch any deterioration and reverse course if our estimations prove wrong.

@burdges

burdges commented Sep 10, 2024

The biggest CPU hogs, by an order of magnitude, are the libp2p-node and erasure-node tasks.

An order of magnitude? Wow. I suppose erasure-node acts like a fixed cost here, due to how backers work. And libp2p-node might have many fixed costs too. That'd be roughly 16x or 8x.

In production, what hogs the resources? Someone claimed signature verification recently.

@alexggh
Contributor

alexggh commented Sep 11, 2024

In production, what hogs the resources? Someone claimed signature verification recently.

These are the usages of a real Kusama validator:
[Screenshot: per-task CPU usage on a real Kusama validator]

And these are the usages that I got in the restricted environment:
[Screenshot: per-task CPU usage in the CPU-restricted environment]

The PoVs on Kusama are really small compared with those in my tests, which is why I think erasure-task does not show up as using a lot of CPU on Kusama.

Someone claimed signature verification recently.

You mean assignments and approvals? Yeah, that was a problem as well, especially if you have a lot of no-shows, but we did a lot of optimisations on that path as part of this thread: #4849, so it shouldn't be a problem anymore.

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/litep2p-network-backend-updates/9973/1
