Subsystem benchmarks: determine node CPU usage for 1000 validators and 200 fully occupied cores #5035

sandreim opened this issue Jul 16, 2024 · 11 comments

@sandreim
Contributor

We should repeat the testing done in #4126 (comment) to get some numbers with 1k validators and 200 cores. This would help to better inform the decision and outlook of raising HW specs: https://forum.polkadot.network/t/rfc-increasing-recommended-minimum-core-count-for-reference-hardware/8156

@sandreim sandreim added the T10-tests This PR/Issue is related to tests. label Jul 16, 2024
@alexggh alexggh self-assigned this Jul 16, 2024
@alexggh
Contributor

alexggh commented Jul 17, 2024

Benchmark specification at #5043 and run at https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6716316.

The specification uses 1000 validators with 200 cores fully occupied. We assumed the network is in a state where there are 3 no-shows per candidate; the per-subsystem usage in that situation is:

CPU usage, seconds           per block
approval-distribution        2.1300
approval-voting              3.0523
availability-distribution    0.0347
availability-store           0.1906
bitfield-distribution        0.0484
# For 10 candidates per block: 7 triggered assignments + 3 covering the no-shows.
availability-recovery        1.4412
statement-distribution       0.0826

Overall, the CPU usage per block for these subsystems is around 7s. We also need to take into account other unaccounted subsystems: we know the networking thread is close to 100% usage, which adds another 6s, and I would add another 5s for all the remaining unaccounted subsystems. That adds up to 18s of CPU usage per block just for the consensus-related work.

Assuming that, after https://forum.polkadot.network/t/rfc-increasing-recommended-minimum-core-count-for-reference-hardware/8156, every validator has already migrated to HW with 8 CPU cores, the total available execution time per 6s block is 8 * 6 = 48s.

That leaves us with 30s (48 - 18) available for PVF execution. Assuming each validator won't have to validate more than 10 candidates per block, that gives us room for an average PVF execution time of around 3s.
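
To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch (Rust, illustrative only; the 6s networking and 5s unaccounted figures are the estimates above, not measurements):

```rust
// Illustrative back-of-the-envelope check of the CPU budget above.
// Assumptions (from this comment, not additional measurements): 6s relay-chain
// block time, 8 physical cores, ~6s for the networking thread and ~5s for all
// unaccounted subsystems.
fn main() {
    // Measured per-subsystem CPU seconds per block (table above).
    let measured = [2.1300, 3.0523, 0.0347, 0.1906, 0.0484, 1.4412, 0.0826];
    let subsystems: f64 = measured.iter().sum(); // ~6.98s, i.e. "around 7s"

    let networking = 6.0; // networking thread close to 100% of one core
    let unaccounted = 5.0; // rough allowance for everything else
    let consensus = subsystems + networking + unaccounted; // ~18s

    let cores = 8.0;
    let block_time = 6.0;
    let total_budget = cores * block_time; // 48s of CPU time per 6s block
    let pvf_budget = total_budget - consensus; // ~30s left for PVF execution

    // 10 candidates per block: 7 triggered assignments + 3 covering no-shows.
    let candidates_per_block = 10.0;
    println!(
        "average PVF execution budget: {:.1}s per candidate",
        pvf_budget / candidates_per_block // ~3s
    );
}
```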

In conclusion, with the data we have, I think increasing the validator HW spec from 4 cores to 8 cores should allow us to support 1k validators and 200 cores.

@sandreim
Contributor Author

Thanks @alexggh! More than 60% of a node's CPU used for PVF execution sounds amazing.

You assumed 1 full CPU being used by networking; however, with libp2p its usage was 250% in Versi tests. We don't have any numbers for litep2p yet. I think we need to do some benchmarking of the Substrate networking stack with a load similar to what we've used in this test.

@burdges

burdges commented Jul 17, 2024

approval-distribution 2.1300

Is this just because it's lots of messages?

approval-voting 3.0523

Is this verifying assignment VRFs, verifying approval votes, and managing the approvals database?

availability-distribution 0.0347

Is this just sending out the chunks?

availability-store 0.1906

What is here?

bitfield-distribution 0.0484
availability-recovery 1.4412

All erasure code work lies here, no?

We're already using some flavor of systematic chunks, or at least direct fetch here, so this mostly counts only recomputing the availability Merkle root. If many nodes go offline then we'd be doing reconstructions here, which doubles or triples this, right?

statement-distribution 0.0826

This is all in backing, right?

@alexggh
Contributor

alexggh commented Jul 18, 2024

approval-distribution 2.1300

Is this just because it's lots of messages?

Yes; the bookkeeping and the gossip logic scale up with the number of messages and the number of validators in the network.

approval-voting 3.0523

Is this verifying assignment VRFs, verifying approval votes, and managing the approvals database?

Yes.

availability-distribution 0.0347

Is this just sending out the chunks?

Yes.

availability-store 0.1906

What is here?

Storing the chunks

bitfield-distribution 0.0484
availability-recovery 1.4412

All erasure code work lies here, no?

Yes

We're already using some flavor of systematic chunks, or at least direct fetch here, so this mostly counts only recomputing the availability Merkle root. If many nodes go offline then we'd be doing reconstructions here, which doubles or triples this, right?

I have to double-check, but I don't think systematic chunks are enabled in these benchmarks.

statement-distribution 0.0826

This is all in backing, right?

Yes.

@AndreiEres
Contributor

We're already using some flavor of systematic chunks, or at least direct fetch here, so this mostly counts only recomputing the availability Merkle root. If many nodes go offline then we'd be doing reconstructions here, which doubles or triples this, right?

I have to double-check, but I don't think systematic chunks are enabled in these benchmarks.

The test used the BackersFirstAlways recovery strategy. I assume there is little chance of chunk recovery because of the deterministic nature of the benchmarks.

@alexggh
Contributor

alexggh commented Sep 9, 2024

Related to this I've been running a small versi experiment.

Running polkadot in a cpu restrictive environment

In order to determine the stability, readiness and requirements of Polkadot for 1k validators and 200 cores, I ran a few experiments where the node hardware requirements were scaled down by factors of 16 and 8, and pushed the network to the limit.

Translating dimensions by the scale factor

Dimension                          Real       Scaled down by 16   Scaled down by 8
CPU allocation                     8 cores    0.5 cores           1 core
Number of validators               1000       64                  128
Number of full cores/parachains    200        13                  25
PoV execution time                 ~2s        ~2s                 ~2s
PoV size                           ~4 MiB     ~2 MiB              ~2 MiB

The goal here was that by scaling every dimension down we would gather useful data about the behaviour of the real network, because validators would be CPU starved/throttled, so everything would take longer to execute and be delayed.

!!Limitations!! This would probably not reflect very well the load created by things that scale with the number of validators and parachains, like assignments and approvals or bitfield processing.
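
For reference, a small sketch (Rust, purely illustrative) of how the scaled-down dimensions in the table above relate to the real targets; the actual test values were rounded to convenient numbers (64/128 validators, 13 parachains) rather than exact divisions:

```rust
// Purely illustrative: the scaled-down dimensions are roughly the real targets
// divided by the scale factor; the actual tests rounded to convenient values.
fn main() {
    let (real_cores, real_validators, real_parachains) = (8.0_f64, 1000.0, 200.0);
    for factor in [16.0, 8.0] {
        println!(
            "factor {factor}: {} cores, ~{} validators, ~{} parachains",
            real_cores / factor,      // 0.5 cores at 16, 1 core at 8
            real_validators / factor, // 62.5 (64 used) at 16, 125 (128 used) at 8
            real_parachains / factor, // 12.5 (13 used) at 16, 25 at 8
        );
    }
}
```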

Results

I ended up running three experiments:

  1. Everything scaled down by 16.
  2. Everything scaled down by 8
  3. Everything scaled down by 8, with litep2p

Findings / Confirming what we already assumed or knew

  • During the whole test the network was stable, block production and finality were stable.
  • The biggest CPU hogs, by an order of magnitude, are the libp2p-node and erasure-node tasks.
  • When scaled by a factor of 16 (0.5 CPU), only 8 parachains could produce blocks at around ~6s; adding more parachains after that slowed parachain block production because availability started getting slower, so with the target of 13 parachains, parachain blocks were produced at around a 9s rate.
    • Increasing the CPU allocation from 0.5 to 1 automatically brought parachains back to producing blocks at a 6s rate.
  • Adding more validators to the network automatically reduces the number of parachain blocks each validator has to check for approval.
  • When scaled by a factor of 8 (1 CPU) we were able to support 25 parachains producing blocks at a 6s rate.
    • Total time to recover a PoV was between around 1s and 5s. That means that in this case, with all parachains producing really big PoVs, we were spending more time on PoV recovery than on PoV execution. This raises the question of whether we need to back-pressure when nodes start to fall behind on PoV recovery, although I think this is just an artefact of this synthetic scenario.
    • Switching the same network to litep2p made parachains slow down from a 6s rate to around 11s; availability seems to slow down because of ChannelClogged errors. It remains to be investigated whether this is an artefact of the synthetic scenario or something else.

@sandreim
Contributor Author

sandreim commented Sep 9, 2024

Nice work Alex!

Based on these numbers we should be fine with 1kV and 200 cores, but that also depends on how we deal with the high amount of gossip messages and chunk requests. Subsystem benchmarking should provide a good estimate for this.

What are you planning to do next ?

  • Switching the same network to litep2p made parachains slow down from a 6s rate to around 11s; availability seems to slow down because of ChannelClogged errors. It remains to be investigated whether this is an artefact of the synthetic scenario or something else.

libp2p might be using some unbounded or higher-limit channels; litep2p has them bounded at 4096.

CC @paritytech/networking

@alexggh
Contributor

alexggh commented Sep 9, 2024

Based on these numbers we should be fine with 1kV and 200 cores, but that also depends on how we deal with the high amount of gossip messages and chunk requests. Subsystem benchmarking should provide a good estimate for this.
What are you planning to do next ?

That's already estimated here: #5035 (comment), and we concluded that with 8 cores we should have enough CPU for it.

So once all the optimisations we have in progress land on Kusama, we should be able to ramp up the number of validators doing parachain consensus work there gradually, so we can catch any deterioration and reverse course if our estimations prove wrong.

@burdges

burdges commented Sep 10, 2024

The biggest CPU hogs, by an order of magnitude, are the libp2p-node and erasure-node tasks.

An order of magnitude? Wow. I suppose erasure-node acts like a fixed cost here, due to how backers work. And libp2p-node might have many fixed costs too. That'd be roughly 16x or 8x.

In production, what hogs the resources? Someone claimed signature verification recently.

@alexggh
Contributor

alexggh commented Sep 11, 2024

In production, what hogs the resources? Someone claimed signature verification recently.

These are the usages of a real Kusama validator:
[Screenshot: per-task CPU usage on a real Kusama validator]

And these are the usages that I got in the restricted environment:
[Screenshot: per-task CPU usage in the CPU-restricted environment]

The PoVs on Kusama are really small compared with those in my tests, which is why I think erasure-task does not show up as using a lot of CPU on Kusama.

Someone claimed signature verification recently.

You mean assignments and approvals? Yeah, that was a problem as well, especially if you have a lot of no-shows, but we did a lot of optimisations on that path as part of this thread: #4849, so it shouldn't be a problem anymore.

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/litep2p-network-backend-updates/9973/1
