
approval-distribution: process assignments and votes in parallel #732

Closed
sandreim opened this issue Jan 23, 2023 · 22 comments
Assignees
Labels
I4-refactor Code needs refactoring.

Comments

@sandreim
Contributor

sandreim commented Jan 23, 2023

Follow-up from some earlier scaling tests with paritytech/polkadot#6247. The initial attempts made it apparent that approval-distribution and approval-voting were bottlenecks, increasing approval-checking finality lag in our tests with 250 paravalidators and 50 parachains at the time.
That argument still holds true, but only at a much larger scale than what we were experimenting with at the time. The real issue was paritytech/polkadot#6400, which has since been solved.

Bumping up to 350+ paravalidators requires that we make approval-voting and approval-distribution do more work in the same amount of time. The first step is to implement the parallelization changes from paritytech/polkadot#6247, which would max out CPU usage in approval-voting; it currently sits at 80% due to the serial processing and waiting for completion of vote imports issued by approval-distribution. This will enable us to further improve how fast the subsystem churns through work by making approval-voting parallelise signature checks.
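As a rough illustration of that last step, here is a minimal std-only sketch of farming a batch of signature checks out to worker threads and joining all results before acting on them; `check_signature` is a hypothetical stand-in for the real sr25519 verification, not the actual subsystem code:

```rust
use std::thread;

// Hypothetical stand-in for the real sr25519 assignment/approval check.
fn check_signature(_candidate: u32) -> bool {
    true
}

fn main() {
    let batch: Vec<u32> = (0..64).collect();
    let workers = 4;
    let chunk = (batch.len() + workers - 1) / workers;

    // Scoped threads let the main loop split a batch across workers and
    // still collect every result before moving on, so the overall import
    // step stays ordered even though the checks run in parallel.
    let results: Vec<bool> = thread::scope(|s| {
        let handles: Vec<_> = batch
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&m| check_signature(m)).collect::<Vec<_>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });

    assert_eq!(results.len(), 64);
    assert!(results.iter().all(|&ok| ok));
    println!("verified {} signatures across {} workers", results.len(), workers);
}
```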

Important note: the implementation will make use of internal queueing of messages. We have to be careful to preserve backpressure, so these internal queues need to be bounded, for example by a fixed number of imports that can be pending at any given time.
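For the bounded-queue part, a `std::sync::mpsc::sync_channel` sketch shows the intended backpressure: the sender blocks once the fixed number of pending imports is reached (the `Import` type here is hypothetical, not the real subsystem message):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

const CAPACITY: usize = 4; // max imports pending at any given time

#[derive(Debug)]
enum Import {
    Assignment(u32), // hypothetical payloads
    Approval(u32),
}

fn main() {
    let (tx, rx) = sync_channel::<Import>(CAPACITY);

    // Worker drains the bounded queue; in the real subsystem this would
    // be the part doing the actual signature checks.
    let worker = thread::spawn(move || {
        let mut processed = 0;
        while let Ok(_import) = rx.recv() {
            processed += 1;
        }
        processed
    });

    for i in 0..100u32 {
        // `send` blocks whenever CAPACITY imports are already pending:
        // this is the backpressure towards the producer side.
        let msg = if i % 2 == 0 { Import::Assignment(i) } else { Import::Approval(i) };
        tx.send(msg).unwrap();
    }
    drop(tx); // close the channel so the worker exits

    assert_eq!(worker.join().unwrap(), 100);
    println!("all imports drained through a bounded queue");
}
```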

@eskimor
Member

eskimor commented Jan 23, 2023

I am still in favor of trying real hard to improve performance/reduce work somehow. Using multiple cores just for signature checking in approval voting ... maxing out multiple cores, even when all parachains are 100% idle? That does sound like something we should be able to improve.

@sandreim
Contributor Author

Yes, I created #729 and if I understand correctly it does exactly what you mean and helps a lot.

I think we should implement improvements at the protocol/cryptography level as well as architecturally. In perfect network conditions the changes suggested in #729 might be enough to scale to 500+ validators, but if we go higher, or upper tranches trigger, or disputes happen, then we clearly need a solution that can handle bursts of assignments and votes efficiently and scale beyond 1 CPU.

@rphmeier
Contributor

maxing out multiple cores, even when all parachains are 100% idle? That does sound like something we should be able to improve.

FWIW, we just have these constant overheads for every parachain block. The flip side is that they don't get worse when parachains get heavier.

@alexggh alexggh self-assigned this May 30, 2023
@alexggh
Contributor

alexggh commented Jun 12, 2023

Spent some time introspecting and profiling approval-voting to understand why we see the high ToF in approval-distribution. Here are my findings, in order of impact:

  1. We are processing a lot of assignments and votes, which come in bursts, and the crypto for checking them takes 200-300 microseconds (with rare occasions where it takes 2-3 ms). That caps the full capacity of approval-voting assignment/vote processing at about 3k checks per second.
    At peak performance that's how many assignments/approvals we are processing, so if the messages arrive in the approval-distribution queue at about the same time, just by the nature of processing them serially we are going to see higher than 1s ToF. There are also a lot of messages that contain more than one assignment, so some messages keep approval-distribution busy for a few milliseconds simply because processing all the assignments/votes in the message takes time, see process_incoming_peer_message. Besides this, our system doesn't process assignments/approvals at peak capacity because of the items below.

Reducing the number of assignments via paritytech/polkadot#6782 will definitely help with the number of messages we have to process.

  2. There are other operations that keep the approval-voting main loop busy, occasionally:
  • handle_actions(issue_approval): every 6 seconds we have around 6 messages of this operation, each taking 3-4 milliseconds to process, with all the useful time being spent signing the approval here.

  • ActiveLeaves takes about 20 ms every 6 seconds.

  • Occasionally (once or twice a second) approval-voting db flushes take 1-2 milliseconds.

While these operations happen relatively rarely, they do happen at peak time, so they might be just the straw that breaks the camel's back.

  3. Occasionally (every few seconds), there seem to be situations where approval-distribution waits on approval-voting for 3-4 ms, but approval-voting isn't doing any work; it is just stuck in the main select. My theory is that in those cases the subsystem future doesn't get executed because the system is busy executing other tasks. (Do you have any other theory?)

@eskimor
Member

eskimor commented Jun 12, 2023

Great work! On signature checking, do we know how much of it is approvals, vs. assignments?

@burdges

burdges commented Jun 12, 2023

ToF?

We need better benchmarks in schnorrkel, but verifying an approval should take roughly 40 microseconds, like ed25519, and verifying an assignment should take less than 3*25 + 40 = 115 microseconds, so in 1 millisecond on one core you'd get maybe 25 approvals, or 8 assignments, or 6 of each.
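Spelling out that back-of-envelope arithmetic (these are the estimates above, not measurements):

```rust
fn main() {
    // Estimated verification costs in microseconds, from the figures above.
    let approval_us = 40.0;                // ~ed25519 verify
    let assignment_us = 3.0 * 25.0 + 40.0; // = 115 us upper bound
    assert_eq!(assignment_us, 115.0);

    // Throughput per millisecond on one core.
    let per_ms_approvals = (1000.0 / approval_us) as u32;     // 25
    let per_ms_assignments = (1000.0 / assignment_us) as u32; // 8
    assert_eq!(per_ms_approvals, 25);
    assert_eq!(per_ms_assignments, 8);

    // "6 of each" fits in one millisecond: 6 * (40 + 115) = 930 us.
    assert!(6.0 * (approval_us + assignment_us) <= 1000.0);
    println!("25 approvals, or 8 assignments, or 6 of each per ms");
}
```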

I've some tricks to speed up the VRF parts:

  1. Actually use all the features correctly, ala Support SIMD on Rust stable dalek-cryptography/curve25519-dalek#520
  2. We can prove multiple VRFs with one signature, which helps our average case some, and really helps if attackers push no-shows on multiple candidates.
  3. Thin VRF merges the verification equations, which maybe saves something right now, but definitely makes batching cheaper.

An elligator hash-to-curve should be fast, but we could cache the hash-to-curve if we uniformized them by excluding the validator public key from the hashed-to-curve input. At the extreme we could compute a base point table for those uniformized hash-to-curves, which saves 10 microseconds if not batching.

I'd think sr25519 and ed25519 cost the same, so we'd only speed up approvals verification through either batching or else by adopting a considerably larger but non-batchable signature, like Rabin-Williams. We no longer think network throughput bottlenecks us here?

We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?

handle_actions(issue_approval): every 6 seconds we have around 6 messages of this operation, each taking 3-4 milliseconds to process, with all the useful time being spent signing the approval here.

Why 6? An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here? We need roughly 25+25+15 = 65 microseconds for VRF signing, btw.

ActiveLeaves - takes about 20 millis every 6 seconds.

Is this querying approvals data or something about the chain?

Occasionally (once or twice a second) approval-voting db flushes take 1-2 milliseconds

mmap fixes this one. The approval-voting db needs to be serialized in memory for this, but that's not really problematic.

@sandreim
Contributor Author

sandreim commented Jun 12, 2023

Spent some time introspecting and profiling approval-voting to understand why we see the high ToF in approval-distribution. Here are my findings, in order of impact:

Nice work @alexggh

  1. We are processing a lot of assignments and votes, which come in bursts, and the crypto for checking them takes 200-300 microseconds (with rare occasions where it takes 2-3 ms). That caps the full capacity of approval-voting assignment/vote processing at about 3k checks per second.
    At peak performance that's how many assignments/approvals we are processing, so if the messages arrive in the approval-distribution queue at about the same time, just by the nature of processing them serially we are going to see higher than 1s ToF. There are also a lot of messages that contain more than one assignment, so some messages keep approval-distribution busy for a few milliseconds simply because processing all the assignments/votes in the message takes time, see process_incoming_peer_message. Besides this, our system doesn't process assignments/approvals at peak capacity because of the items below.

Reducing the number of assignments via paritytech/polkadot#6782 will definitely help with the number of messages we have to process.

Yes! But this improvement becomes less useful if we just scale the validators up. It is more useful when we actually need to do more work per validator - more parachains.

  2. There are other operations that keep the approval-voting main loop busy, occasionally:
  • handle_actions(issue_approval): every 6 seconds we have around 6 messages of this operation, each taking 3-4 milliseconds to process, with all the useful time being spent signing the approval here.
  • ActiveLeaves takes about 20 ms every 6 seconds.
  • Occasionally (once or twice a second) approval-voting db flushes take 1-2 milliseconds.

While these operations happen relatively rarely, they do happen at peak time, so they might be just the straw that breaks the camel's back.

  3. Occasionally (every few seconds), there seem to be situations where approval-distribution waits on approval-voting for 3-4 ms, but approval-voting isn't doing any work; it is just stuck in the main select. My theory is that in those cases the subsystem future doesn't get executed because the system is busy executing other tasks. (Do you have any other theory?)

I also observed this some time ago. My theory back then:

  • both approval-voting and approval-distribution run on their own threads (for obvious reasons), but that doesn't mean anything in terms of priority
  • it might happen that, due to system load, the OS prefers to schedule other threads over the approval-voting thread
  • we can also observe the high CPU load of the networking stack, which I believe also plays a role here

@alexggh
Contributor

alexggh commented Jun 12, 2023

ToF?

Time of flight: the time a message that has arrived in the approval-distribution queue waits until it gets picked up for processing.

do we know how much of it is approvals, vs. assignments?

They seem to be evenly spread, e.g. a 1-second snapshot: 1717 assignments adding up to 640 ms of processing time, and 2040 approvals adding up to 360 ms.
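Dividing those snapshot totals out gives the implied per-item import cost (just arithmetic on the numbers above):

```rust
fn main() {
    // 640 ms across 1717 assignments, 360 ms across 2040 approvals.
    let assignment_us = 640_000.0_f64 / 1717.0; // ~373 us per assignment import
    let approval_us = 360_000.0_f64 / 2040.0;   // ~176 us per approval import

    assert!((assignment_us - 372.7).abs() < 0.1);
    assert!((approval_us - 176.5).abs() < 0.1);
    println!("assignment ~{:.0} us, approval ~{:.0} us", assignment_us, approval_us);
}
```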

We need better benchmarks in schnorrkel, but verifying an approval should take roughly 40 microseconds, like ed25519, and verifying an assignment should take less than 3*25 + 40 = 115 microseconds, so in 1 millisecond on one core you'd get maybe 25 approvals, or 8 assignments, or 6 of each.

I'll dig a bit deeper here, but in both my instrumentation and the tracing we have here, assignment imports seem to take 200-300 microseconds and approval imports between 100 and 200 microseconds. Also, I noticed rare cases where checking a RelayVRFDelay assignment takes 1-2 ms.

handle_actions(issue_approval): every 6 seconds we have around 6 messages of this operation, each taking 3-4 milliseconds to process, with all the useful time being spent signing the approval here.

Why 6?

It seems we are doing it per candidate, so I think we have more at the same time.

An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here?

What problems are you thinking about?

We need roughly 25+25+15 = 65 microseconds for VRF signing, btw.

Will re-check it.

ActiveLeaves - takes about 20 millis every 6 seconds.

Is this querying approvals data or something about the chain?

Yes, it is asking the runtime for block information a few times.

I've some tricks to speed up the VRF parts..

Actually use all the features correctly, ala dalek-cryptography/curve25519-dalek#520

Will look to see if this is used in the images I did the testing with; if not, I'll check what impact it has.

We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?

Are you referring to paritytech/polkadot#6782? If yes, the answer is no: I wasn't running that PR when I did the measurements.

We no longer think network throughput bottlenecks us here?

The messages are already in the approval-distribution queue, so I would say the network shouldn't be the bottleneck here.

@burdges

burdges commented Jun 12, 2023

We no longer think network throughput bottlenecks us here?

The messages are already in the approval-distribution queue, so I would say the network shouldn't be the bottleneck here.

If I understand, we do not currently know what performance issues the networking itself brings, so we should not yet assume we can make life harder for the networking either.

Actually use all the features correctly, ala dalek-cryptography/curve25519-dalek#520

Will look to see if this is used in the images I did the testing with; if not, I'll check what impact it has.

Ask @koute maybe?

Is this querying approvals data or something about the chain?

Yes, it is asking the runtime for block information a few times.

Relay VRF preoutputs probably, maybe something else. We copy & keep the relay VRF output outside the chain once we start approvals work, no?

An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here?

What problems are you thinking about?

I've no idea how the keystore works internally, but unrelated:

I noticed one reason signing might be slow or unpredictable: we do a syscall every time we sign anything in https://github.com/w3f/schnorrkel/blob/master/src/lib.rs#L228, which gives Linux an excuse to bump the thread. I'd previously used ThreadRng but removed it to simplify maintenance in w3f/schnorrkel@c333370. I'll revert that commit, fix rand, and make some other changes.

They seem to be evenly spread, e.g. a 1-second snapshot: 1717 assignments adding up to 640 ms of processing time, and 2040 approvals adding up to 360 ms.

So 3x and 4x longer than my guesstimates for the bare crypto, respectively. We do need better benchmarks on schnorrkel of course, like the rand issue, for example.

We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?

Are you referring to paritytech/polkadot#6782? If yes, the answer is no: I wasn't running that PR when I did the measurements.

No. We discussed batching together other approval messages too, but maybe no issue exists for it yet. If we're already signing 6 around the same time, then maybe this saves a lot.

@koute
Contributor

koute commented Jun 12, 2023

Will look to see if this is used on the images I did the testing, if not I'll check what impact it has.

It's not. This still requires passing a special flag to the compiler when compiling polkadot so that the SIMD instructions actually get used.

But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.

The expected speedup thanks to this is somewhere between ~30%-50%, depending on the hardware.

@alexggh
Contributor

alexggh commented Jun 13, 2023

No. We discussed batching together other approval messages too, but maybe no issue exists for it yet. If we're already signing 6 around the same time, then maybe this saves a lot.

It doesn't exist yet; I think it is this one: #701

Relay VRF preoutputs probably, maybe something else. We copy & keep the relay VRF output outside the chain once we start approvals work, no?

Yes, we do that.

used ThreadRng but removed it to simplify maintenance in w3f/schnorrkel@c333370 I'll revert that commit, fix rand, and make some other changes.

Thank you, I'll try to see if I can obtain some numbers with that reverted.

But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.

@koute Could you point me to the PR, so I can bring it locally and get out some numbers.

@koute
Contributor

koute commented Jun 13, 2023

But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.

@koute Could you point me to the PR, so I can bring it locally and get out some numbers.

It's merged to curve25519-dalek's main, so if you pull that in it'll automatically work as long as the simd cargo feature is enabled. (Note: there are two backends, and the faster avx512 one requires a nightly compiler; when compiled on a non-nightly compiler it will be silently disabled. The crate will automatically pick the fastest available one at runtime.)

That said, actually integrating it is a little bit more complex because the new version of curve25519-dalek has breaking changes and it needs to be replaced in a few crates (schnorrkel, ed25519-zebra, sp-io), so it's not just a simple job of bumping the deps. I don't know whether you actually want to invest your time into this right now. (Hopefully there will be a new official release of curve25519-dalek soon-ish and then I'll be making the relevant PRs myself to pull it in.)

If you're interested, here are some performance numbers from curve25519-dalek's own benchmarks for its two available SIMD backends. I took these on a GCP VM.

Scalar vs AVX2
testcase scalar avx2 %
multiscalar benches/Constant-time variable-base multiscalar multiplication/384 5896.6 3886.29 -34.09%
multiscalar benches/Constant-time variable-base multiscalar multiplication/768 11792.0 7778.8 -34.03%
multiscalar benches/Constant-time variable-base multiscalar multiplication/512 7843.0 5194.8 -33.77%
multiscalar benches/Constant-time variable-base multiscalar multiplication/16 271.05 179.74 -33.69%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 15724.0 10432.0 -33.66%
multiscalar benches/Constant-time variable-base multiscalar multiplication/64 1005.10 669.14 -33.43%
multiscalar benches/Constant-time variable-base multiscalar multiplication/128 1973.4 1315.19 -33.35%
multiscalar benches/Constant-time variable-base multiscalar multiplication/256 3902.8 2601.79 -33.34%
multiscalar benches/Constant-time variable-base multiscalar multiplication/32 512.99 342.09 -33.31%
multiscalar benches/Constant-time variable-base multiscalar multiplication/8 151.03 102.31 -32.26%
multiscalar benches/Constant-time variable-base multiscalar multiplication/4 91.36 63.62 -30.36%
multiscalar benches/Constant-time variable-base multiscalar multiplication/2 61.51 44.24 -28.07%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 6677.2 4906.0 -26.53%
multiscalar benches/Variable-time variable-base multiscalar multiplication/512 3775.3 2794.1 -25.99%
multiscalar benches/Variable-time variable-base multiscalar multiplication/384 3034.5 2246.9 -25.95%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1 46.62 34.53 -25.92%
multiscalar benches/Variable-time variable-base multiscalar multiplication/256 2184.5 1621.3 -25.78%
multiscalar benches/Variable-time variable-base multiscalar multiplication/768 5258.20 3903.9 -25.76%
multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) 99.18 77.43 -21.93%
multiscalar benches/Variable-time variable-base multiscalar multiplication/64 616.72 481.67 -21.90%
multiscalar benches/Variable-time variable-base multiscalar multiplication/128 1201.6 938.67 -21.88%
multiscalar benches/Variable-time variable-base multiscalar multiplication/16 177.74 139.12 -21.73%
multiscalar benches/Variable-time variable-base multiscalar multiplication/32 324.31 253.95 -21.70%
multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) 65.02 50.92 -21.68%
multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) 166.7 130.68 -21.61%
multiscalar benches/Variable-time variable-base multiscalar multiplication/8 104.01 81.73 -21.42%
multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) 2172.5 1712.1 -21.19%
multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) 566.45 446.53 -21.17%
multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) 301.09 237.38 -21.16%
multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) 3249.5 2562.60 -21.14%
multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) 4320.2 3409.60 -21.08%
multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) 1097.6 866.8 -21.03%
multiscalar benches/Variable-time variable-base multiscalar multiplication/4 67.03 52.98 -20.96%
multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) 47.33 37.52 -20.73%
multiscalar benches/Variable-time variable-base multiscalar multiplication/2 48.53 38.59 -20.49%
multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) 92.14 73.28 -20.47%
edwards benches/Constant-time variable-base scalar mul 44.45 35.43 -20.27%
multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) 155.31 124.24 -20.01%
multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) 59.81 47.89 -19.92%
multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) 59.79 47.89 -19.89%
multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) 89.27 71.62 -19.77%
multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) 44.73 35.99 -19.54%
multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) 44.72 35.98 -19.52%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1 39.15 31.54 -19.44%
multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) 3986.4 3218.6 -19.26%
multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) 2007.3 1620.9 -19.25%
multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) 2995.6 2419.79 -19.22%
multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) 521.72 421.92 -19.13%
multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) 1015.09 821.04 -19.12%
multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) 277.22 224.61 -18.98%
multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) 6329.90 5141.4 -18.78%
multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) 146.78 119.67 -18.47%
multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) 37.08 30.24 -18.45%
multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) 37.08 30.24 -18.44%
multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) 37.07 30.24 -18.42%
multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) 8438.0 6894.3 -18.29%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 43.61 35.93 -17.60%
multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) 2808.39 2320.3 -17.38%
multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) 1879.1 1553.39 -17.33%
multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) 3736.60 3090.6 -17.29%
multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) 260.52 215.51 -17.28%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 57.79 47.82 -17.25%
multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) 949.81 786.7 -17.17%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 36.41 30.16 -17.16%
multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) 488.12 404.41 -17.15%
multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) 5834.1 4864.0 -16.63%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 85.91 71.62 -16.63%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 141.72 119.42 -15.74%
multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) 7773.79 6565.1 -15.55%
edwards benches/Variable-time aA+bB A variable B fixed 40.69 34.62 -14.91%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 252.32 215.53 -14.58%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 473.06 405.01 -14.39%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 1819.19 1558.6 -14.32%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 2715.5 2328.6 -14.25%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 3614.5 3099.8 -14.24%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 918.51 788.72 -14.13%
multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) 5413.2 4667.6 -13.77%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 5424.29 4683.5 -13.66%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 7260.2 6339.5 -12.68%
multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) 7231.8 6320.5 -12.60%
edwards benches/Constant-time fixed-base scalar mul 11.56 12.72 9.99%
montgomery benches/Montgomery pseudomultiplication 50.93 55.64 9.25%
montgomery benches/Constant-time fixed-base scalar mul 15.57 16.56 6.36%
scalar benches/Scalar addition 23511000.0 22964000.0 -2.33%
scalar benches/Batch scalar inversion/1 13.10 13.21 0.89%
scalar benches/Scalar inversion 12.81 12.92 0.88%
scalar benches/Batch scalar inversion/4 13.65 13.76 0.85%
scalar benches/Batch scalar inversion/8 14.36 14.48 0.81%
scalar benches/Batch scalar inversion/16 15.80 15.92 0.75%
scalar benches/Batch scalar inversion/2 13.28 13.38 0.74%
ristretto benches/Batch Ristretto double-and-encode/16 13.30 13.21 -0.68%
ristretto benches/Batch Ristretto double-and-encode/8 8.70 8.66 -0.41%
scalar benches/Scalar subtraction 21408000.0 21485000.0 0.36%
scalar benches/Scalar multiplication 91928000.0 91662000.0 -0.29%
ristretto benches/Batch Ristretto double-and-encode/4 6.35 6.34 -0.23%
edwards benches/EdwardsPoint compression 3.95 3.95 0.12%
ristretto benches/RistrettoPoint decompression 4.61 4.61 0.07%
ristretto benches/Batch Ristretto double-and-encode/1 4.58 4.58 -0.06%
edwards benches/EdwardsPoint decompression 4.31 4.32 0.04%
ristretto benches/Batch Ristretto double-and-encode/2 5.18 5.18 -0.01%
ristretto benches/RistrettoPoint compression 4.57 4.57 0.00%
Scalar vs AVX512
testcase scalar avx512 %
multiscalar benches/Constant-time variable-base multiscalar multiplication/768 11792.0 5123.8 -56.55%
multiscalar benches/Constant-time variable-base multiscalar multiplication/384 5896.6 2567.0 -56.47%
multiscalar benches/Constant-time variable-base multiscalar multiplication/512 7843.0 3427.9 -56.29%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 15724.0 6908.5 -56.06%
multiscalar benches/Constant-time variable-base multiscalar multiplication/256 3902.8 1718.6 -55.96%
multiscalar benches/Constant-time variable-base multiscalar multiplication/128 1973.4 872.49 -55.79%
multiscalar benches/Constant-time variable-base multiscalar multiplication/64 1005.10 449.67 -55.26%
multiscalar benches/Constant-time variable-base multiscalar multiplication/32 512.99 236.33 -53.93%
multiscalar benches/Constant-time variable-base multiscalar multiplication/16 271.05 128.61 -52.55%
multiscalar benches/Variable-time variable-base multiscalar multiplication/768 5258.20 2529.79 -51.89%
multiscalar benches/Variable-time variable-base multiscalar multiplication/384 3034.5 1475.7 -51.37%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 6677.2 3251.79 -51.30%
multiscalar benches/Variable-time variable-base multiscalar multiplication/512 3775.3 1854.60 -50.88%
multiscalar benches/Constant-time variable-base multiscalar multiplication/8 151.03 75.82 -49.79%
multiscalar benches/Variable-time variable-base multiscalar multiplication/256 2184.5 1104.0 -49.46%
multiscalar benches/Constant-time variable-base multiscalar multiplication/4 91.36 49.58 -45.72%
multiscalar benches/Constant-time variable-base multiscalar multiplication/2 61.51 35.94 -41.56%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1 46.62 29.46 -36.80%
edwards benches/Constant-time variable-base scalar mul 44.45 29.97 -32.56%
multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) 4320.2 2954.0 -31.62%
multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) 3249.5 2223.6 -31.57%
multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) 2172.5 1490.89 -31.37%
multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) 1097.6 763.06 -30.48%
multiscalar benches/Variable-time variable-base multiscalar multiplication/128 1201.6 835.7 -30.45%
multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) 3986.4 2785.29 -30.13%
multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) 2995.6 2095.1 -30.06%
multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) 566.45 396.4 -30.02%
multiscalar benches/Variable-time variable-base multiscalar multiplication/64 616.72 431.84 -29.98%
multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) 6329.90 4435.2 -29.93%
multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) 2007.3 1406.9 -29.91%
multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) 8438.0 5921.59 -29.82%
multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) 301.09 212.54 -29.41%
multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) 1015.09 719.19 -29.15%
multiscalar benches/Variable-time variable-base multiscalar multiplication/32 324.31 230.12 -29.04%
multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) 3736.60 2666.6 -28.64%
multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) 2808.39 2007.1 -28.53%
multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) 166.7 119.31 -28.43%
multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) 5834.1 4180.5 -28.34%
multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) 1879.1 1347.39 -28.30%
multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) 521.72 374.48 -28.22%
multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) 7773.79 5599.5 -27.97%
multiscalar benches/Variable-time variable-base multiscalar multiplication/16 177.74 129.05 -27.39%
multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) 949.81 689.65 -27.39%
multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) 277.22 201.88 -27.18%
multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) 99.18 72.67 -26.73%
multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) 5413.2 3998.3 -26.14%
multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) 155.31 114.76 -26.11%
multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) 488.12 360.78 -26.09%
multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) 7231.8 5371.8 -25.72%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 7260.2 5398.5 -25.64%
multiscalar benches/Variable-time variable-base multiscalar multiplication/8 104.01 77.50 -25.48%
multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) 260.52 194.85 -25.21%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 5424.29 4059.99 -25.15%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 3614.5 2708.5 -25.07%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 2715.5 2039.8 -24.88%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 1819.19 1368.4 -24.78%
multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) 65.02 48.93 -24.75%
multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) 92.14 69.44 -24.64%
multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) 146.78 111.45 -24.07%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 918.51 699.73 -23.82%
multiscalar benches/Variable-time variable-base multiscalar multiplication/4 67.03 51.08 -23.79%
multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) 89.27 68.24 -23.56%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 473.06 365.51 -22.73%
multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) 59.81 46.43 -22.36%
multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) 59.79 46.43 -22.34%
multiscalar benches/Variable-time variable-base multiscalar multiplication/2 48.53 37.73 -22.25%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 252.32 196.43 -22.15%
multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) 47.33 36.85 -22.14%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1 39.15 30.78 -21.38%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 141.72 111.78 -21.13%
scalar benches/Scalar inversion 12.81 16.17 26.21%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 85.91 68.35 -20.44%
multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) 44.73 35.68 -20.23%
multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) 44.72 35.68 -20.20%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 57.79 46.42 -19.68%
multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) 37.07 29.93 -19.27%
multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) 37.08 29.94 -19.27%
multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) 37.08 29.95 -19.23%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 43.61 35.52 -18.54%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 36.41 29.77 -18.25%
edwards benches/Variable-time aA+bB A variable B fixed 40.69 33.53 -17.60%
montgomery benches/Constant-time fixed-base scalar mul 15.57 17.22 10.62%
scalar benches/Scalar multiplication 91928000.0 99047000.0 7.74%
edwards benches/Constant-time fixed-base scalar mul 11.56 12.40 7.29%
ristretto benches/Batch Ristretto double-and-encode/16 13.30 14.25 7.14%
montgomery benches/Montgomery pseudomultiplication 50.93 54.34 6.68%
scalar benches/Scalar addition 23511000.0 22191000.0 -5.61%
ristretto benches/Batch Ristretto double-and-encode/8 8.70 9.21 5.90%
ristretto benches/Batch Ristretto double-and-encode/4 6.35 6.60 3.86%
scalar benches/Batch scalar inversion/16 15.80 16.32 3.29%
ristretto benches/RistrettoPoint compression 4.57 4.69 2.67%
scalar benches/Batch scalar inversion/8 14.36 14.72 2.50%
ristretto benches/Batch Ristretto double-and-encode/2 5.18 5.31 2.47%
ristretto benches/RistrettoPoint decompression 4.61 4.71 2.23%
scalar benches/Batch scalar inversion/4 13.65 13.95 2.18%
edwards benches/EdwardsPoint decompression 4.31 4.39 1.72%
scalar benches/Batch scalar inversion/1 13.10 13.31 1.61%
scalar benches/Batch scalar inversion/2 13.28 13.47 1.45%
ristretto benches/Batch Ristretto double-and-encode/1 4.58 4.64 1.35%
scalar benches/Scalar subtraction 21408000.0 21139000.0 -1.26%
edwards benches/EdwardsPoint compression 3.95 3.95 0.05%
AVX2 vs AVX512
testcase avx2 avx512 %
multiscalar benches/Variable-time variable-base multiscalar multiplication/768 3903.9 2529.79 -35.20%
multiscalar benches/Variable-time variable-base multiscalar multiplication/384 2246.9 1475.7 -34.32%
multiscalar benches/Constant-time variable-base multiscalar multiplication/768 7778.8 5123.8 -34.13%
multiscalar benches/Constant-time variable-base multiscalar multiplication/512 5194.8 3427.9 -34.01%
multiscalar benches/Constant-time variable-base multiscalar multiplication/384 3886.29 2567.0 -33.95%
multiscalar benches/Constant-time variable-base multiscalar multiplication/256 2601.79 1718.6 -33.95%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 10432.0 6908.5 -33.78%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 4906.0 3251.79 -33.72%
multiscalar benches/Constant-time variable-base multiscalar multiplication/128 1315.19 872.49 -33.66%
multiscalar benches/Variable-time variable-base multiscalar multiplication/512 2794.1 1854.60 -33.62%
multiscalar benches/Constant-time variable-base multiscalar multiplication/64 669.14 449.67 -32.80%
multiscalar benches/Variable-time variable-base multiscalar multiplication/256 1621.3 1104.0 -31.91%
multiscalar benches/Constant-time variable-base multiscalar multiplication/32 342.09 236.33 -30.92%
multiscalar benches/Constant-time variable-base multiscalar multiplication/16 179.74 128.61 -28.45%
multiscalar benches/Constant-time variable-base multiscalar multiplication/8 102.31 75.82 -25.88%
multiscalar benches/Constant-time variable-base multiscalar multiplication/4 63.62 49.58 -22.06%
scalar benches/Scalar inversion 12.92 16.17 25.11%
multiscalar benches/Constant-time variable-base multiscalar multiplication/2 44.24 35.94 -18.76%
edwards benches/Constant-time variable-base scalar mul 35.43 29.97 -15.41%
multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) 6320.5 5371.8 -15.01%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 6339.5 5398.5 -14.84%
multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) 6565.1 5599.5 -14.71%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1 34.53 29.46 -14.68%
multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) 4667.6 3998.3 -14.34%
multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) 6894.3 5921.59 -14.11%
multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) 4864.0 4180.5 -14.05%
multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) 5141.4 4435.2 -13.74%
multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) 3090.6 2666.6 -13.72%
multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) 2320.3 2007.1 -13.50%
multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) 3218.6 2785.29 -13.46%
multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) 2419.79 2095.1 -13.42%
multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) 3409.60 2954.0 -13.36%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 4683.5 4059.99 -13.31%
multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) 1553.39 1347.39 -13.26%
multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) 2562.60 2223.6 -13.23%
multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) 1620.9 1406.9 -13.20%
multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) 1712.1 1490.89 -12.92%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 3099.8 2708.5 -12.62%
multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) 821.04 719.19 -12.40%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 2328.6 2039.8 -12.40%
multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) 786.7 689.65 -12.34%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 1558.6 1368.4 -12.20%
multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) 866.8 763.06 -11.97%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 788.72 699.73 -11.28%
multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) 421.92 374.48 -11.24%
multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) 446.53 396.4 -11.23%
multiscalar benches/Variable-time variable-base multiscalar multiplication/128 938.67 835.7 -10.97%
multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) 404.41 360.78 -10.79%
multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) 237.38 212.54 -10.46%
multiscalar benches/Variable-time variable-base multiscalar multiplication/64 481.67 431.84 -10.35%
multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) 224.61 201.88 -10.12%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 405.01 365.51 -9.75%
multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) 215.51 194.85 -9.59%
multiscalar benches/Variable-time variable-base multiscalar multiplication/32 253.95 230.12 -9.38%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 215.53 196.43 -8.86%
multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) 130.68 119.31 -8.70%
multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) 124.24 114.76 -7.63%
scalar benches/Scalar multiplication 91662000.0 99047000.0 8.06%
ristretto benches/Batch Ristretto double-and-encode/16 13.21 14.25 7.88%
multiscalar benches/Variable-time variable-base multiscalar multiplication/16 139.12 129.05 -7.24%
multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) 119.67 111.45 -6.87%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 119.42 111.78 -6.40%
multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) 77.43 72.67 -6.15%
ristretto benches/Batch Ristretto double-and-encode/8 8.66 9.21 6.33%
multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) 73.28 69.44 -5.24%
multiscalar benches/Variable-time variable-base multiscalar multiplication/8 81.73 77.50 -5.17%
multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) 71.62 68.24 -4.72%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 71.62 68.35 -4.58%
ristretto benches/Batch Ristretto double-and-encode/4 6.34 6.60 4.10%
multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) 50.92 48.93 -3.92%
montgomery benches/Constant-time fixed-base scalar mul 16.56 17.22 4.01%
multiscalar benches/Variable-time variable-base multiscalar multiplication/4 52.98 51.08 -3.58%
scalar benches/Scalar addition 22964000.0 22191000.0 -3.37%
edwards benches/Variable-time aA+bB A variable B fixed 34.62 33.53 -3.16%
multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) 47.89 46.43 -3.06%
multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) 47.89 46.43 -3.05%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 47.82 46.42 -2.93%
ristretto benches/RistrettoPoint compression 4.57 4.69 2.67%
scalar benches/Batch scalar inversion/16 15.92 16.32 2.52%
edwards benches/Constant-time fixed-base scalar mul 12.72 12.40 -2.45%
ristretto benches/Batch Ristretto double-and-encode/2 5.18 5.31 2.48%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1 31.54 30.78 -2.40%
montgomery benches/Montgomery pseudomultiplication 55.64 54.34 -2.35%
multiscalar benches/Variable-time variable-base multiscalar multiplication/2 38.59 37.73 -2.22%
ristretto benches/RistrettoPoint decompression 4.61 4.71 2.16%
multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) 37.52 36.85 -1.78%
edwards benches/EdwardsPoint decompression 4.32 4.39 1.69%
scalar benches/Batch scalar inversion/8 14.48 14.72 1.67%
scalar benches/Scalar subtraction 21485000.0 21139000.0 -1.61%
ristretto benches/Batch Ristretto double-and-encode/1 4.58 4.64 1.41%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 30.16 29.77 -1.32%
scalar benches/Batch scalar inversion/4 13.76 13.95 1.32%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 35.93 35.52 -1.14%
multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) 30.24 29.93 -1.04%
multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) 30.24 29.94 -1.00%
multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) 30.24 29.95 -0.96%
multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) 35.99 35.68 -0.86%
multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) 35.98 35.68 -0.84%
scalar benches/Batch scalar inversion/1 13.21 13.31 0.72%
scalar benches/Batch scalar inversion/2 13.38 13.47 0.70%
edwards benches/EdwardsPoint compression 3.95 3.95 -0.07%

@burdges

burdges commented Jun 13, 2023

I think the lower-level crypto optimizations should be somewhat orthogonal. It's clear they'll happen over the next couple of months regardless, so..

If you want to focus on our real protocol optimizations, that's fine too. Ask @eskimor what he thinks. We do need to know when the crypto optimizations merge, of course, so that we do not mistakenly conclude some protocol optimization did stuff it did not do.

We can otoh quickly cut a schnorrkel version without the syscall if you want to see whether that's what is causing threads to be bumped or whatever.

@alexggh
Contributor

alexggh commented Jun 13, 2023

I think the lower-level crypto optimizations should be somewhat orthogonal. It's clear they'll happen over the next couple of months regardless, so..

Yes, I agree they are orthogonal. I just went this route because I wanted to understand where we spend most of the time. With that in mind, I think in order of impact we should do the following:

  1. Implement approval-voting: coalesce multiple approvals in a single signed message #701, because it is clear that if we do fewer signature checks we should gain some performance.
  2. After we get 1) implemented, measure it together with the crypto improvements we have in the pipeline and decide whether parallelization of the work is actually what will give us the desired gains.

Don't you agree @sandreim @eskimor @burdges ?

@sandreim
Contributor Author

sandreim commented Jun 13, 2023

In principle it sounds good, but do we know how much time the signature checks take vs the internal bookkeeping that is done by approval-voting?

At a scale of 1k paravalidators and 200 parachains, under normal operation with 30 needed approvals and 6 VRF modulo samples, the question we are asking is basically: can we check 1k approvals faster in a single thread than 6k on multiple threads (assuming we implement parallel processing)?
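For intuition on the "multiple threads" side of that question, here is a minimal sketch using plain `std::thread`; everything here is a stand-in (`verify` just burns CPU in place of the real schnorrkel signature check, which costs roughly 400 us - 1 ms per message):

```rust
use std::thread;

// Stand-in for a schnorrkel assignment/approval check: burns some CPU and
// reports success. Not the real API, only a placeholder for its cost.
fn verify(msg: u64) -> bool {
    let mut acc = msg;
    for i in 0..10_000u64 {
        acc = acc.wrapping_mul(6364136223846793005).wrapping_add(i);
    }
    // Always true for a u64; a real check would return the verification result.
    acc.count_ones() <= 64
}

// Split a batch of messages across `workers` threads and count the valid ones.
fn check_parallel(messages: Vec<u64>, workers: usize) -> usize {
    let chunk = (messages.len() + workers - 1) / workers;
    let handles: Vec<_> = messages
        .chunks(chunk.max(1))
        .map(|c| {
            let c = c.to_vec();
            thread::spawn(move || c.into_iter().filter(|m| verify(*m)).count())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Splitting a batch this way scales check throughput with core count; the trade-off raised in the thread is whether those cores are better spent on other subsystem work, given fixed validator hardware.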

@burdges

burdges commented Jun 13, 2023

In principle it sounds good, but do we know how much time the signature checks take vs the internal bookkeeping that is done by approval-voting?

Yes, this was always my question. :)

I suspect this syscall in schnorrkel is a major cost, since it lets Linux bump the thread.

Can we check 1k approvals faster in a single thread than 6k on multiple threads (assuming we implement parallel processing)?

Not really, but we could use those other threads for checking more parachain blocks or whatever.

At a high level, we're not exactly in the shoes of a web2 company who, once they parallelize, can simply throw more hardware at the problem. We have the hardware we told validators they could buy, and upgrading that requires community work.

@sandreim
Contributor Author

TBH, not sure that the syscall would result in the CPU being given to another thread, but we could actually test the two scenarios and see what the result is.

Totally agree with you. The thing is that with the current architecture we cannot dynamically scale to process stuff on more than 1 CPU to handle bursty workloads. Assuming 1k paravalidators and 200 paras, I think we need to worry about the specs, since that means at most 6-7 candidates to check (1 to back, 6 to approve) and 2x that with async backing. We could add more validators so the number of checks per validator goes down, but that means we need to be able to handle more gossip traffic still.
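The 6-7 candidates figure above is simple averaging; a minimal sketch with the numbers from this thread (the function name is hypothetical):

```rust
// Back-of-the-envelope approval load per validator per relay-chain block,
// using the figures from the discussion: 1000 validators, 200 parachains,
// 30 needed approvals per candidate. Approval checks are spread evenly
// across all validators on average; add ~1 backing check on top to reach
// the 6-7 candidates mentioned in the thread.
fn expected_approval_checks(validators: u64, parachains: u64, needed_approvals: u64) -> u64 {
    parachains * needed_approvals / validators
}
```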

@alexggh
Contributor

alexggh commented Jun 13, 2023

In principle it sounds good, but do we know how much time the signature checks take vs the internal bookkeeping that is done by approval-voting?

At least in the approval-voting subsystem, out of 1 second of useful work around 700 ms is spent doing just the crypto operations for assignments and approvals.

I'll get back when I have a similar breakdown for approval-distribution as well.
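If roughly 70% of approval-voting's time is parallelizable signature checking, Amdahl's law bounds what parallelizing only that part can buy; a minimal sketch (the standard formula, with the 0.7 fraction taken from the measurement above):

```rust
// Amdahl's law: with a fraction `p` of the work parallelizable across `n`
// workers, the overall speedup is 1 / ((1 - p) + p / n).
fn amdahl_speedup(p: f64, n: f64) -> f64 {
    1.0 / ((1.0 - p) + p / n)
}
```

With p = 0.7, four workers give about a 2.1x speedup and unbounded parallelism caps out near 3.3x, which is one reason reducing the total number of checks (e.g. #701) matters alongside parallelization.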

@alexggh
Contributor

alexggh commented Jun 14, 2023

Some measurements regarding where the time is spent in approval-distribution: on average, out of 1 second of work, 750 ms is spent waiting for approval-voting to check the signatures. So I would say our book-keeping is not where the main part of the work happens. Example of a 1 s breakdown at peak:

  • 15K messages coming in from peers
  • 938 assignments checked in 379 ms
  • 1373 approvals checked in 338 ms
  • the rest (~250 ms): book-keeping and forwarding messages to the network bridge.

One thing I noticed while performing these measurements is that we've got cases where we get multiple assignments & approvals from peers in the same message, here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L568 and here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L607, but because we process those assignments one by one we are going to fan out those messages into multiple messages sent to the same peer. On average we have 10-15 approvals in those messages, but I have seen cases with even 328 approvals in one; all of those might translate into a single message for each approval.

I think this inefficiency is low-hanging fruit, and fixing it should help us at least reduce the network traffic/load.
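The batching fix suggested here can be sketched as follows; the types are simplified stand-ins, not the real network-protocol messages:

```rust
use std::collections::HashMap;

// Simplified stand-in types; real messages carry full approval votes.
type PeerId = u32;
type Approval = (u32 /* candidate index */, u64 /* vote stand-in */);

// Instead of forwarding each processed approval as its own notification,
// group everything destined for the same peer into one batched message.
fn coalesce(outgoing: Vec<(PeerId, Approval)>) -> HashMap<PeerId, Vec<Approval>> {
    let mut batches: HashMap<PeerId, Vec<Approval>> = HashMap::new();
    for (peer, approval) in outgoing {
        batches.entry(peer).or_default().push(approval);
    }
    batches
}
```

Grouping by peer turns up to a few hundred per-approval notifications into a single message per peer, directly reducing the notification counts.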

@sandreim
Copy link
Contributor Author

Some measurements regarding where the time is spent in approval-distribution: on average, out of 1 second of work, 750 ms is spent waiting for approval-voting to check the signatures. So I would say our book-keeping is not where the main part of the work happens. Example of a 1 s breakdown at peak:

  • 15K messages coming in from peers
  • 938 assignments checked in 379 ms
  • 1373 approvals checked in 338 ms
  • the rest (~250 ms): book-keeping and forwarding messages to the network bridge.

Thanks for the numbers @alexggh. It appears that book-keeping is not eating a lot of time, but we might be able to improve there as well.

One thing I noticed while performing these measurements is that we've got cases where we get multiple assignments & approvals from peers in the same message, here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L568 and here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L607, but because we process those assignments one by one we are going to fan out those messages into multiple messages sent to the same peer. On average we have 10-15 approvals in those messages, but I have seen cases with even 328 approvals in one; all of those might translate into a single message for each approval.

I think this inefficiency is low-hanging fruit, and fixing it should help us at least reduce the network traffic/load.

Sounds like a good idea; it should clearly reduce the notification counts.

alexggh referenced this issue in alexggh/polkadot Jun 19, 2023
…er approval-voting

In the current implementation, every time we process an assignment or an
approval that needs checking after the approval voting, we wait until
approval-voting answers the message.

Given that approval-voting executes signature checks that take
significant time (between 400 us and 1 ms) per message, that's where most of
the time in approval-distribution is spent; see
https://github.com/paritytech/polkadot/issues/6608#issuecomment-1590942235 for
the numbers.

So, modify approval-distribution so that it picks another message from the
queue while approval-voting is busy doing its work.

This will have a few benefits:
1. Better pipelining of the messages: approval-voting will always have work to
do and won't have to wait for approval-distribution to send it a message.
Additionally, some of the work of approval-distribution will be executed in
parallel with work in approval-voting instead of serially.
2. By allowing approval-distribution to process messages from its queue while
approval-voting confirms that a message is valid, we give
approval-distribution the ability to build a better view of what messages
other peers already know, so it won't decide to gossip messages to some of its
peers once we confirm that message as being correct.
3. It opens the door for other optimizations in the approval-voting subsystem,
which would still be the bottleneck.

Note!
I still expect the amount of work this combination of two subsystems can do to
still be bounded by the number of signature checks it has to do, so we would
have to stack this with other optimizations we have in the queue:
- https://github.com/paritytech/polkadot/issues/6608
- https://github.com/paritytech/polkadot/issues/6831

[ ] Evaluate impact on Versi
[ ] Clean up code and make CI happy so the PR is mergeable.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
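The queueing-while-checking idea in this commit message, together with the bounded-queue backpressure called for in the issue description, can be sketched with std primitives (names and structure are hypothetical; the real subsystems communicate over subsystem channels):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// A bounded channel models the "fixed number of pending imports" backpressure:
// once `max_pending` checks are in flight, the sender blocks instead of
// queueing unboundedly.
fn pipeline(total_msgs: u64, max_pending: usize) -> u64 {
    let (tx, rx) = sync_channel::<u64>(max_pending);

    // Stand-in for approval-voting: drains the queue and "checks" each message.
    let voting = thread::spawn(move || {
        let mut checked = 0u64;
        while rx.recv().is_ok() {
            checked += 1; // real code: verify the assignment/approval signature
        }
        checked
    });

    // Stand-in for approval-distribution: keeps feeding work instead of
    // waiting for each check to finish before touching the next message.
    for msg in 0..total_msgs {
        tx.send(msg).unwrap(); // blocks only when `max_pending` are in flight
        // ...book-keeping for `msg` can proceed here, overlapping the check...
    }
    drop(tx); // close the channel so the worker terminates
    voting.join().unwrap()
}
```

The key design point is the channel bound: an unbounded queue would hide the bottleneck and grow without limit under bursty load, while a bound propagates backpressure up to the network layer.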
alexggh referenced this issue in alexggh/polkadot Jun 19, 2023
…voting

@alexggh
Contributor

alexggh commented Jul 25, 2023

Putting this on hold in favor of #701.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
* chore: format code

* format code

* add format hook

* add lint when commit
@alexggh
Contributor

alexggh commented May 31, 2024

Will be addressed with #1617

@alexggh alexggh closed this as completed Aug 28, 2024
Labels
I4-refactor Code needs refactoring.
Projects
Status: Completed
Development

No branches or pull requests

7 participants