availability-recovery: move cpu burners in blocking tasks #7417

sandreim · 2023-06-22T10:16:00Z

Fixes #7411

TODO:

Address some availability data cloning issues in the impl
Ensure chunks and backing group recovery works perfectly as before
Versi burn-in analysis

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

sandreim · 2023-06-22T10:37:30Z

burning this in on Versi as part of #7378

sandreim · 2023-06-22T13:39:56Z

Versi burn-in shows an impressive CPU utilization of the reconstructed_data_matches_root (re-encode in the screenshot) with 2.5MB PoVs.

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

node/network/availability-recovery/src/lib.rs

node/subsystem-types/src/errors.rs

node/network/availability-recovery/src/lib.rs

burdges · 2023-06-23T20:36:03Z

How long do these tasks take to finish?

It'd likely be better for cache locality if https://github.com/paritytech/reed-solomon-novelpoly itself were made multi-threaded, so then its tables could spend less time in L1. It's possible altering the tower of extension fields changes this somewhat, so maybe not the first thing to try, but otoh maybe easy, like just a rayon loop here

Update: L1 is per core, so this is not really true.

node/network/availability-recovery/src/lib.rs

sandreim · 2023-07-03T08:07:49Z

How long do these tasks take to finish?

It'd likely be better for cache locality if https://github.com/paritytech/reed-solomon-novelpoly itself were made multi-threaded, so then its tables could spend less time in L1. It's possible altering the tower of extension fields changes this somewhat, so maybe not the first thing to try, but otoh maybe easy, like just a rayon loop here

Yes, pretty sure we can be faster here, but I expect that in practice the PoVs would rarely be this big (vs current synthetic Glutton test). This change set just makes sure we don't block other async tasks when reconstructing or re-encoding after data has been retrieved.

burdges · 2023-07-03T08:50:27Z

Reencode takes almost 50% longer than reconstruct. I guess reconstruct does some extra work, as reencode should produce 4x as much data.

We should ask soramitsu to look at the rust code, given their leopard fork claims to do this 2x faster.
https://github.com/soramitsu/erasure-coding-crust

Your benchmarks are for 2.5 megs yes? We're actually 4x faster than their claims, but I'm not really sure what their test setup is, like what they mean by VM.

sandreim · 2023-07-03T09:28:33Z

Reencode takes almost 50% longer than reconstruct. I guess reconstruct does some extra work, as reencode should produce 4x as much data.

We should ask soramitsu to look at the rust code, given their leopard fork claims to do this 2x faster. https://github.com/soramitsu/erasure-coding-crust

Your benchmarks are for 2.5 megs yes? We're actually 4x faster than their claims, but I'm not really sure what their test setup is, like what they mean by VM.

Yes, it is 2.5MiB, but we would need to be sure we have similar test env and input to claim that either is faster.

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

eskimor

Looks good overall.

Some thoughts:

How quick is the erasure encoding/decoding for very small PoVs ... could it be that the overhead of sending the data to a different thread is not worth it for those?
The architecture of sending to a different thread/task, while we don't have anything else to do is a bit counter intuitive at first - worth adding comments, describing what you are doing here.

node/network/availability-recovery/src/lib.rs

sandreim · 2023-07-03T20:26:12Z

Looks good overall.

Some thoughts:

How quick is the erasure encoding/decoding for very small PoVs ... could it be that the overhead of sending the data to a different thread is not worth it for those?

Min value seems to be 96ms on some node (representative of 32K PoVs), max value is for 2.5MiB PoVs.

The architecture of sending to a different thread/task, while we don't have anything else to do is a bit counter intuitive at first - worth adding comments, describing what you are doing here.

Sure, will add some comments

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

sandreim · 2023-07-04T09:50:41Z

bot merge

burdges · 2023-07-04T21:13:47Z

Min value seems to be 96ms on some node (representative of 32K PoVs), max value is for 2.5MiB PoVs.

Is this reconstruction only? It's maybe the cost of building the walsh table. We tried to embed all the tables as constants, although initially we generated them at process start up, but you always need some table specific to the chunks form which your reconstructing.

Interestingly, this table could be reused if you've exactly the same chunks. I doubt this helps us, but maybe it buys you some orthogonality in paritytech/polkadot-sdk#607, although probably no.

sandreim added 4 commits June 22, 2023 10:08

Move expensive computations to blocking thread

91e5086

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

fix test

93acbcc

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

add internal error and fix dependent subystems

8278a77

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

fmt

ecc8701

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

sandreim added A3-in_progress Pull request is in progress. No review needed at this stage. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Jun 22, 2023

fix test fix

4c0f83a

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

minor refactor and TODOs

ce8335f

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

sandreim marked this pull request as ready for review June 23, 2023 08:36

sandreim requested review from eskimor and ordian June 23, 2023 08:36

ordian reviewed Jun 23, 2023

View reviewed changes

sandreim added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Jun 23, 2023

vstakhov approved these changes Jun 23, 2023

View reviewed changes

node/subsystem-types/src/errors.rs Outdated Show resolved Hide resolved

node/network/availability-recovery/src/lib.rs Outdated Show resolved Hide resolved

node/network/availability-recovery/src/lib.rs Outdated Show resolved Hide resolved

alexggh reviewed Jun 28, 2023

View reviewed changes

node/network/availability-recovery/src/lib.rs Show resolved Hide resolved

Impl Feedback for Review

6f0dcdb

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

sandreim requested a review from ordian July 3, 2023 13:48

eskimor approved these changes Jul 3, 2023

View reviewed changes

node/network/availability-recovery/src/lib.rs Outdated Show resolved Hide resolved

sandreim added 3 commits July 3, 2023 20:48

review feedback

ab5c932

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

More docs

3433174

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

add some example timings in comments

dd2b365

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

paritytech-processbot bot merged commit 7e1e635 into master Jul 4, 2023
3 checks passed

paritytech-processbot bot deleted the sandreim/av_recovery_thread branch July 4, 2023 09:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

availability-recovery: move cpu burners in blocking tasks #7417

availability-recovery: move cpu burners in blocking tasks #7417

sandreim commented Jun 22, 2023 •

edited

Loading

sandreim commented Jun 22, 2023

sandreim commented Jun 22, 2023 •

edited

Loading

burdges commented Jun 23, 2023 •

edited

Loading

sandreim commented Jul 3, 2023 •

edited

Loading

burdges commented Jul 3, 2023 •

edited

Loading

sandreim commented Jul 3, 2023 •

edited

Loading

eskimor left a comment

sandreim commented Jul 3, 2023

sandreim commented Jul 4, 2023

burdges commented Jul 4, 2023

availability-recovery: move cpu burners in blocking tasks #7417

availability-recovery: move cpu burners in blocking tasks #7417

Conversation

sandreim commented Jun 22, 2023 • edited Loading

sandreim commented Jun 22, 2023

sandreim commented Jun 22, 2023 • edited Loading

burdges commented Jun 23, 2023 • edited Loading

sandreim commented Jul 3, 2023 • edited Loading

burdges commented Jul 3, 2023 • edited Loading

sandreim commented Jul 3, 2023 • edited Loading

eskimor left a comment

Choose a reason for hiding this comment

sandreim commented Jul 3, 2023

sandreim commented Jul 4, 2023

burdges commented Jul 4, 2023

sandreim commented Jun 22, 2023 •

edited

Loading

sandreim commented Jun 22, 2023 •

edited

Loading

burdges commented Jun 23, 2023 •

edited

Loading

sandreim commented Jul 3, 2023 •

edited

Loading

burdges commented Jul 3, 2023 •

edited

Loading

sandreim commented Jul 3, 2023 •

edited

Loading