Improving blob propagation post-PeerDAS with Decentralized Blob Building #6268

Open · wants to merge 24 commits into base: unstable

Conversation

@jimmygchen (Member) commented Aug 16, 2024

Issue Addressed

Built on top of #5829, an optimization proposed by @michaelsproul: fetch blobs from the EL to reduce the delay to block import.

This PR goes further and publishes the fetched blobs to the network, which improves network resiliency by having nodes with more resources contribute to blob propagation. This experimental solution is an attempt to solve the self-building proposer bandwidth issue discussed on R&D Discord and described in @dankrad's post here.

The benefits of this proposal are:

  • Reduces block import latency: nodes can retrieve blobs from the EL without waiting for them to arrive via gossip, making blocks attestable earlier.
  • Improves blob propagation and network resiliency: blob propagation work is spread out from one node to the entire network, which reduces the likelihood of missed blocks due to propagation delays.
  • Allows scaling without sacrificing decentralization: nodes with more resources participate in blob building and propagation, allowing nodes with limited bandwidth to continue producing blocks post-PeerDAS.

Proposed Changes

  • Deneb: fetch_blobs_and_publish is triggered after a node has processed a gossip/RPC block and is still missing blob components. Once the node fetches the blobs from the EL, it publishes to the network only the blobs it hasn't seen on gossip (sketched below).
  • PeerDAS: same trigger as above, but only supernodes publish the data columns that are unseen on gossip.
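
A minimal sketch of this trigger and the "publish only what's unseen" rule. All names here (Node, Blob, fetch_blobs_from_el, publish) are illustrative stubs, not Lighthouse's actual API; the real logic lives in beacon_node/beacon_chain/src/fetch_blobs.rs:

```rust
#[derive(Clone)]
struct Blob {
    index: usize,
}

struct Node {
    // Per-blob-index "seen on gossip" flags (stub for the real seen cache).
    seen_on_gossip: Vec<bool>,
}

impl Node {
    // Stub for the engine_getBlobsV1 call: the EL may not have every
    // requested blob in its mempool, hence Option per index.
    fn fetch_blobs_from_el(&self, missing: &[usize]) -> Vec<Option<Blob>> {
        missing.iter().map(|&i| Some(Blob { index: i })).collect()
    }

    fn publish(&self, blob: &Blob) {
        println!("publishing blob {} to gossip", blob.index);
    }

    fn fetch_blobs_and_publish(&self, num_expected_blobs: usize) {
        // Indices the block commits to but which gossip hasn't delivered yet.
        let missing: Vec<usize> = (0..num_expected_blobs)
            .filter(|&i| !self.seen_on_gossip[i])
            .collect();
        if missing.is_empty() {
            return;
        }
        for blob in self.fetch_blobs_from_el(&missing).into_iter().flatten() {
            // Republish only blobs still unseen on gossip at publish time.
            if !self.seen_on_gossip[blob.index] {
                self.publish(&blob);
            }
        }
    }
}

fn main() {
    let node = Node { seen_on_gossip: vec![true, false, false] };
    node.fetch_blobs_and_publish(3); // fetches and publishes blobs 1 and 2
}
```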

Next steps:

  • To maintain low bandwidth for smaller stakers (single-validator BNs), we could allow some optimisation of block publish behaviour for these nodes only. There are some strategies proposed by @cskiraly to bring the outbound bandwidth requirements for a 32-blob block down to the same level as Deneb (6 blobs). However, this wouldn't be recommended for nodes with enough bandwidth.
  • Collect some realistic metrics for a network with 32 blobs per block.

Challenges:

  • Current KZG libraries (c-kzg-4844 and rust-eth-kzg) may struggle with constructing a large number of cells and proofs at once due to their current memory allocation approach.
  • Even if we are able to reduce the bandwidth usage on the CL side, the bandwidth challenge remains on the EL side, as the node still needs to pull the blob transactions into its mempool, although to a lesser extent, because:
    • it's dealing with raw blobs (4096 KB for 32 blobs) rather than erasure-coded blobs;
    • it's pull-based (eth/68), so it doesn't incur the same gossip amplification cost (8x) as the CL. (See the numbers below.)
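
For scale, the back-of-the-envelope numbers behind this point (blob size per the Deneb spec: 4096 field elements of 32 bytes each; the 2x erasure-coding extension and 8x gossip amplification factors are the ones cited above):

```rust
const FIELD_ELEMENTS_PER_BLOB: usize = 4096;
const BYTES_PER_FIELD_ELEMENT: usize = 32;
const BLOB_BYTES: usize = FIELD_ELEMENTS_PER_BLOB * BYTES_PER_FIELD_ELEMENT; // 128 KiB

fn main() {
    let blobs = 32;
    let raw_kib = blobs * BLOB_BYTES / 1024; // 4096 KiB: what the EL pulls
    let extended_kib = 2 * raw_kib;          // erasure-coded data handled on the CL
    let gossiped_kib = 8 * extended_kib;     // worst-case gossip amplification
    println!("raw: {raw_kib} KiB, extended: {extended_kib} KiB, gossiped: {gossiped_kib} KiB");
}
```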

TODO before merging

Reference:

@jimmygchen added the work-in-progress and das labels Aug 16, 2024
@jimmygchen marked this pull request as ready for review Aug 16, 2024
@jimmygchen added the ready-for-review label and removed the work-in-progress label Aug 16, 2024
@jimmygchen (Member, Author) commented Aug 19, 2024

Some early testing results:

  • Proposers that withhold all blobs still successfully propose blocks with blobs (100% success rate)
  • No outbound bandwidth spike for full nodes with limited upload

Bandwidth-limited fullnode (cl-01) vs supernode (cl-02):

[image: bandwidth dashboard for cl-01 vs cl-02]
(Thanks to @KatyaRyazantseva for the dashboard above☺️ )

This shows the EL inbound traffic (fetching blobs from peers) isn't too bad for MAX 6 blobs.
The EL outbound traffic is less relevant here because it includes sending blobs to the CL.

[image: EL traffic dashboard]

Next steps:

  • Add more metrics
    • Blocks made available via EL blobs
    • Number of blobs / data columns from EL blobs published
    • EL blob fetch timing
    • Compute cells and proof time
  • Make MAX_BLOBS_PER_BLOCK configurable
  • Try 32 blobs per block
    • EL gas constant update
    • Potential update on derived configs
    • Potential batching of KZG computation to avoid overflow (sketched below)
  • Add --limit-blob-publish (single-validator nodes only), which allows for fewer mesh peers on data column topics and withholding a certain amount of data columns
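
A hypothetical sketch of that batching idea: chunk the work so only a few blobs' worth of intermediate allocations are live at once. compute_cells_and_proofs is a stand-in name, not the actual c-kzg-4844 / rust-eth-kzg signature:

```rust
struct Blob;
struct CellsAndProofs;

// Stub for the per-blob KZG library call.
fn compute_cells_and_proofs(_blob: &Blob) -> CellsAndProofs {
    CellsAndProofs
}

fn compute_in_batches(blobs: &[Blob], batch_size: usize) -> Vec<CellsAndProofs> {
    let mut out = Vec::with_capacity(blobs.len());
    // Each chunk's intermediate allocations can be dropped before the next
    // chunk starts, instead of allocating for all 32 blobs at once.
    for batch in blobs.chunks(batch_size) {
        out.extend(batch.iter().map(compute_cells_and_proofs));
    }
    out
}
```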

@jimmygchen (Member, Author)

This was originally intended to be experimental, but it's been pretty stable: 800 epochs on the devnet with 100% participation, and I think we can merge this.
I'll address the review comments, thanks @dapplion for the review 🙏

@dapplion (Collaborator)

Add --limit-blob-publish (single-validator nodes only), which allows for fewer mesh peers on data column topics and withholding a certain amount of data columns

Bring the attack to prod

@kevaundray (Contributor)

@jimmygchen Any issues with the kzg libraries?

@jimmygchen (Member, Author)

Bring the attack to prod

🙈 Trying not to mention the flag I use for testing. Luckily, git blame on that line shows someone else's name haha

Just realized we need the execution-apis PR to be merged before merging this - although there's probably no harm in getting this one merged as soon as the inclusion is confirmed, as the fetch_blobs function handles this gracefully.

Resolved review threads on: beacon_node/beacon_chain/src/fetch_blobs.rs, beacon_node/beacon_chain/src/beacon_chain.rs, beacon_node/network/src/network_beacon_processor/mod.rs, consensus/types/src/beacon_block_body.rs
@jimmygchen (Member, Author)

@jimmygchen Any issues with the kzg libraries?

Nope, all good so far! I was just flagging potential challenges that I could imagine when we increase the blob count, but I haven't actually run into any issues yet. Will keep you updated, thanks ☺️

@jimmygchen added the spec_change and optimization labels Aug 20, 2024
@jimmygchen mentioned this pull request Aug 21, 2024
@jimmygchen added the waiting-on-author label and removed the ready-for-review label Aug 22, 2024
@mergify mergify bot deleted the branch sigp:unstable August 27, 2024 04:10
@mergify mergify bot closed this Aug 27, 2024
@michaelsproul michaelsproul reopened this Aug 27, 2024
@michaelsproul michaelsproul changed the base branch from das to unstable August 27, 2024 04:21
@jimmygchen (Member, Author)

engine_getBlobsV1 now implemented in

@jimmygchen added the waiting-on-author label and removed the ready-for-review label Sep 19, 2024
@jimmygchen (Member, Author) commented Oct 1, 2024

I think we should review blob publishing before merging this to avoid adding excessive bandwidth usage for Deneb.

@michaelsproul added the v6.0.0 label Oct 7, 2024
@jimmygchen (Member, Author)

It'd be great to have #6403 merged first, as there will be some more conflicts that need to be resolved, and it's a bit easier in this order. We also need to prioritise block publishing; we could do it in this PR.

# Conflicts:
#	beacon_node/beacon_chain/src/data_availability_checker.rs
#	beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
#	beacon_node/execution_layer/src/engine_api.rs
#	beacon_node/execution_layer/src/engine_api/json_structures.rs
#	beacon_node/network/src/network_beacon_processor/gossip_methods.rs
#	beacon_node/network/src/network_beacon_processor/mod.rs
#	beacon_node/network/src/network_beacon_processor/sync_methods.rs
@jimmygchen (Member, Author) commented Oct 17, 2024

Remaining TODOs

@jimmygchen added the ready-for-review label and removed the waiting-on-author label Oct 18, 2024
.await
}

async fn check_engine_blob_availability_and_import(
Review comment (Collaborator):

Should inline this function into process_engine_blobs?

// Store the block and its state, and execute the confirmation batch for the intermediate
// states, which will delete their temporary flags.
// If the write fails, revert fork choice to the version from disk, else we can
// end up with blocks in fork choice that are missing from disk.
// See https://github.com/sigp/lighthouse/issues/2028
let (_, signed_block, blobs, data_columns) = signed_block.deconstruct();
// TODO(das) we currently store all subnet sampled columns. Tracking issue to exclude non custody columns: https://github.com/sigp/lighthouse/issues/6465
Review comment (Collaborator):

Exceeds max fmt width

Comment on lines +253 to +254
/// This DOES NOT perform KZG verification because the KZG proofs should have been constructed
/// immediately prior to calling this function so they are assumed to be valid.
Review comment (Collaborator):

Why not change this function and process_engine_blobs to take KzgVerifiedBlobList and make it correct by construction?
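
A sketch of the correct-by-construction approach the comment suggests: make KZG verification the only way to obtain a KzgVerifiedBlobList, so a function taking that type can't be handed unverified blobs. Stub types; verify_kzg_proof stands in for the real batch verification:

```rust
struct Blob;
struct KzgError;

// The field is private, so the verified constructor is the only way in.
pub struct KzgVerifiedBlobList(Vec<Blob>);

impl KzgVerifiedBlobList {
    /// The only way to build the list: verify first.
    pub fn new(blobs: Vec<Blob>) -> Result<Self, KzgError> {
        for blob in &blobs {
            verify_kzg_proof(blob)?; // stub for the real KZG verification
        }
        Ok(Self(blobs))
    }
}

fn verify_kzg_proof(_blob: &Blob) -> Result<(), KzgError> {
    Ok(()) // stub
}

// Downstream: accepting the newtype makes verification a type-level guarantee.
fn process_engine_blobs(_blobs: KzgVerifiedBlobList) { /* ... */ }
```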

);
let all_blobs_received = block_kzg_commitments_count_opt
.map_or(false, |num_expected_blobs| {
num_expected_blobs == num_received_blobs
Review comment (Collaborator):

Is it worth switching to >= in case there's some future bug?

Suggested change:
- num_expected_blobs == num_received_blobs
+ num_received_blobs >= num_expected_blobs

let Some(verified_blobs) = verified_blobs
.into_iter()
.map(|b| b.map(|b| b.to_blob()))
.take(num_blobs_expected)
Review comment (Collaborator):

Why is it necessary to take here? Do we consider the case where we have too many blobs?

Comment on lines +243 to +246
let is_supernode = chain_cloned
.data_availability_checker
.get_sampling_column_count()
== chain_cloned.spec.number_of_columns;
Review comment (Collaborator):

Could add a function to DataAvailabilityChecker so the notion of supernode-ness isn't spread across multiple places:

fn is_supernode(&self) -> bool
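
A minimal sketch of that helper on a stub DataAvailabilityChecker, assuming the same count comparison as the call site above:

```rust
struct Spec {
    number_of_columns: usize,
}

struct DataAvailabilityChecker {
    sampling_column_count: usize,
    spec: Spec,
}

impl DataAvailabilityChecker {
    fn get_sampling_column_count(&self) -> usize {
        self.sampling_column_count
    }

    /// A node is a supernode iff it samples every data column.
    fn is_supernode(&self) -> bool {
        self.get_sampling_column_count() == self.spec.number_of_columns
    }
}
```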

log,
"Publishing data columns from EL";
"count" => publishable.len()
);
Review comment (Collaborator):

Here we can measure the efficiency of this delayed-publish mechanism: track which columns were originally scheduled for publishing but were seen on gossip during this for loop. We could track that as a counter such as "already_seen_before_publish", and print a summary after the loop: how many columns we intended to publish, how many we actually published, and how many we skipped.

Also, do we log anywhere which specific column indices we publish? If not, it might be helpful to log a vec of the indices we are publishing here. (A sketch follows.)
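
A sketch of that tracking with hypothetical names: already_seen_before_publish is the counter suggested above, and the summary line also logs the published indices:

```rust
struct Column {
    index: usize,
}

fn publish_with_summary(scheduled: Vec<Column>, seen_on_gossip: impl Fn(&Column) -> bool) {
    let intended = scheduled.len();
    let mut published = Vec::new();
    let mut already_seen_before_publish = 0;

    for column in scheduled {
        if seen_on_gossip(&column) {
            // The delay paid off: gossip delivered this column, skip publishing.
            already_seen_before_publish += 1;
        } else {
            published.push(column.index);
        }
    }

    // One summary after the loop: intended vs published vs skipped,
    // plus the specific indices that went out.
    println!(
        "column publish summary: intended={intended}, published={} {:?}, already_seen_before_publish={already_seen_before_publish}",
        published.len(),
        published
    );
}
```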

*blob_mut = Some(Arc::new(blob));
} else {
return Err(FetchEngineBlobError::InternalError(
"Unreachable: Blobs from EL - out of bounds".to_string(),
Review comment (Collaborator):

This is reachable: the EL can send malformed blobs with a random index. (A defensive sketch follows.)
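
A defensive sketch of that path, treating a bad index as an error in EL-supplied data rather than an internal invariant violation. Stub types; the real code assigns into a fixed-length blob list:

```rust
struct Blob {
    index: usize,
}

enum FetchEngineBlobError {
    BlobIndexOutOfBounds(usize),
}

fn insert_blob(slots: &mut Vec<Option<Blob>>, blob: Blob) -> Result<(), FetchEngineBlobError> {
    // `get_mut` makes the bounds check explicit instead of assuming the EL
    // response is well-formed.
    match slots.get_mut(blob.index) {
        Some(slot) => {
            *slot = Some(blob);
            Ok(())
        }
        None => Err(FetchEngineBlobError::BlobIndexOutOfBounds(blob.index)),
    }
}
```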

tokio::time::sleep(supernode_data_column_publication_batch_interval).await;
}
},
"handle_data_columns_publish",
Review comment (Collaborator):

Not a blocker to merging, but since reviews are slow maybe there's time to de-dup this bit

Comment on lines +89 to +104
.arg(
Arg::new("supernode-data-column-publication-batches")
.long("supernode-data-column-publication-batches")
.action(ArgAction::Set)
.help_heading(FLAG_HEADER)
.help("Number of batches that supernodes split data columns into during publishing by a non-proposer. For PeerDAS only.")
.display_order(0)
)
.arg(
Arg::new("supernode-data-column-publication-batch-interval")
.long("supernode-data-column-publication-batch-interval")
.action(ArgAction::Set)
.help_heading(FLAG_HEADER)
.help("The delay in milliseconds applied by supernodes between the sending of each data column batch. For PeerDAS only.")
.display_order(0)
)
Review comment (Collaborator):

Should we hide these?
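
If the intent is to keep these as internal tuning knobs, clap 4's Arg::hide accepts the flag but omits it from --help. A sketch on one of the two args (FLAG_HEADER omitted so the snippet is self-contained):

```rust
use clap::{Arg, ArgAction, Command};

fn cli() -> Command {
    Command::new("beacon_node").arg(
        Arg::new("supernode-data-column-publication-batches")
            .long("supernode-data-column-publication-batches")
            .action(ArgAction::Set)
            .hide(true) // still accepted, but not shown in --help output
            .display_order(0),
    )
}
```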
