Improving blob propagation post-PeerDAS with Decentralized Blob Building #6268
base: unstable
Conversation
Some early testing results:

Bandwidth-limited fullnode (

This shows EL inbound traffic (fetching blobs from peers) isn't too bad for MAX 6 blobs.

Next steps:
This was originally intended to be experimental, but it's been pretty stable: 800 epochs on devnet with 100% participation, and I think we can merge this.
Force-pushed from dbf4bb8 to 01b91cb
Bring the attack to prod
@jimmygchen Any issues with the kzg libraries?
🙈 Trying not to mention the flag I use for testing, luckily. Just realized we need the
Nope, all good so far! I was just flagging the potential challenges I could imagine when we increase the blob count, but I haven't actually run into any issues yet. Will keep you updated, thanks!
Resolved review threads (outdated):
beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs (3 threads)
beacon_node/network/src/network_beacon_processor/gossip_methods.rs (1 thread)
I think we should review blob publishing before merging this to avoid adding excessive bandwidth usage for Deneb.
It'd be great to have #6403 merged first, as there will be some more conflicts that need to be resolved, and it's a bit easier in this order. We also need to prioritise block publishing; we could do it in this PR.
# Conflicts:
#   beacon_node/beacon_chain/src/data_availability_checker.rs
#   beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
#   beacon_node/execution_layer/src/engine_api.rs
#   beacon_node/execution_layer/src/engine_api/json_structures.rs
#   beacon_node/network/src/network_beacon_processor/gossip_methods.rs
#   beacon_node/network/src/network_beacon_processor/mod.rs
#   beacon_node/network/src/network_beacon_processor/sync_methods.rs
Remaining TODOs
Force-pushed from 63427e9 to 0a2c6f7
    .await
}

async fn check_engine_blob_availability_and_import(
Should inline this function into process_engine_blobs?
// Store the block and its state, and execute the confirmation batch for the intermediate
// states, which will delete their temporary flags.
// If the write fails, revert fork choice to the version from disk, else we can
// end up with blocks in fork choice that are missing from disk.
// See https://github.com/sigp/lighthouse/issues/2028
let (_, signed_block, blobs, data_columns) = signed_block.deconstruct();
// TODO(das) we currently store all subnet sampled columns. Tracking issue to exclude non custody columns: https://github.com/sigp/lighthouse/issues/6465
Exceeds max fmt width
/// This DOES NOT perform KZG verification because the KZG proofs should have been constructed
/// immediately prior to calling this function so they are assumed to be valid.
Why not change this function and process_engine_blobs to take KzgVerifiedBlobList and make it correct by construction?
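For illustration, a minimal sketch of the correct-by-construction pattern being suggested here, with simplified stand-in types (Lighthouse's real KzgVerifiedBlobList carries more context than this):

// Sketch: the only way to obtain a KzgVerifiedBlobList is through a
// verifying constructor, so any function taking one can safely assume the
// proofs are valid. All types below are simplified stand-ins.
struct Blob;
struct Kzg;
struct KzgError;

pub struct KzgVerifiedBlobList(Vec<Blob>); // field private: no unchecked construction

impl KzgVerifiedBlobList {
    pub fn new(blobs: Vec<Blob>, kzg: &Kzg) -> Result<Self, KzgError> {
        // Verify every blob's KZG proof before wrapping.
        verify_blob_kzg_proofs(&blobs, kzg)?;
        Ok(Self(blobs))
    }
}

fn verify_blob_kzg_proofs(_blobs: &[Blob], _kzg: &Kzg) -> Result<(), KzgError> {
    Ok(()) // placeholder; a real implementation checks every proof
}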
);
let all_blobs_received = block_kzg_commitments_count_opt
    .map_or(false, |num_expected_blobs| {
        num_expected_blobs == num_received_blobs
Is it worth switching to >= in case there's some future bug?
Suggested change:
- num_expected_blobs == num_received_blobs
+ num_received_blobs >= num_expected_blobs
let Some(verified_blobs) = verified_blobs
    .into_iter()
    .map(|b| b.map(|b| b.to_blob()))
    .take(num_blobs_expected)
Why is it necessary to take here? Do we consider the case where we have too many blobs?
let is_supernode = chain_cloned
    .data_availability_checker
    .get_sampling_column_count()
    == chain_cloned.spec.number_of_columns;
Could add a function to DataAvailabilityChecker to not spread the notion of supernode-ness in multiple places: fn is_supernode(&self) -> bool
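A minimal sketch of that helper, assuming the checker holds the chain spec (as the snippet above suggests):

impl DataAvailabilityChecker {
    /// True when this node samples every data column, i.e. it is a supernode.
    pub fn is_supernode(&self) -> bool {
        self.get_sampling_column_count() == self.spec.number_of_columns
    }
}

The call site above then reduces to chain_cloned.data_availability_checker.is_supernode().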
log,
"Publishing data columns from EL";
"count" => publishable.len()
);
Here we can measure the efficiency of this delayed-publish mechanism: track which columns were originally scheduled for publishing but had already been seen by the time this for loop ran. We could track that as a counter named "already_seen_before_publish" or similar, and maybe print a summary after the for loop: how many we intended to publish at the start, how many we actually published, and how many we skipped.

Also, do we log anywhere which specific column indices we publish? If not, it might be helpful to log here a vec of the indices of the columns we are publishing.
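A sketch of that accounting; `publishable`, `seen_on_gossip`, and `publish` stand in for the surrounding code, and the counter/log names are hypothetical:

let intended = publishable.len();
let mut published_indices = Vec::new();
let mut already_seen_before_publish = 0usize;

for column in publishable {
    // If gossip delivered this column during the delay, skip publishing it:
    // each skip is exactly the bandwidth the delay mechanism saved.
    if seen_on_gossip(&column) {
        already_seen_before_publish += 1;
        continue;
    }
    published_indices.push(column.index);
    publish(column);
}

debug!(
    log,
    "Published data columns from EL";
    "intended" => intended,
    "published" => published_indices.len(),
    "already_seen_before_publish" => already_seen_before_publish,
    "indices" => ?published_indices,
);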
*blob_mut = Some(Arc::new(blob));
} else {
    return Err(FetchEngineBlobError::InternalError(
        "Unreachable: Blobs from EL - out of bounds".to_string(),
This is reachable; the EL can send malformed blobs with a random index.
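Given that, one option is to validate the index and return a dedicated error rather than an internal one. A sketch, where `blobs` stands in for the sidecar list being filled above and the BlobIndexOutOfBounds variant is hypothetical:

// Treat an out-of-range index as a bad EL response rather than "unreachable".
let Some(blob_mut) = blobs.get_mut(blob.index as usize) else {
    return Err(FetchEngineBlobError::BlobIndexOutOfBounds {
        index: blob.index,
        max: blobs.len(),
    });
};
*blob_mut = Some(Arc::new(blob));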
        tokio::time::sleep(supernode_data_column_publication_batch_interval).await;
    }
},
"handle_data_columns_publish",
Not a blocker to merging, but since reviews are slow, maybe there's time to de-dup this bit.
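One way to de-dup it is a small shared helper that both publish paths call to send items in batches with a fixed interval between them. A sketch; the helper's name and shape are hypothetical:

use std::time::Duration;

// Publish `items` split into `batches` chunks, sleeping `interval` between
// chunks, mirroring the loop above.
async fn publish_in_batches<T>(
    items: Vec<T>,
    batches: usize,
    interval: Duration,
    mut publish: impl FnMut(Vec<T>),
) {
    let batch_size = items.len().div_ceil(batches.max(1)).max(1);
    let mut iter = items.into_iter().peekable();
    while iter.peek().is_some() {
        let batch: Vec<T> = iter.by_ref().take(batch_size).collect();
        publish(batch);
        tokio::time::sleep(interval).await;
    }
}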
.arg(
    Arg::new("supernode-data-column-publication-batches")
        .long("supernode-data-column-publication-batches")
        .action(ArgAction::Set)
        .help_heading(FLAG_HEADER)
        .help("Number of batches that supernodes split data columns into during publishing by a non-proposer. For PeerDAS only.")
        .display_order(0)
)
.arg(
    Arg::new("supernode-data-column-publication-batch-interval")
        .long("supernode-data-column-publication-batch-interval")
        .action(ArgAction::Set)
        .help_heading(FLAG_HEADER)
        .help("The delay in milliseconds applied by supernodes between the sending of each data column batch. For PeerDAS only.")
        .display_order(0)
)
Should we hide these?
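If we do hide them, clap supports this directly via Arg::hide, which removes the flag from --help output while keeping it functional, e.g. for the first flag above:

Arg::new("supernode-data-column-publication-batches")
    .long("supernode-data-column-publication-batches")
    .action(ArgAction::Set)
    .help_heading(FLAG_HEADER)
    .help("Number of batches that supernodes split data columns into during publishing by a non-proposer. For PeerDAS only.")
    .hide(true) // hidden from --help, still accepted on the command line
    .display_order(0)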
Issue Addressed
Built on top of #5829, an optimization proposed by @michaelsproul to fetch blobs from the EL and reduce the delay to block import.
This PR goes further and publishes the blobs to the network, which helps improve the resiliency of the network by having nodes with more resources contribute to blob propagation. This experimental solution is an attempt to solve the self-building proposer bandwidth issue discussed on R&D Discord and described in @dankrad's post here.
The benefits of this proposal are:
Proposed Changes
fetch_blobs_and_publish is triggered after a node has processed a gossip/RPC block and is still missing blob components. Once the node fetches the blobs from the EL, it then publishes the remaining blobs that haven't been seen on gossip to the network (see the sketch below).
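As a high-level illustration of that flow (all types and helpers below are illustrative stand-ins, not Lighthouse's actual API):

// Sketch of the decentralized blob building flow described above.
struct BlobId;
struct Blob;

trait ChainOps {
    // Blob commitments in the block that we haven't seen on gossip yet.
    fn missing_blob_ids(&self) -> Vec<BlobId>;
    // Fetch blobs from the local EL's pool (engine_getBlobsV1).
    fn fetch_from_el(&self, ids: Vec<BlobId>) -> Vec<Option<Blob>>;
    fn seen_on_gossip(&self, blob: &Blob) -> bool;
    fn publish(&self, blob: &Blob);
    fn import_for_availability(&self, blob: Blob);
}

fn fetch_blobs_and_publish(chain: &impl ChainOps) {
    let missing = chain.missing_blob_ids();
    if missing.is_empty() {
        return;
    }
    // The EL returns None for blobs not in its pool.
    for blob in chain.fetch_from_el(missing).into_iter().flatten() {
        // Publish only what gossip hasn't already delivered, so well-resourced
        // nodes contribute to propagation without duplicating traffic.
        if !chain.seen_on_gossip(&blob) {
            chain.publish(&blob);
        }
        chain.import_for_availability(blob);
    }
}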
Next steps:

Challenges:
- The KZG libraries (c-kzg-4844 and rust-eth-kzg) may struggle with constructing a large number of cells and proofs at once due to the current memory allocation approach.
- On the EL side, blobs are pulled from peers via request/response (eth/68), hence this doesn't incur the same gossip amplification cost (8x) on the CL.
TODO before merging
- engine_getBlobsV1: ethereum/consensus-specs#3864
- engine_getBlobsV1: ethereum/execution-apis#559

Reference:
Get blobs from the EL's blob pool #5829