Topdown consistency checks: Validate number of top down messages #873
base: main
Conversation
&self,
height: BlockHeight,
current_view: &ParentViewPayload,
parent_block_hash: &[u8],
Isn't there an alias for block hash at least?
updated
This makes sense to me 👍
I was wondering about two things:
- Whether it's enough to only do this check, or do we have to check the configuration number as well. The answer is that because messages can be empty where the validator changes are not, we have to check both if we want to catch RPCs that don't serve data as soon as possible.
- Whether it's enough to do this check in the syncer, or should it also happen while catching up with others, i.e. when the block is executed. The answer is that as long as others did the checking before voting on finality, then if we skip the check during execution but use an RPC without history, we'll get a consensus failure on the next block, when the application hash difference is revealed. Not sure how easy it is to recover from that.
Random thoughts:
- Perhaps if someone is confident in the RPC, they can make this check optional in the settings and save some time by not executing extra queries.
- Perhaps the nonces could be added to the IPC finality the votes are about, but that would require major refactoring in the voting.
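The message-count check discussed above can be sketched as a pure comparison between the messages the RPC returned and the difference of the applied nonces at two consecutive heights. The function name and error strings here are hypothetical, not the PR's actual code:

```rust
// Minimal sketch of the topdown message-count consistency check: the number
// of messages fetched for a height should equal the increase in the applied
// topdown nonce between the previous height and this one.
fn validate_topdown_msg_count(
    nonce_before: u64,
    nonce_after: u64,
    fetched_msgs: usize,
) -> Result<(), String> {
    // The applied nonce must never decrease between consecutive heights.
    let expected = nonce_after
        .checked_sub(nonce_before)
        .ok_or_else(|| "applied nonce decreased between heights".to_string())?
        as usize;
    if fetched_msgs != expected {
        return Err(format!(
            "expected {expected} topdown messages, RPC returned {fetched_msgs}"
        ));
    }
    Ok(())
}
```

A mismatch here is exactly the "RPC does not serve complete historical events" situation the PR tries to surface early.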
@aakoshh Regarding:
After the past few days, I would say we just do this check regardless; I have less trust in RPC nodes now. Even if the RPC is not missing any events, it could have joined a partitioned network.
Yeah, I think this came up a couple of times, but mostly in bottom up checkpointing. We still haven't seen a very strong case for the refactoring.
@@ -215,7 +221,7 @@ where
 // Null block received, no block hash for the current height being polled.
 // Return the previous parent hash as the non-null block hash.
-return Ok(parent_block_hash);
+return Ok((parent_block_hash, None));
Why not return parent_block_nonce to not lose the previous context?
@@ -52,5 +54,6 @@ pub(crate) enum Commands {
 PreRelease(PreReleaseArgs),
 Propagate(PropagateArgs),
 ListTopdownMsgs(ListTopdownMsgsArgs),
+ListTopdownNonce(GetTopDownMsgNonceArgs),
Should this be GetTopdownNonce?
let nonce = if !exists {
    log::warn!("subnet does not exists at block hash, return 0 nonce");
    0
Is there a particular reason why you opt to return 0 with a warning in the logs, instead of returning None or an Err? I looked at other cases in the file, and it looks like in most other cases an error is returned.
Because get_block_hash returns an Err, it means get_validator_changeset and get_topdown_msgs do so as well. You can argue that by the time you call this it will most certainly not return an error, since it would have already failed on the previous two, but still, for consistency I think it should behave the same way.
Recording here that @cryptoAtwill encountered an unexpected roadblock because of this: filecoin-project/lotus#11205. It turns out that Glif nodes are configured to skip the execution of the requested block prior to applying the call, resulting in the call being executed on the prestate of the requested height instead of the poststate (which is the general expectation). Therefore our diff-based consistency check does not pass when importing heights that include events at the end boundary. This PR attempts to solve this: filecoin-project/lotus#11905, but it seems to be in some form of limbo.
@raulk That PR needs a review from someone who knows the ETH RPC stack really well, but Steb is currently fully focused on F3. Would you or Mikers or Fridrick have the time to review it?
This is the start of a series of updates to enhance topdown finality robustness and accuracy.
This PR adds a simple check to ensure the number of top down messages received is actually correct. The motivation is that, after running a few tests, we found that some RPC nodes, especially when queried for historical data, might not return events completely. Adding this check will detect when that happens and throw an error/stop the parent syncer so that the node operator can investigate.
As the gateway already exposes the applied top down nonce by subnet, one just needs to query the consecutive historical state to deduce the number of top down messages.
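The deduction described above can be illustrated with a short sketch. query_applied_nonce is a hypothetical stand-in for the gateway query that returns the applied topdown nonce at a given parent height; the real code issues two historical state queries at consecutive heights:

```rust
// Sketch of deducing the expected topdown message count for a height from
// two consecutive applied-nonce queries against the gateway. The closure
// parameter is a placeholder for the actual gateway RPC call.
fn expected_msg_count(query_applied_nonce: impl Fn(u64) -> u64, height: u64) -> u64 {
    let before = query_applied_nonce(height - 1);
    let after = query_applied_nonce(height);
    // The nonce increase between consecutive heights is exactly the number
    // of topdown messages applied at `height`.
    after - before
}
```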
Draft: updating unit tests.