
Topdown consistency checks: Validate number of top down messages #873

Open · wants to merge 4 commits into base: main
Conversation

Contributor
@cryptoAtwill cryptoAtwill commented Apr 10, 2024

This is the start of a series of updates to enhance topdown finality robustness and accuracy.

This PR adds a simple check to ensure the number of top down messages received is actually correct. The motivation is that, after running a few tests, we found some RPC nodes (especially when queried for historical data) may not return the complete set of events. This check detects that situation, throws an error, and stops the parent syncer so the node operator can investigate.

As the gateway already exposes the applied top down nonce per subnet, one only needs to query two consecutive historical states to deduce the expected number of top down messages.

Draft: updating unit tests.
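The nonce-diff check described above can be sketched roughly as follows. This is a minimal illustration, not the actual gateway/syncer API; all function and variable names are hypothetical:

```rust
// Hypothetical sketch: the expected topdown message count for a height is the
// difference between the applied nonces of two consecutive parent states.

/// Expected number of topdown messages between two consecutive heights,
/// derived from the applied nonces reported by the gateway for each state.
fn expected_msg_count(nonce_at_prev: u64, nonce_at_curr: u64) -> Result<u64, String> {
    nonce_at_curr
        .checked_sub(nonce_at_prev)
        .ok_or_else(|| format!("applied nonce went backwards: {nonce_at_prev} -> {nonce_at_curr}"))
}

/// Fail if the RPC returned a different number of messages than the nonce
/// diff implies, so the parent syncer can stop and the operator can check
/// the RPC node.
fn check_topdown_msgs(nonce_at_prev: u64, nonce_at_curr: u64, received: usize) -> Result<(), String> {
    let expected = expected_msg_count(nonce_at_prev, nonce_at_curr)?;
    if received as u64 != expected {
        return Err(format!("expected {expected} topdown messages, RPC returned {received}"));
    }
    Ok(())
}

fn main() {
    // Nonce moved from 10 to 13, so exactly 3 messages must have been emitted.
    assert!(check_topdown_msgs(10, 13, 3).is_ok());
    // An RPC node serving incomplete historical events returns too few.
    assert!(check_topdown_msgs(10, 13, 2).is_err());
    println!("ok");
}
```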

@cryptoAtwill cryptoAtwill marked this pull request as draft April 10, 2024 07:05
&self,
height: BlockHeight,
current_view: &ParentViewPayload,
parent_block_hash: &[u8],
Contributor

Isn't there an alias for block hash at least?

Contributor Author

updated

Contributor

@aakoshh aakoshh left a comment

This makes sense to me 👍

I was wondering about two things:

  1. Whether it's enough to do only this check, or whether we also have to check the configuration number. The answer: because the messages can be empty while the validator changes are not, we have to check both if we want to catch RPCs that don't serve data as early as possible.
  2. Whether it's enough to do this check in the syncer, or whether it should also happen while catching up with others, i.e. when the block is executed. The answer: as long as others did the checking before voting on finality, then if we skip the check during execution but have an RPC without history, we'll get a consensus failure on the next block, when the application hash difference is revealed. I'm not sure how easy it is to recover from that.

Random thoughts:

  1. Perhaps if someone is confident in the RPC, they can make this check optional in the settings and save some time by not executing extra queries.
  2. Perhaps the nonces could be added to the IPC finality the votes are about, but that would require major refactoring in the voting.

@cryptoAtwill
Contributor Author

@aakoshh For:

1. Perhaps if someone is confident in the RPC, they can make this check optional in the settings and save some time by not executing extra queries.

After the past few days, I would say we should just check this regardless; I have less trust in RPC nodes now. Even if an RPC node is not missing any events, it could have joined a partitioned network.

2. Perhaps the nonces could be added to the IPC finality the votes are about, but that would require major refactoring in the voting.

Yeah, I think this has come up a couple of times, mostly for bottom up checkpoints, but we still haven't seen a very strong case for the refactoring.

@cryptoAtwill cryptoAtwill marked this pull request as ready for review April 15, 2024 08:54
@@ -215,7 +221,7 @@ where

// Null block received, no block hash for the current height being polled.
// Return the previous parent hash as the non-null block hash.
return Ok(parent_block_hash);
return Ok((parent_block_hash, None));
Contributor

Why not return parent_block_nonce to not lose the previous context?

@@ -52,5 +54,6 @@ pub(crate) enum Commands {
PreRelease(PreReleaseArgs),
Propagate(PropagateArgs),
ListTopdownMsgs(ListTopdownMsgsArgs),
ListTopdownNonce(GetTopDownMsgNonceArgs),
Contributor

Should this be GetTopdownNonce?

Comment on lines +867 to +869
let nonce = if !exists {
log::warn!("subnet does not exists at block hash, return 0 nonce");
0
Contributor

Is there a particular reason you opted to return 0 with a warning in the logs, instead of returning None or an Err? I looked at the other cases in the file, and it looks like in most other cases an error is returned:

Because get_block_hash returns an Err, it means get_validator_changeset and get_topdown_msgs do so as well. You could argue that by the time this is called it will almost certainly not return an error, since it would have already failed on the previous two, but still, for consistency I think it should behave the same way.
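The suggestion could look something like this. A minimal sketch only; the function name and error type are hypothetical, not the actual API:

```rust
// Hypothetical: propagate an Err when the subnet does not exist at the
// queried block hash, instead of logging a warning and returning nonce 0,
// matching the behaviour of the other query methods in the file.
fn topdown_nonce(subnet_exists: bool, stored_nonce: u64) -> Result<u64, String> {
    if !subnet_exists {
        return Err("subnet does not exist at block hash".to_string());
    }
    Ok(stored_nonce)
}

fn main() {
    assert_eq!(topdown_nonce(true, 42), Ok(42));
    assert!(topdown_nonce(false, 42).is_err());
    println!("ok");
}
```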

@raulk
Contributor

raulk commented May 7, 2024

Recording here that @cryptoAtwill encountered an unexpected roadblock because of this: filecoin-project/lotus#11205. It turns out that eth_call in Lotus behaves non-deterministically depending on how the node operator configured this env variable: LOTUS_SKIP_APPLY_TS_MESSAGE_CALL_WITH_GAS.

On Glif nodes, it's configured to skip the execution of the requested block prior to applying the call, resulting in the call being executed on the prestate of the requested height instead of the poststate (which is the general expectation). Therefore our diff-based consistency check does not pass when importing heights that include events at the end boundary.

This PR attempts to solve this: filecoin-project/lotus#11905, but it seems to be in some form of limbo.

@aarshkshah1992

@raulk That PR needs a review from someone who knows the ETH RPC stack really well, but Steb is currently fully focused on F3. Would you, Mikers, or Fridrick have time to review it?
