
Topdown consistency checks: Validate number of top down messages #873

Open · wants to merge 4 commits into base: main
Conversation

Contributor
@cryptoAtwill cryptoAtwill commented Apr 10, 2024

This is the start of a series of updates to enhance topdown finality robustness and accuracy.

This PR adds a simple check to ensure the number of top down messages received is actually correct. The motivation is that, after running a few tests, we found some RPC nodes (especially when queried for historical data) may not return the complete set of events. This check detects that situation, throws an error, and stops the parent syncer so the node operator can investigate.

As the gateway already exposes the applied top down nonce per subnet, one only needs to query two consecutive historical states to deduce the expected number of top down messages.

Draft: updating unit tests.
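The nonce-diff check described above can be sketched roughly as follows. This is a minimal illustration, not the actual gateway/syncer API; all function and variable names are hypothetical:

```rust
// Hypothetical sketch: the expected topdown message count for a height is the
// difference between the applied nonces of two consecutive parent states.

/// Expected number of topdown messages between two consecutive heights,
/// derived from the applied nonces reported by the gateway for each state.
fn expected_msg_count(nonce_at_prev: u64, nonce_at_curr: u64) -> Result<u64, String> {
    nonce_at_curr
        .checked_sub(nonce_at_prev)
        .ok_or_else(|| format!("applied nonce went backwards: {nonce_at_prev} -> {nonce_at_curr}"))
}

/// Fail if the RPC returned a different number of messages than the nonce
/// diff implies, so the parent syncer can stop and the operator can check
/// the RPC node.
fn check_topdown_msgs(nonce_at_prev: u64, nonce_at_curr: u64, received: usize) -> Result<(), String> {
    let expected = expected_msg_count(nonce_at_prev, nonce_at_curr)?;
    if received as u64 != expected {
        return Err(format!("expected {expected} topdown messages, RPC returned {received}"));
    }
    Ok(())
}

fn main() {
    // Nonce moved from 10 to 13, so exactly 3 messages must have been emitted.
    assert!(check_topdown_msgs(10, 13, 3).is_ok());
    // An RPC node serving incomplete historical events returns too few.
    assert!(check_topdown_msgs(10, 13, 2).is_err());
    println!("ok");
}
```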

@cryptoAtwill cryptoAtwill marked this pull request as draft April 10, 2024 07:05
&self,
height: BlockHeight,
current_view: &ParentViewPayload,
parent_block_hash: &[u8],
Contributor

Isn't there an alias for block hash at least?

Contributor Author

updated

Contributor

@aakoshh aakoshh left a comment

This makes sense to me 👍

I was wondering about two things:

  1. Whether it's enough to do only this check, or whether we also have to check the configuration number. The answer: because the messages can be empty while the validator changes are not, we have to check both if we want to catch RPCs that don't serve data as early as possible.
  2. Whether it's enough to do this check in the syncer, or whether it should also happen while catching up with others, i.e. when the block is executed. The answer: as long as others did the checking before voting on finality, then if we skip the check during execution but have an RPC without history, we'll get a consensus failure on the next block, when the application hash difference is revealed. I'm not sure how easy it is to recover from that.

Random thoughts:

  1. Perhaps if someone is confident in the RPC, they can make this check optional in the settings and save some time by not executing extra queries.
  2. Perhaps the nonces could be added to the IPC finality the votes are about, but that would require major refactoring in the voting.

@cryptoAtwill
Contributor Author

@aakoshh For:

1. Perhaps if someone is confident in the RPC, they can make this check optional in the settings and save some time by not executing extra queries.

After the past few days, I would say we should just check this regardless; I have less trust in RPC nodes now. Even if an RPC node is not missing any events, it could have joined a partitioned network.

2. Perhaps the nonces could be added to the IPC finality the votes are about, but that would require major refactoring in the voting.

Yeah, I think this has come up a couple of times, mostly for bottom up checkpoints, but we still haven't seen a very strong case for the refactoring.

@cryptoAtwill cryptoAtwill marked this pull request as ready for review April 15, 2024 08:54
@@ -215,7 +221,7 @@ where

// Null block received, no block hash for the current height being polled.
// Return the previous parent hash as the non-null block hash.
return Ok(parent_block_hash);
return Ok((parent_block_hash, None));
Contributor

Why not return parent_block_nonce to not lose the previous context?

@@ -52,5 +54,6 @@ pub(crate) enum Commands {
PreRelease(PreReleaseArgs),
Propagate(PropagateArgs),
ListTopdownMsgs(ListTopdownMsgsArgs),
ListTopdownNonce(GetTopDownMsgNonceArgs),
Contributor

Should this be GetTopdownNonce?

Comment on lines +867 to +869
let nonce = if !exists {
log::warn!("subnet does not exists at block hash, return 0 nonce");
0
Contributor

Is there a particular reason you opted to return 0 with a warning in the logs, instead of returning None or an Err? I looked at the other cases in the file, and it looks like in most other cases an error is returned:

Because get_block_hash returns an Err, it means get_validator_changeset and get_topdown_msgs do so as well. You could argue that by the time this is called it will almost certainly not return an error, since it would have already failed on the previous two, but still, for consistency I think it should behave the same way.
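The suggestion could look something like this. A minimal sketch only; the function name and error type are hypothetical, not the actual API:

```rust
// Hypothetical: propagate an Err when the subnet does not exist at the
// queried block hash, instead of logging a warning and returning nonce 0,
// matching the behaviour of the other query methods in the file.
fn topdown_nonce(subnet_exists: bool, stored_nonce: u64) -> Result<u64, String> {
    if !subnet_exists {
        return Err("subnet does not exist at block hash".to_string());
    }
    Ok(stored_nonce)
}

fn main() {
    assert_eq!(topdown_nonce(true, 42), Ok(42));
    assert!(topdown_nonce(false, 42).is_err());
    println!("ok");
}
```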

@raulk
Contributor

raulk commented May 7, 2024

Recording here that @cryptoAtwill encountered an unexpected roadblock because of this: filecoin-project/lotus#11205. It turns out that eth_call in Lotus behaves non-deterministically depending on how the node operator configured this env variable: LOTUS_SKIP_APPLY_TS_MESSAGE_CALL_WITH_GAS.

On Glif nodes, it's configured to skip the execution of the requested block prior to applying the call, resulting in the call being executed on the prestate of the requested height instead of the poststate (which is the general expectation). Therefore our diff-based consistency check does not pass when importing heights that include events at the end boundary.

This PR attempts to solve this: filecoin-project/lotus#11905, but it seems to be in some form of limbo.

@aarshkshah1992

@raulk That PR needs a review from someone who knows the ETH RPC stack really well, but Steb is currently fully focused on F3. Would you, Mikers, or Fridrick have time to review it?
