
refactor: use batch confirmation that was upstreamed to eigenda client #192

Merged

Conversation

samlaf (Collaborator) commented Oct 25, 2024

This PR depends on Layr-Labs/eigenda#821 being merged first.

The main simplification this allows is getting rid of the very complex retry logic we had, where an inner function returned an error that was caught by an outer for loop far removed from it, which was hard to follow. This PR makes that logic more local (the batch confirmation check now lives in the eigenda client), but I opted to still keep the check in eigenda-proxy as an assert/safety guarantee. I think we also need it in the Get route... but I'm not super familiar with that code path, so cc @epociask to make sure the logic in this PR still works there.
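To illustrate the shape of the proxy-side safety assert being kept here, a minimal sketch, assuming a hypothetical serviceManager interface and function name (neither is the actual eigenda-proxy API): the eigenda client has already waited for the batch to be confirmed, so the proxy only needs a single onchain lookup and hash comparison instead of the old outer retry loop.

```go
package verify

import (
	"context"
	"fmt"
)

// serviceManager is a hypothetical stand-in for the EigenDAServiceManager
// contract binding; only the one view call this check needs is modeled.
type serviceManager interface {
	BatchIdToBatchMetadataHash(ctx context.Context, batchID uint32) ([32]byte, error)
}

// assertBatchConfirmed re-checks the cert's batch metadata hash against the
// chain. The eigenda client has already waited for confirmation, so this is
// a local safety assert rather than a retry loop.
func assertBatchConfirmed(ctx context.Context, mgr serviceManager, batchID uint32, certHash [32]byte) error {
	onchainHash, err := mgr.BatchIdToBatchMetadataHash(ctx, batchID)
	if err != nil {
		return fmt.Errorf("querying batch metadata hash for batch %d: %w", batchID, err)
	}
	if onchainHash != certHash {
		return fmt.Errorf("batch %d metadata hash mismatch: onchain %x, cert %x", batchID, onchainHash[:], certHash[:])
	}
	return nil
}
```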

Fixes Issue

Fixes #

Changes proposed

Screenshots (Optional)

Note to reviewers

@samlaf samlaf requested review from bxue-l2 and epociask October 25, 2024 11:27
go.mod Outdated (resolved)
@samlaf samlaf force-pushed the samlaf/refactor--upstream-cert-verification-to-eigenda-client branch from e76a187 to 5b74b93 Compare October 25, 2024 12:19
verify/cert.go Outdated
Comment on lines 59 to 67
```diff
 // 1. verify batch is actually onchain at the batchMetadata's state confirmedBlockNumber
 // This assert is technically not necessary, but it's a good sanity check.
 // It could technically happen that a confirmed batch's block gets reorged out,
 // yet the tx is included in an earlier or later block, making this check fail...
 confirmationBlockNumber := batchMetadata.GetConfirmationBlockNumber()
 confirmationBlockNumberBigInt := big.NewInt(0).SetInt64(int64(confirmationBlockNumber))
 _, err := cv.retrieveBatchMetadataHash(ctx, batchID, confirmationBlockNumberBigInt)
 if err != nil {
-	return fmt.Errorf("failed to get context block: %w", err)
+	return fmt.Errorf("batch not found onchain at supposedly confirmed block %d: %w", confirmationBlockNumber, err)
```
samlaf (Collaborator Author):

Is this true? If so I think we should get rid of this check but document exactly why we are getting rid of it. Would have saved me a lot of time if this had been documented in the initial code.

bxue-l2 (Collaborator):

Need to review the other PR, but this function seems useful to determine whether a failure is EigenDA's fault or an ETH re-org.

samlaf (Collaborator Author):

Interesting. But I'm not even sure, because the case I'm describing is this (assumption: we wait for a 5-block confirmation depth):

1. The batcher confirms the batch onchain at block 1000.
2. The eigenda-client waits for 5 blocks and returns a BlobStatus.
3. The proxy receives this one block later, at block 1006, but the batch has actually been reorged and included in block 1001 instead.
4. In this case the check highlighted here would fail, even though the batch is still confirmed with depth 5 (the next assert, below around line 70, would catch it if that were not the case).

For this reason I think this check is actually wrong here, and I still think we should remove it. Does this make sense?
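For illustration, the depth-based assert referenced in point 4 only depends on how deep the inclusion block is, not on which exact block the confirmation tx landed in, so it survives the reorg scenario above. A minimal sketch with hypothetical names, assuming the required depth of 5 is passed in as depth:

```go
// confirmedAtDepth reports whether the batch's inclusion block is at least
// `depth` blocks behind the current head. A reorg that moves the confirmation
// tx from block 1000 to 1001 does not make this fail, unlike the exact-block
// lookup in the hunk above.
func confirmedAtDepth(currentBlock, inclusionBlock, depth uint64) bool {
	if inclusionBlock > currentBlock {
		// our node has not even seen the inclusion block yet
		return false
	}
	return currentBlock-inclusionBlock >= depth
}
```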

bxue-l2 (Collaborator):

The EigenDA batcher detects reorgs and updates the confirmation block number when one happens: https://github.com/Layr-Labs/eigenda/blob/master/disperser/batcher/finalizer.go#L198

bxue-l2 previously approved these changes Oct 25, 2024
@bxue-l2 bxue-l2 dismissed their stale review October 25, 2024 17:57

wait for ethen's review

flags/eigendaflags/cli.go Outdated (resolved)
verify/cert.go Outdated (resolved)
verify/cert.go Outdated
```go
expectedHash, err := cv.manager.BatchIdToBatchMetadataHash(&bind.CallOpts{BlockNumber: blockNumber}, id)
// 2. verify that the confirmation status has been reached
// We retry 5 times waiting 12 seconds (block time) between each retry,
// in case our eth node is behind that of the eigenda_client's node that deemed the batch confirmed
```
Collaborator:

Why would we ever use two nodes? Or is the idea to have some resiliency to transient failures, where the provider could be unreliable due to poor configuration?

samlaf (Collaborator Author):

Both the eigenda-client and the proxy dial the endpoint individually, establishing independent connections. If the nodes are behind a load balancer, those two connections could end up on different eth nodes, and it's very common for RPC provider nodes to be out of sync with each other (this bit me very hard once), for some dumb reason (web3 standards...).

Even if we injected the dialed connection from the proxy into the eigenda-client, the underlying TCP connection could drop between the eigenda-client returning and the proxy making its call, which would reconnect to another node, with the same effect.

Collaborator:

Makes sense, and I was thinking that too. I got burnt in the past using a node cluster behind an ALB ingress, where subsequent RPC queries could fail due to nodes being out of sync. I know OP clusters typically use proxyd to mitigate this. From what I remember, the connection is held with the ALB directly, with different nodes being queried for each new request on the client --> ALB connection. Feel free to resolve this comment!

Collaborator:

Actually, on second thought: isn't this a bit naive when we have access to the L1ReferenceBlock number at which the batch should've confirmed? Also, we're adding a 12-second cool-off, which is aligned with the block production interval on Ethereum but not with how far behind an out-of-sync node might be.

samlaf (Collaborator Author):

proxyd doesn't seem to solve this problem; they say: "we consider healthy is... latest block lag ≤ configurable threshold". The way to solve this is to use sticky sessions.

Not sure I understand your last comment. Is #192 (comment) related to what you are talking about?

samlaf (Collaborator Author):

Talked offline. Let's update to retrying every 2 seconds for the same amount of time (60 seconds).
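A rough sketch of the agreed-upon retry shape, with a hypothetical helper name (the real logic lives in verify/cert.go and may be structured differently): poll every 2 seconds for up to 60 seconds, to tolerate the proxy's eth node lagging behind the node the eigenda client used.

```go
package verify

import (
	"context"
	"fmt"
	"time"
)

// pollUntilBatchOnChain retries the onchain lookup every 2s for up to 60s,
// in case the proxy's eth node is behind the node the eigenda client used
// when it deemed the batch confirmed.
func pollUntilBatchOnChain(ctx context.Context, lookup func(context.Context) error) error {
	const (
		interval = 2 * time.Second
		timeout  = 60 * time.Second
	)
	deadline := time.Now().Add(timeout)
	var lastErr error
	for time.Now().Before(deadline) {
		if lastErr = lookup(ctx); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return fmt.Errorf("batch still not visible onchain after %s: %w", timeout, lastErr)
}
```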

verify/cert.go (resolved)
verify/cert.go (resolved)
verify/cert.go (resolved)
@samlaf samlaf requested review from epociask and bxue-l2 October 27, 2024 18:30
samlaf (Collaborator Author) commented Oct 27, 2024

@epociask I'll let you review my answers before rebasing and fixing conflicts, so that the commit links still work.

flags/eigendaflags/cli.go Outdated (resolved)
flags/eigendaflags/cli.go Outdated (resolved)
verify/cert.go (resolved)
…tion depth guarantee

flags: add new eigenda-client flags for blob confirmation depth

also pass those flags to verifier (until we also upstream verification to the eda client)

comment: was pointing to wrong eigenda-client repo in TODO comment

fix: go.mod to point to PR commit instead of using local replace directive

chore: go mod tidy to generate go.sum

chore: use proto Getter functions instead of fields (that are potentially nil)

ci: upgrade golangci-lint version 1.60->1.61

fix: verifySecurityParams func arguments after rebase

chore: make more robust verifyBatchConfirmedOnchain logic

Added retry logic and better comments

style: Onchain -> OnChain

docs: better comment describing eth_getBlockByNumber call args

style: better error msg when memstore enabled but cert verification is not

fix: verifier.WaitForFinalization was not set

fix(flags): deleted deprecated flags that had same name as new ones in other package, causing panic

style(flags): merged WaitForFinalizationFlagName into ConfirmationDepth flag

It now accepts a uint conf depth or the 'finalized' string (a parsing sketch follows this commit list)

chore: remove unused utils.EqualBytes function (same as stdlib exp function anyways)

chore: remove log line added for debugging
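For the merged ConfirmationDepth flag mentioned a few commits above, a minimal parsing sketch with a hypothetical function name (not the actual flags/eigendaflags code): the value is either a plain uint confirmation depth or the literal string 'finalized'.

```go
package eigendaflags

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConfirmationDepth interprets the merged flag value: either a plain
// uint confirmation depth, or the string "finalized" to wait for finality.
func parseConfirmationDepth(v string) (uint64, bool, error) {
	if strings.EqualFold(v, "finalized") {
		return 0, true, nil
	}
	depth, err := strconv.ParseUint(v, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("confirmation depth must be a uint or 'finalized', got %q: %w", v, err)
	}
	return depth, false, nil
}
```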
@samlaf samlaf force-pushed the samlaf/refactor--upstream-cert-verification-to-eigenda-client branch from de950b2 to ba94b43 Compare October 28, 2024 15:46
@samlaf samlaf changed the title refactor: to use cert verification that was upstreamed to eigenda client refactor: use batch confirmation that was upstreamed to eigenda client Oct 28, 2024
@samlaf samlaf requested a review from epociask October 28, 2024 23:54
epociask (Collaborator) left a comment:

LGTM

@samlaf samlaf merged commit c6ab8fb into main Oct 29, 2024
7 checks passed
@samlaf samlaf deleted the samlaf/refactor--upstream-cert-verification-to-eigenda-client branch October 29, 2024 15:27
Labels: None yet
Projects: None yet
3 participants