refactor: use batch confirmation that was upstreamed to eigenda client #192
Conversation
Force-pushed from e76a187 to 5b74b93.
verify/cert.go
Outdated
// 1. verify batch is actually onchain at the batchMetadata's state confirmedBlockNumber
// This assert is technically not necessary, but it's a good sanity check.
// It could technically happen that a confirmed batch's block gets reorged out,
// yet the tx is included in an earlier or later block, making this check fail...
confirmationBlockNumber := batchMetadata.GetConfirmationBlockNumber()
confirmationBlockNumberBigInt := big.NewInt(0).SetInt64(int64(confirmationBlockNumber))
_, err := cv.retrieveBatchMetadataHash(ctx, batchID, confirmationBlockNumberBigInt)
if err != nil {
-	return fmt.Errorf("failed to get context block: %w", err)
+	return fmt.Errorf("batch not found onchain at supposedly confirmed block %d: %w", confirmationBlockNumber, err)
Is this true? If so, I think we should get rid of this check, but document exactly why we are getting rid of it. It would have saved me a lot of time if this had been documented in the initial code.
I need to review the other PR, but this function seems useful for determining whether a failure is EigenDA's fault or an ETH re-org.
Interesting. But I'm not so sure, because the case I'm describing is this:
0. (assumption: we wait for a 5-block confirmation depth)
- the batcher confirms the batch onchain at block 1000
- the eigenda-client waits for 5 blocks and returns a BlobStatus
- the proxy receives it one block later, at block 1006, but the batch has actually been reorged and included in block 1001 instead
- in this case the check highlighted here would fail, even though the batch is actually confirmed at depth 5 (the next assert below, around line 70, would catch it if it were not)
For this reason I think this check is actually wrong here, and I still think we should remove it. Does this make sense?
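To make the scenario concrete, here is a minimal sketch of what a depth-based check could look like instead of an exact-block check. This is illustration only, not the proxy's actual code; the function name, signature, and the `lookup` callback are assumptions:

```go
package verify

import (
	"context"
	"fmt"
	"math/big"

	"github.com/ethereum/go-ethereum/ethclient"
)

// verifyConfirmedAtDepth checks that the batch metadata is visible onchain at
// (head - depth), rather than at the exact ConfirmationBlockNumber reported by
// the eigenda-client. A reorg that shifts the confirmation tx from block 1000
// to 1001 (as in the example above) does not break this check, as long as the
// batch is still confirmed with the required depth.
func verifyConfirmedAtDepth(
	ctx context.Context,
	client *ethclient.Client,
	depth uint64,
	// lookup queries the batch metadata hash (e.g. via BatchIdToBatchMetadataHash)
	// at the given block number.
	lookup func(ctx context.Context, blockNumber *big.Int) ([32]byte, error),
) error {
	header, err := client.HeaderByNumber(ctx, nil) // nil = latest head
	if err != nil {
		return fmt.Errorf("failed to fetch latest header: %w", err)
	}
	queryBlock := new(big.Int).Sub(header.Number, new(big.Int).SetUint64(depth))
	if queryBlock.Sign() < 0 {
		queryBlock = big.NewInt(0)
	}
	if _, err := lookup(ctx, queryBlock); err != nil {
		return fmt.Errorf("batch not found onchain at depth %d (block %s): %w", depth, queryBlock, err)
	}
	return nil
}
```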
The EigenDA batcher detects reorgs and updates the confirmation number accordingly: https://github.com/Layr-Labs/eigenda/blob/master/disperser/batcher/finalizer.go#L198
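For reference, the general idea behind that finalizer behavior can be sketched as follows (a simplification, not the actual EigenDA finalizer code; the function name and signature are made up for illustration):

```go
package verify

import (
	"context"
	"fmt"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

// detectReorgedConfirmation re-fetches the confirmation tx receipt and compares
// the block it currently sits in against the previously recorded confirmation
// block. If they differ, the confirmation tx was reorged into another block and
// the stored confirmation block number should be updated.
func detectReorgedConfirmation(
	ctx context.Context,
	client *ethclient.Client,
	txHash common.Hash,
	recordedBlock uint64,
) (reorged bool, currentBlock uint64, err error) {
	receipt, err := client.TransactionReceipt(ctx, txHash)
	if err != nil {
		return false, 0, fmt.Errorf("failed to fetch confirmation receipt: %w", err)
	}
	currentBlock = receipt.BlockNumber.Uint64()
	return currentBlock != recordedBlock, currentBlock, nil
}
```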
verify/cert.go
Outdated
expectedHash, err := cv.manager.BatchIdToBatchMetadataHash(&bind.CallOpts{BlockNumber: blockNumber}, id)
// 2. verify that the confirmation status has been reached
// We retry 5 times waiting 12 seconds (block time) between each retry,
// in case our eth node is behind that of the eigenda_client's node that deemed the batch confirmed
Why would we ever use two nodes? Or is the idea to have some resiliency to transient failures, where the provider could be unreliable due to poor configuration?
Both the eigenda-client and the proxy dial the endpoint individually, establishing independent connections. If the nodes are behind a load balancer, the two connections could end up on different eth nodes, and it's very common for RPC provider nodes to be out of sync (this bit me very hard once) for some dumb reason (web3 standards...).
Even if we injected the dialed connection from the proxy into the eigenda-client, the underlying TCP connection could drop between the eigenda-client returning and the proxy making its call, which would reconnect to another node, with the same effect.
Makes sense, and I was thinking that too - I got burnt in the past using a node cluster with an ALB ingress, where subsequent RPC queries could fail due to nodes being out of sync. I know OP clusters typically use proxyd to mitigate this. From what I remember, the connection is held with the ALB directly, with different nodes being queried for each new request on the client --> ALB connection. Feel free to resolve this comment!
Actually, on second thought - isn't this a bit naive when we have access to the L1ReferenceBlock number at which the batch should've been confirmed? Also, we're adding a 12-second cool-off, which is aligned with Ethereum's block production interval but not with syncing.
proxyd doesn't seem to solve this problem; they say: "we consider healthy is... latest block lag ≤ configurable threshold". The way to solve this is to use sticky sessions.
Not sure I understand your last comment. Is #192 (comment) related to what you are talking about?
Talked offline. Let's update to retrying every 2 seconds for the same amount of time (60 seconds).
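As a rough sketch of that agreement (illustration only; the helper name and the shape of the `check` callback are assumptions), the retry loop would poll every 2 seconds with the same 60-second overall budget:

```go
package verify

import (
	"context"
	"fmt"
	"time"
)

// retryConfirmationCheck polls the onchain confirmation check every 2 seconds
// for up to 60 seconds, instead of 5 attempts spaced by the 12-second block
// time. The total wait budget is unchanged, but an eth node that is only
// slightly behind is detected as caught up much sooner.
func retryConfirmationCheck(ctx context.Context, check func(ctx context.Context) error) error {
	const (
		retryInterval = 2 * time.Second
		totalTimeout  = 60 * time.Second
	)
	ctx, cancel := context.WithTimeout(ctx, totalTimeout)
	defer cancel()

	ticker := time.NewTicker(retryInterval)
	defer ticker.Stop()

	var lastErr error
	for {
		if lastErr = check(ctx); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("confirmation not reached within %s: %w", totalTimeout, lastErr)
		case <-ticker.C:
			// fall through and try again on the next tick
		}
	}
}
```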
@epociask I'll let you review my answers before rebasing and fixing conflicts, so that the commit links still work.
…tion depth guarantee
- flags: add new eigenda-client flags for blob confirmation depth; also pass those flags to verifier (until we also upstream verification to the eda client)
- comment: was pointing to wrong eigenda-client repo in TODO comment
- fix: go.mod to point to PR commit instead of using local replace directive
- chore: go mod tidy to generate go.sum
- chore: use proto Getter functions instead of fields (that are potentially nil)
- ci: upgrade golangci-lint version 1.60->1.61
- fix: verifySecurityParams func arguments after rebase
- chore: make more robust verifyBatchConfirmedOnchain logic (added retry logic and better comments)
- style: Onchain -> OnChain
- docs: better comment describing eth_getBlockByNumber call args
- style: better error msg when memstore enabled but cert verification is not
- fix: verifier.WaitForFinalization was not set
- fix(flags): deleted deprecated flags that had same name as new ones in other package, causing panic
- style(flags): merged WaitForFinalizationFlagName into ConfirmationDepth flag (it now accepts a uint conf depth or the 'finalized' string)
- chore: remove unused utils.EqualBytes function (same as stdlib exp function anyway)
- chore: remove log line added for debugging
Force-pushed from de950b2 to ba94b43.
LGTM
This PR depends on Layr-Labs/eigenda#821 being merged first.
The main simplification this allows is getting rid of the very complex retry logic we had, where an inner function returned an error that was caught by a for loop much further out, which was hard to understand. This PR makes that logic more local, but I opted to keep the check in eigenda-proxy as well, as an assert/safety guarantee. I think we also need it in the Get route... but I'm not super familiar with that code path; cc @epociask to make sure the logic in this PR still works for it.