feat: High availability via RAFT #2836
base: main
Conversation
**Codecov Report** ❌ Patch coverage is 41%.

```diff
@@            Coverage Diff             @@
##             main    #2836      +/-   ##
==========================================
- Coverage   64.76%   62.15%    -2.61%
==========================================
  Files          81       88        +7
  Lines        7350     8065      +715
==========================================
+ Hits         4760     5013      +253
- Misses       2049     2475      +426
- Partials      541      577       +36
```

Flags with carried-forward coverage won't be shown.
Merged main:
- build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
- build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
- build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
- build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
- build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
@claude review again
Claude finished @alpe's task (View job)

**Comprehensive Code Review: RAFT-Based High Availability**

I've completed a thorough review of the RAFT implementation for high availability. This is a substantial and well-architected feature. Below is my detailed feedback, organized by priority and category.

🔴 Critical Issues

**1. JSON Serialization Performance Bottleneck (pkg/raft/node.go:211)**

Issue: Using JSON for RAFT log replication at 100ms block intervals creates unnecessary overhead.

```go
data, err := json.Marshal(state) // todo: use protobuf
```

Impact: At 10 blocks/second, JSON marshaling/unmarshaling adds measurable latency that will compound under load.

**2. Race Condition Risk in Leadership Transfer (pkg/raft/election.go:96-104)**

Issue: The leadership transfer logic has a potential race condition:

```go
raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
if !raftSynced || !runnable.IsSynced(d.node.GetState()) {
	if err := d.node.leadershipTransfer(); err != nil && !errors.Is(err, raft.ErrNotLeader) {
		return err
	}
	continue
}
```

Problem: Between checking the sync status and acting on it, the node's state can change, so the transfer decision may be based on stale information.

**3. Store Sync Without Error Handling Context (block/internal/executing/executor.go:461-463)**

Issue:

```go
if err := e.store.Sync(context.Background()); err != nil {
	return fmt.Errorf("failed to sync store: %w", err)
}
```

Problem: Using `context.Background()` detaches the sync from the caller's cancellation and deadlines, so a slow store sync cannot be interrupted during shutdown (see the sketch after issue 4).

**4. Missing Validation in FSM.Apply (pkg/raft/node.go:285-287)**

Issue: The validation only checks basic monotonicity:

```go
if err := f.state.Load().assertValid(state); err != nil {
	return err
}
```

Problem: Looking at `assertValid`, the applied state is not validated beyond these monotonicity checks.
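For issue 3, a minimal sketch of threading the caller's context through, with a hypothetical `syncTimeout` bound so shutdown cannot hang (the wrapper name and constant are assumptions, not from the PR):

```go
import (
	"context"
	"fmt"
	"time"
)

// syncTimeout bounds how long a store sync may take (hypothetical value).
const syncTimeout = 5 * time.Second

// syncStore threads the caller's context through instead of using
// context.Background(), so cancellation and deadlines propagate.
func (e *Executor) syncStore(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, syncTimeout)
	defer cancel()
	if err := e.store.Sync(ctx); err != nil {
		return fmt.Errorf("failed to sync store: %w", err)
	}
	return nil
}
```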
🟡 High Priority Issues

**5. Channel Buffer Size Risk (block/internal/syncing/raft_retriever.go:58)**

Issue:

```go
applyCh := make(chan raft.RaftApplyMsg, 1)
```

Problem: A buffer of 1 is very small. If RAFT applies logs faster than the syncer can process them (e.g., during catch-up after a network partition), this will block RAFT's `FSM.Apply`. A hedged decoupling sketch follows.
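One way to keep `FSM.Apply` from blocking on a slow consumer, sketched as an unbounded FIFO between Apply and the syncer (the type and method names are illustrative, not from the PR):

```go
import "sync"

// applyQueue decouples the RAFT FSM from the syncer: Apply pushes without
// blocking, and the syncer drains at its own pace.
type applyQueue[T any] struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []T
}

func newApplyQueue[T any]() *applyQueue[T] {
	q := &applyQueue[T]{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Push never blocks; the queue grows during catch-up bursts.
func (q *applyQueue[T]) Push(item T) {
	q.mu.Lock()
	q.items = append(q.items, item)
	q.mu.Unlock()
	q.cond.Signal()
}

// Pop blocks until an item is available.
func (q *applyQueue[T]) Pop() T {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait()
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item
}
```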
**6. Incomplete isCatchingUpState Implementation (block/internal/syncing/syncer.go:171-183)**

Issue: Per @julienrbrt's comment, this doesn't capture the DA catch-up state:

```go
return s.headerStore.Store().Height() > currentHeight ||
	s.dataStore.Store().Height() > currentHeight ||
	s.raftRetriever != nil && s.raftRetriever.Height() > currentHeight
```

Problem: If the node is catching up from DA while p2p/raft are synced, it will incorrectly report as synced.

**7. No Cleanup on Apply Callback Errors (pkg/raft/node.go:279-300)**

Issue: `FSM.Apply` returns errors but has no cleanup/rollback logic:

```go
if err := json.Unmarshal(log.Data, &state); err != nil {
	f.logger.Error().Err(err).Msg("unmarshal block state")
	return err
}
```

Problem: If unmarshal fails, the log entry is applied in RAFT but not reflected in the FSM state. This creates inconsistency.

**8. Follower Mode Doesn't Validate Block Signatures (block/internal/syncing/raft_retriever.go:99-128)**

Issue:

```go
if err := header.Header.ValidateBasic(); err != nil {
	r.logger.Debug().Err(err).Msg("invalid header structure")
	return nil // Silent skip
}
```

Problem: This only validates the structure and proposer address, not the cryptographic signature. Followers trust the leader without verification.

🟢 Medium Priority Issues

**9. Non-Atomic Mode Switching (node/failover.go:177-231)**

Issue: The startup sequence is not atomic:

```go
if err := f.p2pClient.Start(ctx); err != nil {
	return fmt.Errorf("start p2p: %w", err)
}
defer f.p2pClient.Close()
// ... more starts with defers
```

Problem: If startup fails midway, the cleanup order may not match the startup order. Defers run in reverse, but some services may have dependencies (see the cleanup sketch after issue 11).

**10. P2P Peer Clearing Side Effect (node/full.go:100)**

Issue:

```go
nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode
```

Problem: This mutates shared config that may be used elsewhere. Side effects on config structs can cause subtle bugs.

**11. Missing Metrics for RAFT Operations**

Observation: No custom metrics are exposed for RAFT operations.
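Regarding issue 9: a common remedy is to record cleanup functions as each service starts and unwind them explicitly in reverse on failure. A sketch with hypothetical receiver and service names:

```go
import (
	"context"
	"errors"
	"fmt"
)

// startAll starts services in order and, on any failure, stops the ones
// already started in reverse order. Names are illustrative, not from the PR.
func (f *FailoverState) startAll(ctx context.Context) (err error) {
	var cleanups []func() error
	defer func() {
		if err == nil {
			return
		}
		// Unwind in reverse start order so dependents stop first.
		for i := len(cleanups) - 1; i >= 0; i-- {
			if cerr := cleanups[i](); cerr != nil {
				err = errors.Join(err, cerr)
			}
		}
	}()

	if err = f.p2pClient.Start(ctx); err != nil {
		return fmt.Errorf("start p2p: %w", err)
	}
	cleanups = append(cleanups, f.p2pClient.Close)
	// ... start remaining services, appending their cleanups in order.
	return nil
}
```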
**12. waitForMsgsLanded Polling Efficiency (pkg/raft/node.go:148-167)**

Issue:

```go
ticker := time.NewTicker(min(n.config.SendTimeout, timeout) / 2)
for {
	select {
	case <-ticker.C:
		if n.raft.AppliedIndex() >= n.raft.LastIndex() {
			return nil
		}
```

Problem: This polls with a ticker even though RAFT likely has better notification mechanisms; a hedged sketch follows.
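For issue 12: hashicorp/raft exposes `Barrier()`, which appends a no-op entry and blocks until everything before it is applied to the FSM, removing the need to poll `AppliedIndex`. A sketch, noting that a barrier can only be issued on the leader:

```go
import (
	"fmt"
	"time"
)

// waitForMsgsLanded sketched with a raft barrier instead of a polling ticker.
// Barrier blocks until all preceding log entries have been applied to the FSM;
// the future returns an error (e.g. raft.ErrNotLeader) if this node is not the leader.
func (n *Node) waitForMsgsLanded(timeout time.Duration) error {
	if err := n.raft.Barrier(timeout).Error(); err != nil {
		return fmt.Errorf("raft barrier: %w", err)
	}
	return nil
}
```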
🔵 Code Quality & Best Practices

**13. Inconsistent Nil Checks**

Several functions start with:

```go
if n == nil {
	return nil
}
```

Issue: Calling methods on a nil receiver and silently returning hides programming errors instead of surfacing them.

**14. Error Message Consistency**

Recommendation: Standardize error messages. Use sentence case and consistent terminology.

**15. Magic Numbers in Configuration**

Recommendation: Move them to named constants or configuration parameters with explanatory comments.

**16. Duplicate Code in Failover Factories (node/failover.go:40-107)**

The sync-mode and aggregator-mode factories duplicate most of their setup logic.

**17. Verbose Deferred Error Handling**

This pattern appears frequently:

```go
defer func() {
	if err := f.bc.Stop(); err != nil && !errors.Is(err, context.Canceled) {
		rerr = errors.Join(rerr, fmt.Errorf("stopping block components: %w", err))
	}
}()
```

Recommendation: Consider a helper function to reduce the boilerplate; a hedged sketch follows.
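A sketch of such a helper; it folds a stop error into the function's named return, ignoring plain context cancellation (names are illustrative):

```go
import (
	"context"
	"errors"
	"fmt"
)

// stopAndJoin runs stop and folds any meaningful error into *rerr.
// Intended for use in defers; context.Canceled is treated as a clean stop.
func stopAndJoin(rerr *error, what string, stop func() error) {
	if err := stop(); err != nil && !errors.Is(err, context.Canceled) {
		*rerr = errors.Join(*rerr, fmt.Errorf("stopping %s: %w", what, err))
	}
}
```

Call sites then shrink to `defer stopAndJoin(&rerr, "block components", f.bc.Stop)`.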
📋 Testing & Coverage Observations

**18. Coverage Drop is Concerning**

Critical Gap: The RAFT retriever has zero test coverage, yet it is a critical component that handles consensus data.
**19. Test Coverage for Race Conditions**

The election logic has complex concurrent state transitions that the tests don't cover.

Recommendation: Add fuzzing or stress tests run under the Go race detector.

🎯 Architecture & Design

**20. Excellent Separation of Concerns**

✅ The separation between the raft package and the block components is clean.

**21. Good Use of Atomic State**

✅

**22. RAFT Integration Strategy**

✅ Using hashicorp/raft is solid; it is a production-proven library.

**23. Failover Design Question**

The current design uses RAFT for leader election but still broadcasts to P2P. Question: Why maintain the P2P broadcast from leader to followers if RAFT is already replicating state?

🔒 Security Review

**24. Limited Byzantine Fault Tolerance**

RAFT provides crash fault tolerance (CFT), not Byzantine fault tolerance (BFT), so a compromised leader could sign and replicate invalid blocks.
Mitigation: The block signature verification in the executor provides some protection, but followers in RAFT mode don't verify.

**25. No TLS on RAFT Transport**

```go
transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)
```

Issue: RAFT traffic is unencrypted, and leader-follower communication includes block data and consensus messages (a hedged TLS sketch follows issue 27).

📝 Documentation Gaps

**26. Missing ADR Reference**

No ADR is referenced for a change of this scope.

**27. Configuration Documentation**

The new raft configuration options are not documented.
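Regarding issue 25: hashicorp/raft accepts any `StreamLayer`, so the TCP transport can be swapped for a TLS one. A minimal sketch, assuming certificates are provisioned out of band (the `tls.Config` setup is omitted; names are illustrative):

```go
import (
	"crypto/tls"
	"net"
	"os"
	"time"

	"github.com/hashicorp/raft"
)

// tlsStreamLayer implements raft.StreamLayer over TLS connections.
type tlsStreamLayer struct {
	net.Listener             // accepts incoming TLS connections
	tlsConfig *tls.Config    // client config for outgoing dials
}

// Dial opens an encrypted connection to a peer.
func (t *tlsStreamLayer) Dial(addr raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
	d := &net.Dialer{Timeout: timeout}
	return tls.DialWithDialer(d, "tcp", string(addr), t.tlsConfig)
}

// newTLSTransport mirrors raft.NewTCPTransport but encrypts all traffic.
func newTLSTransport(bindAddr string, tlsConfig *tls.Config) (*raft.NetworkTransport, error) {
	ln, err := tls.Listen("tcp", bindAddr, tlsConfig)
	if err != nil {
		return nil, err
	}
	stream := &tlsStreamLayer{Listener: ln, tlsConfig: tlsConfig}
	return raft.NewNetworkTransport(stream, 3, 10*time.Second, os.Stderr), nil
}
```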
**28. Operational Runbook Needed**

This introduces complex operational scenarios.

Recommendation: Create operational documentation before production use.

✅ What's Done Well

🎬 Final Recommendations

Priority Order: address the 🔴 critical issues first, then the 🟡 high-priority items.

Overall Assessment

This is a well-architected implementation of RAFT-based HA. The core design is sound, but there are critical issues around synchronization safety and performance that must be addressed before production use. The test coverage gap is concerning for such a critical feature: the 41% patch coverage is insufficient for consensus code. I strongly recommend increasing coverage to at least 80% for the consensus-critical packages.
julienrbrt left a comment:
first walkthrough, I'll go read about github.com/hashicorp/raft and come back to review election.go and node.go
```go
	return nil
}

// Height returns the current height stored
```
Why do we need to know the height of the p2p (go header) store? (I am still reviewing, this may get clearer). We can have the app height from the evolve store.
When the node switches from sync to aggregator mode, the internal state is key to preventing double signing.
The Syncer now has an isCatchingUpState method that checks the stores for any height > current.
It is called within the leader election loop to transfer leadership away in case the node is not fully synced yet.
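For reference, a sketch of that check extended with the DA clause the review asks for in item 6 (`daHighestHeight` is a hypothetical accessor, not in this PR):

```go
// isCatchingUpState reports whether any known source is ahead of the node's
// current height. The DA clause sketches the fix suggested in review item 6.
func (s *Syncer) isCatchingUpState(currentHeight uint64) bool {
	return s.headerStore.Store().Height() > currentHeight ||
		s.dataStore.Store().Height() > currentHeight ||
		(s.raftRetriever != nil && s.raftRetriever.Height() > currentHeight) ||
		s.daHighestHeight() > currentHeight // hypothetical DA height accessor
}
```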
```go
}

// SetApplyCallback sets a callback to be called when log entries are applied
func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
```
nit: what is this for? the go doc is very light
The channel is passed in by the syncer to receive first-level state updates from within the raft cluster. This should be the fastest communication channel available.
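A sketch of the wiring from the syncer side, based on the signatures visible in this diff (the buffer size and handler name are illustrative):

```go
// The syncer registers a channel and consumes applied RAFT entries.
applyCh := make(chan raft.RaftApplyMsg, 64) // larger buffer than 1; see review item 5
raftNode.SetApplyCallback(applyCh)

go func() {
	for msg := range applyCh {
		// Hand each replicated block state to the syncer for processing.
		s.processRaftApply(msg) // hypothetical handler
	}
}()
```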
```go
}()

// Check raft leadership if raft is enabled
if e.raftNode != nil && !e.raftNode.IsLeader() {
```
unrelated: I wonder how this will play with different sequencers.
In #2797 you can get to that path without a node key (to sign). I suppose we'll need to add a condition for based sequencing.
Yes, I was only preparing for the single sequencer. Based sequencing would not work with raft, as there are no aggregators.
```go
leaderFactory := func() (raftpkg.Runnable, error) {
	logger.Info().Msg("Starting aggregator-MODE")
	nodeConfig.Node.Aggregator = true
	nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode
```
Not sure I understand this. Is the aggregator broadcasting to no one?
The aggregator is required to broadcast to at least one node that is part of a larger mesh, otherwise p2p will not work.
This is more about who calls whom: the aggregator gets called, not the other way around. Starting all nodes with a p2p-peer setup makes sense, though. When an HA cluster is set up, the raft leader takes the aggregator role, and I clear the peers when the p2p stack is restarted.
There is an error thrown somewhere when peers are not empty.
node/full.go (outdated)
```go
func initRaftNode(nodeConfig config.Config, logger zerolog.Logger) (*raftpkg.Node, error) {
	raftDir := nodeConfig.Raft.RaftDir
	if raftDir == "" {
		raftDir = filepath.Join(nodeConfig.RootDir, "raft")
```
nit: we should be using DefaultConfig() value if empty.
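Something like the following, assuming `DefaultConfig()` carries a raft dir default (a sketch, not the PR's code):

```go
raftDir := nodeConfig.Raft.RaftDir
if raftDir == "" {
	// Fall back to the framework default rather than a hardcoded path.
	raftDir = config.DefaultConfig().Raft.RaftDir
}
```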
```go
	bc *block.Components
}

func newSyncMode(
```
nit: I was a tiny bit confused this was moved here instead of full.go
These are the constructors. Naming could be better, I guess.
```go
	}
	return setupFailoverState(nodeConfig, nodeKey, database, genesis, logger, mainKV, rktStore, blockComponentsFn, raftNode)
}

func newAggregatorMode(
```
ditto
```go
	return fmt.Errorf("not leader")
}

data, err := json.Marshal(state) // todo: use protobuf
```
why the todo? size?
We should migrate to protobuf here. JSON will cause overhead; at 100ms block times we need to minimise it as much as possible.
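A sketch of what the protobuf path could look like, assuming a generated `pb.BlockState` message mirroring the replicated state struct (the message type and package are assumptions, not part of this PR):

```go
import (
	pb "example.com/evnode/types/pb" // hypothetical generated package

	"google.golang.org/protobuf/proto"
)

// marshalState encodes the replicated block state with protobuf instead of
// encoding/json, which is considerably cheaper at 100ms block times.
func marshalState(state *pb.BlockState) ([]byte, error) {
	return proto.Marshal(state)
}

// unmarshalState is the counterpart used when applying log entries.
func unmarshalState(data []byte) (*pb.BlockState, error) {
	var state pb.BlockState
	if err := proto.Unmarshal(data, &state); err != nil {
		return nil, err
	}
	return &state, nil
}
```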
Merged main:
- chore: fix some comments (#2874)
- chore: bump node in evm-single (#2875)
- refactor(syncer,cache): use compare and swap loop and add comments (#2873)
- refactor: use state da height as well (#2872)
- refactor: retrieve highest da height in cache (#2870)
- chore: change from event count to start and end height (#2871)
## Overview
Speed up cache writes/loads via parallel execution. Pulled from #2836.
## Overview
Minor updates to make it easier to trace errors. Extracted from #2836.
Merged main:
- chore: remove extra github action yml file (#2882)
- fix(execution/evm): verify payload status (#2863)
- feat: fetch included da height from store (#2880)
- chore: better output on errors (#2879)
- refactor!: create da client and split cache interface (#2878)
- chore!: rename `evm-single` and `grpc-single` (#2839)
- build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
- chore: parallel cache de/serialization (#2868)
- chore: bump blob size (#2877)
Implement failover via RAFT