
Conversation

Contributor

@alpe alpe commented Nov 12, 2025

Implement failover via RAFT

  • Improve Cache startup/shutdown with parallelization
  • Publish to RAFT cluster in executor
  • Sync DB after each block created in executor
  • Add new RaftReceiver to sync when in aggregator follower mode
  • Introduce failoverState to switch between follower/leader mode
  • Provide RAFT node details via an HTTP endpoint

Contributor

github-actions bot commented Nov 12, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Nov 24, 2025, 4:28 PM


codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 41.34831% with 522 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.15%. Comparing base (8cd0fb8) to head (50d7caf).

Files with missing lines | Patch % | Lines
pkg/raft/node.go | 12.50% | 168 Missing ⚠️
pkg/raft/node_mock.go | 45.08% | 74 Missing and 21 partials ⚠️
block/internal/syncing/raft_retriever.go | 0.00% | 60 Missing ⚠️
node/full.go | 32.81% | 36 Missing and 7 partials ⚠️
node/failover.go | 74.45% | 22 Missing and 13 partials ⚠️
block/internal/syncing/syncer.go | 28.88% | 30 Missing and 2 partials ⚠️
block/internal/executing/executor.go | 6.25% | 25 Missing and 5 partials ⚠️
pkg/raft/election.go | 79.26% | 12 Missing and 5 partials ⚠️
pkg/rpc/server/http.go | 6.66% | 13 Missing and 1 partial ⚠️
block/internal/syncing/assert.go | 57.89% | 4 Missing and 4 partials ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2836      +/-   ##
==========================================
- Coverage   64.76%   62.15%   -2.61%     
==========================================
  Files          81       88       +7     
  Lines        7350     8065     +715     
==========================================
+ Hits         4760     5013     +253     
- Misses       2049     2475     +426     
- Partials      541      577      +36     
Flag | Coverage Δ
combined | 62.15% <41.34%> (-2.61%) ⬇️

Flags with carried forward coverage won't be shown.


alpe added 4 commits November 17, 2025 10:55
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
@alpe alpe changed the title from "[WIP] feat: High availability via RAFT" to "feat: High availability via RAFT" on Nov 17, 2025
alpe added 4 commits November 18, 2025 11:00
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
@alpe alpe marked this pull request as ready for review November 19, 2025 09:54
Contributor Author

alpe commented Nov 19, 2025

@claude review again

Contributor

claude bot commented Nov 19, 2025

Claude finished @alpe's task.


Comprehensive Code Review: RAFT-Based High Availability

I've completed a thorough review of the RAFT implementation for high availability. This is a substantial and well-architected feature. Below is my detailed feedback organized by priority and category.


🔴 Critical Issues

1. JSON Serialization Performance Bottleneck (pkg/raft/node.go:211)

Issue: Using JSON for RAFT log replication at 100ms block intervals creates unnecessary overhead.

data, err := json.Marshal(state) // todo:use protobuf

Impact: At 10 blocks/second, JSON marshaling/unmarshaling adds measurable latency. This will compound under load.
Recommendation: Migrate to protobuf immediately. The TODO comment acknowledges this, but it's critical for production. JSON can add 2-5x overhead vs protobuf for binary data.
As noted by @tac0turtle: This should be prioritized before launch.
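A minimal sketch of the protobuf direction, using google.golang.org/protobuf/proto; pb.RaftBlockState and toProto are hypothetical helpers that would come from a .proto definition mirroring RaftBlockState:

// Inside the publish path: replace the JSON round-trip with protobuf.
// pb.RaftBlockState / toProto are illustrative, not part of this diff.
data, err := proto.Marshal(toProto(state))
if err != nil {
    return fmt.Errorf("marshal block state: %w", err)
}
return n.raft.Apply(data, n.config.SendTimeout).Error()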

2. Race Condition Risk in Leadership Transfer (pkg/raft/election.go:96-104)

Issue: The leadership transfer logic has a potential race condition:

raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
if !raftSynced || !runnable.IsSynced(d.node.GetState()) {
    if err := d.node.leadershipTransfer(); err != nil && !errors.Is(err, raft.ErrNotLeader) {
        return err
    }
    continue
}

Problem: Between checking IsSynced() and starting leader operations, new blocks could arrive via P2P/DA that aren't yet in RAFT state. This could lead to double-signing on stale state.
Recommendation: Add a final sync check immediately before starting leader operations, or use a sync barrier pattern.
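One possible shape for that re-check, reusing the helpers already shown above and placed immediately before the leader runnable takes over (the placement, not the helpers, is the suggestion):

// Final gate right before leader operations start: if anything new landed via
// RAFT/P2P/DA since the first check, hand leadership back instead of producing
// on stale state.
if d.node.waitForMsgsLanded(d.node.Config().SendTimeout) != nil ||
    !runnable.IsSynced(d.node.GetState()) {
    if err := d.node.leadershipTransfer(); err != nil && !errors.Is(err, raft.ErrNotLeader) {
        return err
    }
    continue
}
// ... hand control to the leader runnable only after this point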

3. Store Sync Without Error Handling Context (block/internal/executing/executor.go:461-463)

Issue:

if err := e.store.Sync(context.Background()); err != nil {
    return fmt.Errorf("failed to sync store: %w", err)
}

Problem: Using context.Background() ignores parent context cancellation. If the executor is shutting down, we could block indefinitely.
Recommendation: Use the executor's context: e.ctx or pass through the parent context.
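For example, assuming the executor carries its lifecycle context as e.ctx, as the recommendation suggests:

if err := e.store.Sync(e.ctx); err != nil {
    return fmt.Errorf("failed to sync store: %w", err)
}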

4. Missing Validation in FSM.Apply (pkg/raft/node.go:285-287)

Issue: The validation only checks basic monotonicity:

if err := f.state.Load().assertValid(state); err != nil {
    return err
}

Problem: Looking at types.go:15-23, assertValid only checks height/timestamp ordering, not chain continuity (previous hash, state root, etc.).
Recommendation: Add validation for the following (see the sketch after this list):

  • Block hash matches header hash
  • Previous block hash linkage
  • State root continuity
  • Signature verification (or document why this is skipped)
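A hedged sketch of what the extended check could look like; the field names on RaftBlockState (HeaderHash, LastHeaderHash, Timestamp) are assumptions about the struct, not its actual shape:

// assertValid with chain-continuity checks added on top of the existing
// height/timestamp ordering. Field names are illustrative only.
func (prev *RaftBlockState) assertValid(next *RaftBlockState) error {
    if next.Height <= prev.Height {
        return fmt.Errorf("height not increasing: %d -> %d", prev.Height, next.Height)
    }
    if !next.Timestamp.After(prev.Timestamp) {
        return fmt.Errorf("timestamp not increasing")
    }
    if next.Height == prev.Height+1 && !bytes.Equal(next.LastHeaderHash, prev.HeaderHash) {
        return fmt.Errorf("previous header hash mismatch at height %d", next.Height)
    }
    // State-root continuity and signature verification (or a documented reason
    // for skipping them) would slot in here.
    return nil
}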

🟡 High Priority Issues

5. Channel Buffer Size Risk (block/internal/syncing/raft_retriever.go:58)

Issue:

applyCh := make(chan raft.RaftApplyMsg, 1)

Problem: Buffer of 1 is very small. If RAFT applies logs faster than the syncer can process (e.g., during catch-up after network partition), this will block RAFT FSM.Apply.
Recommendation: Either:

  • Increase buffer to 100+ to handle burst scenarios
  • Document the performance implications of buffer=1
  • Add metrics to track channel saturation (sketched below)
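A sketch combining a larger buffer with a saturation signal; the buffer size and the raftApplyChFull counter are illustrative:

// Larger buffer so catch-up bursts don't block FSM.Apply.
applyCh := make(chan raft.RaftApplyMsg, 128)

// At the send site: prefer a non-blocking send and record when the consumer is behind.
select {
case applyCh <- msg:
default:
    raftApplyChFull.Inc() // hypothetical counter tracking channel saturation
    applyCh <- msg        // still deliver; the metric just makes the backpressure visible
}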

6. Incomplete isCatchingUpState Implementation (block/internal/syncing/syncer.go:171-183)

Issue: Per @julienrbrt's comment, this doesn't capture DA catch-up state:

return s.headerStore.Store().Height() > currentHeight ||
    s.dataStore.Store().Height() > currentHeight ||
    s.raftRetriever != nil && s.raftRetriever.Height() > currentHeight

Problem: If the node is catching up from DA but p2p/raft are synced, it will incorrectly report as synced.
Response to @alpe: While DA is slowest, nodes can receive blocks via P2P faster than DA. The sync state should be comprehensive.
Recommendation: Add DA height check or clearly document the assumption that RAFT nodes don't depend on DA for sync.
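If the DA check is the chosen route, the expression above could grow one more clause; daHeight() is a hypothetical accessor for the highest height observed via DA, not something in the current diff:

return s.headerStore.Store().Height() > currentHeight ||
    s.dataStore.Store().Height() > currentHeight ||
    (s.raftRetriever != nil && s.raftRetriever.Height() > currentHeight) ||
    s.daHeight() > currentHeight // hypothetical: highest block height seen via DA retrieval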

7. No Cleanup on Apply Callback Errors (pkg/raft/node.go:279-300)

Issue: FSM.Apply returns errors but doesn't have cleanup/rollback logic:

if err := json.Unmarshal(log.Data, &state); err != nil {
    f.logger.Error().Err(err).Msg("unmarshal block state")
    return err
}

Problem: If unmarshal fails, the log is applied in RAFT but not in FSM state. This creates inconsistency.
Recommendation: Document RAFT's error handling expectations. Consider if this should panic instead (RAFT expects FSM to be deterministic).
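If determinism wins the argument, the handler could fail loudly instead of returning an error; whether to panic here is a policy decision, the sketch only shows the shape:

var state RaftBlockState
if err := json.Unmarshal(log.Data, &state); err != nil {
    // An undecodable log entry means the replicated log and the FSM can no longer
    // agree; crashing keeps the FSM deterministic instead of silently diverging.
    f.logger.Error().Err(err).Uint64("index", log.Index).Msg("unmarshal block state")
    panic(fmt.Sprintf("raft FSM: undecodable log entry at index %d: %v", log.Index, err))
}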

8. Follower Mode Doesn't Validate Block Signatures (block/internal/syncing/raft_retriever.go:99-128)

Issue:

if err := header.Header.ValidateBasic(); err != nil {
    r.logger.Debug().Err(err).Msg("invalid header structure")
    return nil  // Silent skip
}

Problem: Only validates structure and proposer address, not cryptographic signature. Followers trust the leader without verification.
Security Risk: A compromised/buggy leader could propagate invalid blocks.
Recommendation: Add signature verification or document the trust model explicitly.
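A hedged sketch of follower-side verification; header.Verify and r.proposerPubKey are placeholders for whatever signed-header verification the P2P sync path already uses, not confirmed APIs:

if err := header.Header.ValidateBasic(); err != nil {
    r.logger.Debug().Err(err).Msg("invalid header structure")
    return nil
}
// Also check the aggregator's signature before trusting the leader's payload.
if err := header.Verify(r.proposerPubKey); err != nil { // placeholder verification call
    r.logger.Warn().Err(err).Msg("header signature verification failed; dropping raft payload")
    return nil
}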


🟢 Medium Priority Issues

9. Non-Atomic Mode Switching (node/failover.go:177-231)

Issue: The failoverState.Run starts multiple services sequentially with deferred cleanup:

if err := f.p2pClient.Start(ctx); err != nil {
    return fmt.Errorf("start p2p: %w", err)
}
defer f.p2pClient.Close()
// ... more starts with defers

Problem: If startup fails midway, cleanup order may not match startup order. Defers run in reverse, but some services may have dependencies.
Recommendation: Use errgroup for parallel startup and explicit cleanup order on error.
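A sketch of the errgroup approach (import golang.org/x/sync/errgroup); whether every service can actually start in parallel depends on their real dependencies:

g, gctx := errgroup.WithContext(ctx)
g.Go(func() error {
    if err := f.p2pClient.Start(gctx); err != nil {
        return fmt.Errorf("start p2p: %w", err)
    }
    return nil
})
g.Go(func() error {
    // ... start the next independent service here
    return nil
})
if err := g.Wait(); err != nil {
    // Tear down whatever did come up in an explicit, documented order
    // instead of relying on defer ordering.
    f.p2pClient.Close()
    return err
}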

10. P2P Peer Clearing Side Effect (node/full.go:100)

Issue:

nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode

Problem: This mutates shared config that may be used elsewhere. Side effects on config structs can cause subtle bugs.
Recommendation: Create a copy of the config or use a dedicated flag for peer mode.
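For example, mutate a local copy rather than the shared value (assuming config.Config copies safely by value; any reference-typed fields would need a deeper copy):

cfg := nodeConfig          // work on a copy; the caller's config stays untouched
cfg.Node.Aggregator = true
cfg.P2P.Peers = ""         // cleared only for this aggregator instance
// pass cfg (not nodeConfig) to the aggregator-mode constructor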

11. Missing Metrics for RAFT Operations

Observation: No custom metrics for:

  • RAFT leader elections
  • Leadership transfer events
  • FSM apply latency
  • RAFT log size/lag
  • Failover events

Recommendation: Add metrics for operational visibility. RAFT-based systems need monitoring for split-brain detection and performance issues.
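A sketch of the kind of instruments that would cover the list above, using prometheus/client_golang (metric names are suggestions):

var (
    raftLeaderElections = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "evnode_raft_leader_elections_total",
        Help: "Times this node won a RAFT leader election.",
    })
    raftLeadershipTransfers = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "evnode_raft_leadership_transfers_total",
        Help: "Voluntary leadership transfers (e.g. leader not synced).",
    })
    raftFSMApplyLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "evnode_raft_fsm_apply_seconds",
        Help:    "Latency of FSM.Apply per log entry.",
        Buckets: prometheus.DefBuckets,
    })
)

func init() {
    prometheus.MustRegister(raftLeaderElections, raftLeadershipTransfers, raftFSMApplyLatency)
}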

12. waitForMsgsLanded Polling Efficiency (pkg/raft/node.go:148-167)

Issue:

ticker := time.NewTicker(min(n.config.SendTimeout, timeout) / 2)
for {
    select {
    case <-ticker.C:
        if n.raft.AppliedIndex() >= n.raft.LastIndex() {
            return nil
        }

Problem: Polling with ticker when RAFT likely has better mechanisms for notification.
Recommendation: Investigate if hashicorp/raft provides a channel or callback for AppliedIndex updates to avoid polling.
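hashicorp/raft does expose Barrier, which blocks until every previously submitted log entry has been applied to the FSM; a sketch of using it in place of the polling loop:

// waitForMsgsLanded without polling: Barrier returns once all preceding log
// entries have been applied to the FSM, or fails when the timeout elapses.
func (n *Node) waitForMsgsLanded(timeout time.Duration) error {
    if err := n.raft.Barrier(timeout).Error(); err != nil {
        return fmt.Errorf("raft barrier: %w", err)
    }
    return nil
}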


🔵 Code Quality & Best Practices

13. Inconsistent Nil Checks

Several functions start with:

if n == nil {
    return nil
}

This pattern appears in:

  • node.go:108-110, 149-151, 169-172, 177-180

Issue: Calling methods on nil receivers is valid in Go but often indicates a design smell. Better to panic early or return errors.
Recommendation: Document why nil nodes are allowed, or remove this pattern and require non-nil invariant.

14. Error Message Consistency

  • node.go:331: "not raft leader" (lowercase, no article)
  • election.go:100: "became leader, but not synced. Pass on leadership" (informal tone)

Recommendation: Standardize error messages. Use sentence case and consistent terminology.

15. Magic Numbers in Configuration

  • election.go:83: ticker := time.NewTicker(300 * time.Millisecond)
  • failover.go:158-160: ReadTimeout: 10 * time.Second, WriteTimeout: 10 * time.Second, IdleTimeout: 120 * time.Second
  • node.go:58: raftConfig.LeaderLeaseTimeout = cfg.HeartbeatTimeout / 2

Recommendation: Move to named constants or configuration parameters with explanatory comments.
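For instance (constant names are suggestions):

const (
    electionCheckInterval = 300 * time.Millisecond // election.go leadership-check ticker
    httpReadTimeout       = 10 * time.Second
    httpWriteTimeout      = 10 * time.Second
    httpIdleTimeout       = 120 * time.Second
)
// node.go: keep the derivation but explain it, e.g.
// "LeaderLeaseTimeout must not exceed HeartbeatTimeout per hashicorp/raft's config validation".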

16. Duplicate Code in Failover Factories (node/failover.go:40-107)

The newSyncMode and newAggregatorMode functions are nearly identical with only the component builder differing.
Recommendation: Extract common logic into a helper function to reduce duplication.

17. Verbose Deferred Error Handling

Pattern appears frequently:

defer func() {
    if err := f.bc.Stop(); err != nil && !errors.Is(err, context.Canceled) {
        rerr = errors.Join(rerr, fmt.Errorf("stopping block components: %w", err))
    }
}()

Recommendation: Consider a helper function deferStop to reduce verbosity.
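A possible shape for such a helper (name and signature are suggestions):

// deferStop wraps the repeated "stop, ignore context.Canceled, join into rerr" pattern.
func deferStop(rerr *error, what string, stop func() error) {
    if err := stop(); err != nil && !errors.Is(err, context.Canceled) {
        *rerr = errors.Join(*rerr, fmt.Errorf("stopping %s: %w", what, err))
    }
}

// usage:
//   defer deferStop(&rerr, "block components", f.bc.Stop)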


📋 Testing & Coverage Observations

18. Coverage Drop is Concerning

  • Overall coverage: 64.76% → 62.24% (-2.52%)
  • Patch coverage: 41.34% (522 lines missing)
  • Core files with low coverage:
    • pkg/raft/node.go: 12.50% (168 missing)
    • block/internal/syncing/raft_retriever.go: 0.00% (60 missing)
    • node/failover.go: 74.45% (35 missing)

Critical Gap: The RAFT retriever has zero coverage. This is a critical component that handles consensus data.

Recommendation:

  • Add unit tests for raft_retriever.go covering:
    • Invalid block handling
    • Concurrent apply operations
    • Stop/restart scenarios
  • Add integration tests for:
    • Leader failover with transactions in flight
    • Network partition scenarios
    • Split-brain prevention
  • The E2E test TestLeaseFailoverE2E is good but insufficient alone

19. Test Coverage for Race Conditions

The election logic has complex concurrent state transitions but tests don't cover:

  • Rapid leader/follower oscillation
  • Multiple leadership changes in quick succession
  • Context cancellation during state transition

Recommendation: Add fuzzing or stress tests with -race flag enabled.


🎯 Architecture & Design

20. Excellent Separation of Concerns

✅ The separation between DynamicLeaderElection, Node, and failoverState is clean.
✅ Using factory functions for leader/follower construction is elegant.
✅ The Runnable interface abstracts the mode nicely.

21. Good Use of Atomic State

✅ FSM.state uses atomic.Pointer[RaftBlockState] correctly.
✅ running flag properly synchronized.

22. RAFT Integration Strategy

✅ Using hashicorp/raft is solid - production-proven library.
✅ TCP transport with BoltDB backend is appropriate for this scale.
✅ Snapshot configuration is sensible.

23. Failover Design Question

The current design uses RAFT for leader election but still broadcasts to P2P.

Question: Why maintain P2P broadcast from leader to followers if RAFT is already replicating state?
Benefit: Redundancy and allows non-RAFT nodes to sync
Tradeoff: Double bandwidth for RAFT cluster members
Recommendation: Document this design decision. Consider allowing RAFT-only mode where followers don't need P2P.


🔒 Security Review

24. Limited Byzantine Fault Tolerance

RAFT provides crash fault tolerance (CFT), not Byzantine fault tolerance (BFT). If a leader is compromised and signs invalid blocks:

  • Followers will replicate invalid state
  • No proof-of-fraud mechanism exists

Mitigation: The block signature verification in the executor provides some protection, but followers in RAFT mode don't verify.
Recommendation: Document trust assumptions and consider if BFT consensus is needed for production.

25. No TLS on RAFT Transport

transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)

Issue: RAFT traffic is unencrypted. Leader-follower communication includes block data and consensus messages.
Recommendation: Add TLS configuration for production deployments.
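hashicorp/raft accepts custom transports through raft.NewNetworkTransport and the raft.StreamLayer interface, so TLS can be added without forking the library; a condensed sketch (certificate loading and error handling trimmed):

// tlsStreamLayer implements raft.StreamLayer over TLS.
type tlsStreamLayer struct {
    net.Listener // accepts TLS connections (from tls.Listen)
    cfg *tls.Config
}

func (t *tlsStreamLayer) Dial(addr raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
    d := &net.Dialer{Timeout: timeout}
    return tls.DialWithDialer(d, "tcp", string(addr), t.cfg)
}

func newTLSTransport(bindAddr string, tlsCfg *tls.Config) (*raft.NetworkTransport, error) {
    ln, err := tls.Listen("tcp", bindAddr, tlsCfg)
    if err != nil {
        return nil, err
    }
    return raft.NewNetworkTransport(&tlsStreamLayer{Listener: ln, cfg: tlsCfg}, 3, 10*time.Second, os.Stderr), nil
}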


📝 Documentation Gaps

26. Missing ADR Reference

The PR description mentions adr023_ha-failover but the ADR isn't linked or included.
Recommendation: Ensure ADR is committed and referenced in code comments.

27. Configuration Documentation

The pkg/config/config.go adds many RAFT flags but lacks examples of valid configurations.
Recommendation: Add example TOML/YAML configs for:

  • 3-node RAFT cluster
  • 5-node RAFT cluster
  • Single node vs HA mode

28. Operational Runbook Needed

This introduces complex operational scenarios:

  • How to add/remove nodes from cluster
  • How to recover from split-brain
  • How to perform rolling upgrades
  • Backup/restore procedures for RAFT state

Recommendation: Create operational documentation before production use.


✅ What's Done Well

  1. Clean abstraction layers - The interface boundaries are well-defined
  2. Comprehensive E2E test - TestLeaseFailoverE2E exercises real failover scenarios
  3. Proper context propagation - Most functions correctly use context
  4. Good error wrapping - Errors include context with %w
  5. Structured logging - Consistent use of zerolog with context
  6. Graceful shutdown - Proper cleanup with errgroup and defers
  7. Configuration flexibility - Supports RAFT + non-RAFT modes

🎬 Final Recommendations

Priority Order:

  1. Immediate (before merge):

  2. Before Production:

  3. Post-Launch Improvements:

Overall Assessment

This is a well-architected implementation of RAFT-based HA. The core design is sound, but there are critical issues around synchronization safety and performance that must be addressed before production use. The test coverage gap is concerning for such a critical feature.

The 41% patch coverage is insufficient for consensus code. I strongly recommend increasing coverage to at least 80% for pkg/raft/* and node/failover.go before considering this production-ready.


@evstack evstack deleted a comment from claude bot Nov 19, 2025
@alpe alpe requested a review from julienrbrt November 19, 2025 11:35
Member

@julienrbrt julienrbrt left a comment

first walkthrough, I'll go read about github.com/hashicorp/raft and come back to review election.go and node.go

return nil
}

// Height returns the current height stored
Member

Why do we need to know the height of the p2p (go header) store? (I am still reviewing, this may get clearer). We can have the app height from the evolve store.

Contributor Author

When the node switches from sync to aggregator mode, the internal state is key to preventing double signing.
The Syncer now has an isCatchingUpState method that checks the stores for any height > current.
It is called within the leader election loop to transfer leadership in case the node is not fully synced yet.

}

// SetApplyCallback sets a callback to be called when log entries are applied
func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
Member

nit: what is this for? the go doc is very light

Contributor Author

The channel is passed by the syncer to receive first-level state updates from within the raft cluster. This should be the fastest communication channel available.

}()

// Check raft leadership if raft is enabled
if e.raftNode != nil && !e.raftNode.IsLeader() {
Member

@julienrbrt julienrbrt Nov 19, 2025

unrelated: I wonder how this will play with different sequencers.
In #2797 you can get to that path without a node key (to sign). I suppose we'll need to add a condition for based sequencing.

Contributor Author

Yes, I was only preparing for the single sequencer. Based sequencing would not work with raft as there are no aggregators.

leaderFactory := func() (raftpkg.Runnable, error) {
logger.Info().Msg("Starting aggregator-MODE")
nodeConfig.Node.Aggregator = true
nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode
Member

not sure I understand this. is the aggregator broadcasting to no one?

Contributor

The aggregator is required to broadcast to at least one node that is part of a larger mesh, otherwise p2p will not work.

Contributor Author

This is more about who calls whom: the aggregator gets called, not the other way around. Starting all nodes with a p2p-peer setup makes sense, though. When an HA cluster is set up, the raft leader takes the aggregator role, and I clear the peers when the p2p stack is restarted.
There is an error thrown somewhere when peers are not empty.

node/full.go Outdated
func initRaftNode(nodeConfig config.Config, logger zerolog.Logger) (*raftpkg.Node, error) {
raftDir := nodeConfig.Raft.RaftDir
if raftDir == "" {
raftDir = filepath.Join(nodeConfig.RootDir, "raft")
Member

nit: we should be using the DefaultConfig() value if empty.

bc *block.Components
}

func newSyncMode(
Member

nit: I was a tiny bit confused that this was moved here instead of full.go

Contributor Author

These are the constructors. Naming could be better, I guess.

}
return setupFailoverState(nodeConfig, nodeKey, database, genesis, logger, mainKV, rktStore, blockComponentsFn, raftNode)
}
func newAggregatorMode(
Member

ditto

return fmt.Errorf("not leader")
}

data, err := json.Marshal(state) // todo:use protobuf
Member

why the todo? size?

Contributor

We should migrate to protobuf here. JSON will cause overhead; at 100ms block times we need to minimise it as much as possible.

* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
alpe added 3 commits November 20, 2025 17:24
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Overview

Speed up cache write/loads via parallel execution.  

Pulled from  #2836
github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Overview

Minor updates to make it easier to trace errors

Extracted from #2836
alpe added 4 commits November 24, 2025 16:21
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)