
Conversation

Contributor

@alpe alpe commented Nov 12, 2025

Implement failover via RAFT

  • Improve Cache startup/shutdown with parallelization
  • Publish to RAFT cluster in executor
  • Sync DB after each block created in executor
  • Add new RaftReceiver to sync when in aggregator follower mode
  • Introduce failoverState to switch between follower/leader mode
  • Provide RAFT node details via an HTTP endpoint

Contributor

github-actions bot commented Nov 12, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Nov 24, 2025, 4:28 PM


codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 41.34831% with 522 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.15%. Comparing base (8cd0fb8) to head (50d7caf).

Files with missing lines | Patch % | Lines
pkg/raft/node.go | 12.50% | 168 Missing ⚠️
pkg/raft/node_mock.go | 45.08% | 74 Missing and 21 partials ⚠️
block/internal/syncing/raft_retriever.go | 0.00% | 60 Missing ⚠️
node/full.go | 32.81% | 36 Missing and 7 partials ⚠️
node/failover.go | 74.45% | 22 Missing and 13 partials ⚠️
block/internal/syncing/syncer.go | 28.88% | 30 Missing and 2 partials ⚠️
block/internal/executing/executor.go | 6.25% | 25 Missing and 5 partials ⚠️
pkg/raft/election.go | 79.26% | 12 Missing and 5 partials ⚠️
pkg/rpc/server/http.go | 6.66% | 13 Missing and 1 partial ⚠️
block/internal/syncing/assert.go | 57.89% | 4 Missing and 4 partials ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2836      +/-   ##
==========================================
- Coverage   64.76%   62.15%   -2.61%     
==========================================
  Files          81       88       +7     
  Lines        7350     8065     +715     
==========================================
+ Hits         4760     5013     +253     
- Misses       2049     2475     +426     
- Partials      541      577      +36     
Flag | Coverage Δ
combined | 62.15% <41.34%> (-2.61%) ⬇️

Flags with carried forward coverage won't be shown.


alpe added 4 commits November 17, 2025 10:55
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
@alpe alpe changed the title from "[WIP] feat: High availability via RAFT" to "feat: High availability via RAFT" on Nov 17, 2025
alpe added 4 commits November 18, 2025 11:00
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
@alpe alpe marked this pull request as ready for review November 19, 2025 09:54
Contributor Author

alpe commented Nov 19, 2025

@claude review again

Contributor

claude bot commented Nov 19, 2025

Claude finished @alpe's task.


Comprehensive Code Review: RAFT-Based High Availability

I've completed a thorough review of the RAFT implementation for high availability. This is a substantial and well-architected feature. Below is my detailed feedback organized by priority and category.


🔴 Critical Issues

1. JSON Serialization Performance Bottleneck (pkg/raft/node.go:211)

Issue: Using JSON for RAFT log replication at 100ms block intervals creates unnecessary overhead.

data, err := json.Marshal(state) // todo:use protobuf

Impact: At 10 blocks/second, JSON marshaling/unmarshaling adds measurable latency. This will compound under load.
Recommendation: Migrate to protobuf immediately. The TODO comment acknowledges this, but it's critical for production. JSON can add 2-5x overhead vs protobuf for binary data.
As noted by @tac0turtle: This should be prioritized before launch.
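A minimal sketch of the protobuf direction, using google.golang.org/protobuf/proto; pb.RaftBlockState and toProto are hypothetical helpers that would come from a .proto definition mirroring RaftBlockState:

// Inside the publish path: replace the JSON round-trip with protobuf.
// pb.RaftBlockState / toProto are illustrative, not part of this diff.
data, err := proto.Marshal(toProto(state))
if err != nil {
    return fmt.Errorf("marshal block state: %w", err)
}
return n.raft.Apply(data, n.config.SendTimeout).Error()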

2. Race Condition Risk in Leadership Transfer (pkg/raft/election.go:96-104)

Issue: The leadership transfer logic has a potential race condition:

raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
if !raftSynced || !runnable.IsSynced(d.node.GetState()) {
    if err := d.node.leadershipTransfer(); err != nil && !errors.Is(err, raft.ErrNotLeader) {
        return err
    }
    continue
}

Problem: Between checking IsSynced() and starting leader operations, new blocks could arrive via P2P/DA that aren't yet in RAFT state. This could lead to double-signing on stale state.
Recommendation: Add a final sync check immediately before starting leader operations, or use a sync barrier pattern.
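One possible shape for that re-check, reusing the helpers already shown above and placed immediately before the leader runnable takes over (the placement, not the helpers, is the suggestion):

// Final gate right before leader operations start: if anything new landed via
// RAFT/P2P/DA since the first check, hand leadership back instead of producing
// on stale state.
if d.node.waitForMsgsLanded(d.node.Config().SendTimeout) != nil ||
    !runnable.IsSynced(d.node.GetState()) {
    if err := d.node.leadershipTransfer(); err != nil && !errors.Is(err, raft.ErrNotLeader) {
        return err
    }
    continue
}
// ... hand control to the leader runnable only after this point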

3. Store Sync Without Error Handling Context (block/internal/executing/executor.go:461-463)

Issue:

if err := e.store.Sync(context.Background()); err != nil {
    return fmt.Errorf("failed to sync store: %w", err)
}

Problem: Using context.Background() ignores parent context cancellation. If the executor is shutting down, we could block indefinitely.
Recommendation: Use the executor's context: e.ctx or pass through the parent context.
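For example, assuming the executor carries its lifecycle context as e.ctx, as the recommendation suggests:

if err := e.store.Sync(e.ctx); err != nil {
    return fmt.Errorf("failed to sync store: %w", err)
}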

4. Missing Validation in FSM.Apply (pkg/raft/node.go:285-287)

Issue: The validation only checks basic monotonicity:

if err := f.state.Load().assertValid(state); err != nil {
    return err
}

Problem: Looking at types.go:15-23, assertValid only checks height/timestamp ordering, not chain continuity (previous hash, state root, etc.).
Recommendation: Add validation for the following (see the sketch after this list):

  • Block hash matches header hash
  • Previous block hash linkage
  • State root continuity
  • Signature verification (or document why this is skipped)
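A hedged sketch of what the extended check could look like; the field names on RaftBlockState (HeaderHash, LastHeaderHash, Timestamp) are assumptions about the struct, not its actual shape:

// assertValid with chain-continuity checks added on top of the existing
// height/timestamp ordering. Field names are illustrative only.
func (prev *RaftBlockState) assertValid(next *RaftBlockState) error {
    if next.Height <= prev.Height {
        return fmt.Errorf("height not increasing: %d -> %d", prev.Height, next.Height)
    }
    if !next.Timestamp.After(prev.Timestamp) {
        return fmt.Errorf("timestamp not increasing")
    }
    if next.Height == prev.Height+1 && !bytes.Equal(next.LastHeaderHash, prev.HeaderHash) {
        return fmt.Errorf("previous header hash mismatch at height %d", next.Height)
    }
    // State-root continuity and signature verification (or a documented reason
    // for skipping them) would slot in here.
    return nil
}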

🟡 High Priority Issues

5. Channel Buffer Size Risk (block/internal/syncing/raft_retriever.go:58)

Issue:

applyCh := make(chan raft.RaftApplyMsg, 1)

Problem: Buffer of 1 is very small. If RAFT applies logs faster than the syncer can process (e.g., during catch-up after network partition), this will block RAFT FSM.Apply.
Recommendation: Either:

  • Increase buffer to 100+ to handle burst scenarios
  • Document the performance implications of buffer=1
  • Add metrics to track channel saturation (sketched below)
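A sketch combining a larger buffer with a saturation signal; the buffer size and the raftApplyChFull counter are illustrative:

// Larger buffer so catch-up bursts don't block FSM.Apply.
applyCh := make(chan raft.RaftApplyMsg, 128)

// At the send site: prefer a non-blocking send and record when the consumer is behind.
select {
case applyCh <- msg:
default:
    raftApplyChFull.Inc() // hypothetical counter tracking channel saturation
    applyCh <- msg        // still deliver; the metric just makes the backpressure visible
}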

6. Incomplete isCatchingUpState Implementation (block/internal/syncing/syncer.go:171-183)

Issue: Per @julienrbrt's comment, this doesn't capture DA catch-up state:

return s.headerStore.Store().Height() > currentHeight ||
    s.dataStore.Store().Height() > currentHeight ||
    s.raftRetriever != nil && s.raftRetriever.Height() > currentHeight

Problem: If the node is catching up from DA but p2p/raft are synced, it will incorrectly report as synced.
Response to @alpe: While DA is slowest, nodes can receive blocks via P2P faster than DA. The sync state should be comprehensive.
Recommendation: Add DA height check or clearly document the assumption that RAFT nodes don't depend on DA for sync.
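If the DA check is the chosen route, the expression above could grow one more clause; daHeight() is a hypothetical accessor for the highest height observed via DA, not something in the current diff:

return s.headerStore.Store().Height() > currentHeight ||
    s.dataStore.Store().Height() > currentHeight ||
    (s.raftRetriever != nil && s.raftRetriever.Height() > currentHeight) ||
    s.daHeight() > currentHeight // hypothetical: highest block height seen via DA retrieval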

7. No Cleanup on Apply Callback Errors (pkg/raft/node.go:279-300)

Issue: FSM.Apply returns errors but doesn't have cleanup/rollback logic:

if err := json.Unmarshal(log.Data, &state); err != nil {
    f.logger.Error().Err(err).Msg("unmarshal block state")
    return err
}

Problem: If unmarshal fails, the log is applied in RAFT but not in FSM state. This creates inconsistency.
Recommendation: Document RAFT's error handling expectations. Consider if this should panic instead (RAFT expects FSM to be deterministic).
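If determinism wins the argument, the handler could fail loudly instead of returning an error; whether to panic here is a policy decision, the sketch only shows the shape:

var state RaftBlockState
if err := json.Unmarshal(log.Data, &state); err != nil {
    // An undecodable log entry means the replicated log and the FSM can no longer
    // agree; crashing keeps the FSM deterministic instead of silently diverging.
    f.logger.Error().Err(err).Uint64("index", log.Index).Msg("unmarshal block state")
    panic(fmt.Sprintf("raft FSM: undecodable log entry at index %d: %v", log.Index, err))
}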

8. Follower Mode Doesn't Validate Block Signatures (block/internal/syncing/raft_retriever.go:99-128)

Issue:

if err := header.Header.ValidateBasic(); err != nil {
    r.logger.Debug().Err(err).Msg("invalid header structure")
    return nil  // Silent skip
}

Problem: Only validates structure and proposer address, not cryptographic signature. Followers trust the leader without verification.
Security Risk: A compromised/buggy leader could propagate invalid blocks.
Recommendation: Add signature verification or document the trust model explicitly.
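A hedged sketch of follower-side verification; header.Verify and r.proposerPubKey are placeholders for whatever signed-header verification the P2P sync path already uses, not confirmed APIs:

if err := header.Header.ValidateBasic(); err != nil {
    r.logger.Debug().Err(err).Msg("invalid header structure")
    return nil
}
// Also check the aggregator's signature before trusting the leader's payload.
if err := header.Verify(r.proposerPubKey); err != nil { // placeholder verification call
    r.logger.Warn().Err(err).Msg("header signature verification failed; dropping raft payload")
    return nil
}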


🟢 Medium Priority Issues

9. Non-Atomic Mode Switching (node/failover.go:177-231)

Issue: The failoverState.Run starts multiple services sequentially with deferred cleanup:

if err := f.p2pClient.Start(ctx); err != nil {
    return fmt.Errorf("start p2p: %w", err)
}
defer f.p2pClient.Close()
// ... more starts with defers

Problem: If startup fails midway, cleanup order may not match startup order. Defers run in reverse, but some services may have dependencies.
Recommendation: Use errgroup for parallel startup and explicit cleanup order on error.
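A sketch of the errgroup approach (import golang.org/x/sync/errgroup); whether every service can actually start in parallel depends on their real dependencies:

g, gctx := errgroup.WithContext(ctx)
g.Go(func() error {
    if err := f.p2pClient.Start(gctx); err != nil {
        return fmt.Errorf("start p2p: %w", err)
    }
    return nil
})
g.Go(func() error {
    // ... start the next independent service here
    return nil
})
if err := g.Wait(); err != nil {
    // Tear down whatever did come up in an explicit, documented order
    // instead of relying on defer ordering.
    f.p2pClient.Close()
    return err
}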

10. P2P Peer Clearing Side Effect (node/full.go:100)

Issue:

nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode

Problem: This mutates shared config that may be used elsewhere. Side effects on config structs can cause subtle bugs.
Recommendation: Create a copy of the config or use a dedicated flag for peer mode.
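For example, mutate a local copy rather than the shared value (assuming config.Config copies safely by value; any reference-typed fields would need a deeper copy):

cfg := nodeConfig          // work on a copy; the caller's config stays untouched
cfg.Node.Aggregator = true
cfg.P2P.Peers = ""         // cleared only for this aggregator instance
// pass cfg (not nodeConfig) to the aggregator-mode constructor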

11. Missing Metrics for RAFT Operations

Observation: No custom metrics for:

  • RAFT leader elections
  • Leadership transfer events
  • FSM apply latency
  • RAFT log size/lag
  • Failover events

Recommendation: Add metrics for operational visibility. RAFT-based systems need monitoring for split-brain detection and performance issues.
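A sketch of the kind of instruments that would cover the list above, using prometheus/client_golang (metric names are suggestions):

var (
    raftLeaderElections = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "evnode_raft_leader_elections_total",
        Help: "Times this node won a RAFT leader election.",
    })
    raftLeadershipTransfers = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "evnode_raft_leadership_transfers_total",
        Help: "Voluntary leadership transfers (e.g. leader not synced).",
    })
    raftFSMApplyLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "evnode_raft_fsm_apply_seconds",
        Help:    "Latency of FSM.Apply per log entry.",
        Buckets: prometheus.DefBuckets,
    })
)

func init() {
    prometheus.MustRegister(raftLeaderElections, raftLeadershipTransfers, raftFSMApplyLatency)
}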

12. waitForMsgsLanded Polling Efficiency (pkg/raft/node.go:148-167)

Issue:

ticker := time.NewTicker(min(n.config.SendTimeout, timeout) / 2)
for {
    select {
    case <-ticker.C:
        if n.raft.AppliedIndex() >= n.raft.LastIndex() {
            return nil
        }

Problem: Polling with ticker when RAFT likely has better mechanisms for notification.
Recommendation: Investigate if hashicorp/raft provides a channel or callback for AppliedIndex updates to avoid polling.
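hashicorp/raft does expose Barrier, which blocks until every previously submitted log entry has been applied to the FSM; a sketch of using it in place of the polling loop:

// waitForMsgsLanded without polling: Barrier returns once all preceding log
// entries have been applied to the FSM, or fails when the timeout elapses.
func (n *Node) waitForMsgsLanded(timeout time.Duration) error {
    if err := n.raft.Barrier(timeout).Error(); err != nil {
        return fmt.Errorf("raft barrier: %w", err)
    }
    return nil
}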


🔵 Code Quality & Best Practices

13. Inconsistent Nil Checks

Several functions start with:

if n == nil {
    return nil
}

This pattern appears in:

  • node.go:108-110, 149-151, 169-172, 177-180

Issue: Calling methods on nil receivers is valid in Go but often indicates a design smell. Better to panic early or return errors.
Recommendation: Document why nil nodes are allowed, or remove this pattern and require non-nil invariant.

14. Error Message Consistency

  • node.go:331: "not raft leader" (lowercase, no article)
  • election.go:100: "became leader, but not synced. Pass on leadership" (informal tone)

Recommendation: Standardize error messages. Use sentence case and consistent terminology.

15. Magic Numbers in Configuration

  • election.go:83: ticker := time.NewTicker(300 * time.Millisecond)
  • failover.go:158-160: ReadTimeout: 10 * time.Second, WriteTimeout: 10 * time.Second, IdleTimeout: 120 * time.Second
  • node.go:58: raftConfig.LeaderLeaseTimeout = cfg.HeartbeatTimeout / 2

Recommendation: Move to named constants or configuration parameters with explanatory comments.
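For instance (constant names are suggestions):

const (
    electionCheckInterval = 300 * time.Millisecond // election.go leadership-check ticker
    httpReadTimeout       = 10 * time.Second
    httpWriteTimeout      = 10 * time.Second
    httpIdleTimeout       = 120 * time.Second
)
// node.go: keep the derivation but explain it, e.g.
// "LeaderLeaseTimeout must not exceed HeartbeatTimeout per hashicorp/raft's config validation".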

16. Duplicate Code in Failover Factories (node/failover.go:40-107)

The newSyncMode and newAggregatorMode functions are nearly identical with only the component builder differing.
Recommendation: Extract common logic into a helper function to reduce duplication.

17. Verbose Deferred Error Handling

Pattern appears frequently:

defer func() {
    if err := f.bc.Stop(); err != nil && !errors.Is(err, context.Canceled) {
        rerr = errors.Join(rerr, fmt.Errorf("stopping block components: %w", err))
    }
}()

Recommendation: Consider a helper function deferStop to reduce verbosity.
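A possible shape for such a helper (name and signature are suggestions):

// deferStop wraps the repeated "stop, ignore context.Canceled, join into rerr" pattern.
func deferStop(rerr *error, what string, stop func() error) {
    if err := stop(); err != nil && !errors.Is(err, context.Canceled) {
        *rerr = errors.Join(*rerr, fmt.Errorf("stopping %s: %w", what, err))
    }
}

// usage:
//   defer deferStop(&rerr, "block components", f.bc.Stop)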


📋 Testing & Coverage Observations

18. Coverage Drop is Concerning

  • Overall coverage: 64.76% → 62.24% (-2.52%)
  • Patch coverage: 41.34% (522 lines missing)
  • Core files with low coverage:
    • pkg/raft/node.go: 12.50% (168 missing)
    • block/internal/syncing/raft_retriever.go: 0.00% (60 missing)
    • node/failover.go: 74.45% (35 missing)

Critical Gap: The RAFT retriever has zero coverage. This is a critical component that handles consensus data.

Recommendation:

  • Add unit tests for raft_retriever.go covering:
    • Invalid block handling
    • Concurrent apply operations
    • Stop/restart scenarios
  • Add integration tests for:
    • Leader failover with transactions in flight
    • Network partition scenarios
    • Split-brain prevention
  • The E2E test TestLeaseFailoverE2E is good but insufficient alone

19. Test Coverage for Race Conditions

The election logic has complex concurrent state transitions but tests don't cover:

  • Rapid leader/follower oscillation
  • Multiple leadership changes in quick succession
  • Context cancellation during state transition

Recommendation: Add fuzzing or stress tests with -race flag enabled.


🎯 Architecture & Design

20. Excellent Separation of Concerns

✅ The separation between DynamicLeaderElection, Node, and failoverState is clean.
✅ Using factory functions for leader/follower construction is elegant.
✅ The Runnable interface abstracts the mode nicely.

21. Good Use of Atomic State

✅ FSM.state uses atomic.Pointer[RaftBlockState] correctly.
✅ running flag properly synchronized.

22. RAFT Integration Strategy

✅ Using hashicorp/raft is solid - production-proven library.
✅ TCP transport with BoltDB backend is appropriate for this scale.
✅ Snapshot configuration is sensible.

23. Failover Design Question

The current design uses RAFT for leader election but still broadcasts to P2P.

Question: Why maintain P2P broadcast from leader to followers if RAFT is already replicating state?
Benefit: Redundancy and allows non-RAFT nodes to sync
Tradeoff: Double bandwidth for RAFT cluster members
Recommendation: Document this design decision. Consider allowing RAFT-only mode where followers don't need P2P.


🔒 Security Review

24. Limited Byzantine Fault Tolerance

RAFT provides crash fault tolerance (CFT), not Byzantine fault tolerance (BFT). If a leader is compromised and signs invalid blocks:

  • Followers will replicate invalid state
  • No proof-of-fraud mechanism exists

Mitigation: The block signature verification in the executor provides some protection, but followers in RAFT mode don't verify.
Recommendation: Document trust assumptions and consider if BFT consensus is needed for production.

25. No TLS on RAFT Transport

transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)

Issue: RAFT traffic is unencrypted. Leader-follower communication includes block data and consensus messages.
Recommendation: Add TLS configuration for production deployments.
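hashicorp/raft accepts custom transports through raft.NewNetworkTransport and the raft.StreamLayer interface, so TLS can be added without forking the library; a condensed sketch (certificate loading and error handling trimmed):

// tlsStreamLayer implements raft.StreamLayer over TLS.
type tlsStreamLayer struct {
    net.Listener // accepts TLS connections (from tls.Listen)
    cfg *tls.Config
}

func (t *tlsStreamLayer) Dial(addr raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
    d := &net.Dialer{Timeout: timeout}
    return tls.DialWithDialer(d, "tcp", string(addr), t.cfg)
}

func newTLSTransport(bindAddr string, tlsCfg *tls.Config) (*raft.NetworkTransport, error) {
    ln, err := tls.Listen("tcp", bindAddr, tlsCfg)
    if err != nil {
        return nil, err
    }
    return raft.NewNetworkTransport(&tlsStreamLayer{Listener: ln, cfg: tlsCfg}, 3, 10*time.Second, os.Stderr), nil
}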


📝 Documentation Gaps

26. Missing ADR Reference

The PR description mentions adr023_ha-failover but the ADR isn't linked or included.
Recommendation: Ensure ADR is committed and referenced in code comments.

27. Configuration Documentation

The pkg/config/config.go adds many RAFT flags but lacks examples of valid configurations.
Recommendation: Add example TOML/YAML configs for:

  • 3-node RAFT cluster
  • 5-node RAFT cluster
  • Single node vs HA mode

28. Operational Runbook Needed

This introduces complex operational scenarios:

  • How to add/remove nodes from cluster
  • How to recover from split-brain
  • How to perform rolling upgrades
  • Backup/restore procedures for RAFT state

Recommendation: Create operational documentation before production use.


✅ What's Done Well

  1. Clean abstraction layers - The interface boundaries are well-defined
  2. Comprehensive E2E test - TestLeaseFailoverE2E exercises real failover scenarios
  3. Proper context propagation - Most functions correctly use context
  4. Good error wrapping - Errors include context with %w
  5. Structured logging - Consistent use of zerolog with context
  6. Graceful shutdown - Proper cleanup with errgroup and defers
  7. Configuration flexibility - Supports RAFT + non-RAFT modes

🎬 Final Recommendations

Priority Order:

  1. Immediate (before merge):

  2. Before Production:

  3. Post-Launch Improvements:

Overall Assessment

This is a well-architected implementation of RAFT-based HA. The core design is sound, but there are critical issues around synchronization safety and performance that must be addressed before production use. The test coverage gap is concerning for such a critical feature.

The 41% patch coverage is insufficient for consensus code. I strongly recommend increasing coverage to at least 80% for pkg/raft/* and node/failover.go before considering this production-ready.


@evstack evstack deleted a comment from claude bot Nov 19, 2025
@alpe alpe requested a review from julienrbrt November 19, 2025 11:35
Member

@julienrbrt julienrbrt left a comment

first walkthrough, I'll go read about github.com/hashicorp/raft and come back to review election.go and node.go

return nil
}

// Height returns the current height stored
Member

Why do we need to know the height of the p2p (go header) store? (I am still reviewing, this may get clearer). We can have the app height from the evolve store.

Contributor Author

When the node switches from sync to aggregator mode, the internal state is key to preventing double signing.
The Syncer now has an isCatchingUpState method that checks the stores for any height > current.
It is called within the leader election loop to transfer leadership in case the node is not fully synced yet.

}

// SetApplyCallback sets a callback to be called when log entries are applied
func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
Member

nit: what is this for? the go doc is very light

Contributor Author

The channel is passed by the syncer to receive first-level state updates from within the raft cluster. This should be the fastest communication channel available.

}()

// Check raft leadership if raft is enabled
if e.raftNode != nil && !e.raftNode.IsLeader() {
Member

@julienrbrt julienrbrt Nov 19, 2025

unrelated: I wonder how this will play with different sequencers.
In #2797 you can get to that path without a node key (to sign). I suppose we'll need to add a condition for based sequencing.

Contributor Author

Yes, I was only preparing for the single sequencer. Based sequencing would not work with raft as there are no aggregators.

leaderFactory := func() (raftpkg.Runnable, error) {
logger.Info().Msg("Starting aggregator-MODE")
nodeConfig.Node.Aggregator = true
nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode
Member

not sure I understand this. is the aggregator broadcasting to no one?

Contributor

The aggregator is required to broadcast to at least one node that is part of a larger mesh, otherwise p2p will not work.

Contributor Author

This is more about who calls whom: the aggregator gets called, not the other way around. Starting all nodes with a p2p-peer setup makes sense, though. When an HA cluster is set up, the raft leader takes the aggregator role, and I clear the peers when the p2p stack is restarted.
There is an error thrown somewhere when peers are not empty.

node/full.go Outdated
func initRaftNode(nodeConfig config.Config, logger zerolog.Logger) (*raftpkg.Node, error) {
raftDir := nodeConfig.Raft.RaftDir
if raftDir == "" {
raftDir = filepath.Join(nodeConfig.RootDir, "raft")
Member

nit: we should be using the DefaultConfig() value if empty.

bc *block.Components
}

func newSyncMode(
Member

nit: I was a tiny bit confused that this was moved here instead of full.go

Contributor Author

These are the constructors. Naming could be better, I guess.

}
return setupFailoverState(nodeConfig, nodeKey, database, genesis, logger, mainKV, rktStore, blockComponentsFn, raftNode)
}
func newAggregatorMode(
Member

ditto

return fmt.Errorf("not leader")
}

data, err := json.Marshal(state) // todo:use protobuf
Member

why the todo? size?

Contributor

We should migrate to protobuf here. JSON will cause overhead; at 100ms block times we need to minimise it as much as possible.

* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
alpe added 3 commits November 20, 2025 17:24
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Overview

Speed up cache write/loads via parallel execution.  

Pulled from  #2836
github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Overview

Minor updates to make it easier to trace errors

Extracted from #2836
alpe added 4 commits November 24, 2025 16:21
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)