chore: sync store on new block #2869
Conversation
Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a critical change to enhance data persistence by ensuring that the store's state is explicitly synced to disk after every block is processed and committed. This measure is designed to prevent potential data inconsistencies and issues like double signing that could arise from unpersisted state, with the understanding that there might be an impact on overall performance.
Claude finished @alpe's task

Comprehensive Code Review

I've completed a thorough review of the store sync implementation. This PR adds a critical safety feature to prevent double signing by ensuring state is persisted to disk after each block. Below is my detailed analysis:

🎯 Overview Assessment

Purpose: Add an explicit `Sync()` call after each block is committed so the store's state is durably persisted.

Performance Impact: Benchmarks show a roughly 2.5-4.7x slowdown (empty: 12µs → 43µs, small: 14µs → 66µs, large: 110µs → 272µs). This is a significant tradeoff between safety and throughput.

🔍 Detailed Findings

1. Code Quality & Architecture
| Scenario | Without Sync | With Sync | Overhead |
|---|---|---|---|
| Empty | 12µs | 43µs | 3.6x slower |
| Small (10 txs) | 14µs | 66µs | 4.7x slower |
| Large (1000 txs) | 110µs | 272µs | 2.5x slower |
Analysis:
- BadgerDB uses a Write-Ahead Log (WAL) that provides crash recovery by default
- The `Sync()` call forces an `fsync()` system call, flushing the OS page cache to physical disk
- This is necessary for true durability but comes at a steep cost
Tradeoff Considerations (addressing @tac0turtle's concern):
The current BadgerDB setup already has:
- ✅ WAL for crash recovery
- ✅ Memory table (MemTable) for buffering
- ❌ No explicit sync after commits (relies on WAL + background flush)
Without explicit Sync: Crash could lose the last committed block → node restarts, may double-sign if it doesn't detect the uncommitted block in WAL
With explicit Sync: Guarantees durability but reduces throughput by 60-80%
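For concreteness, here is a sketch of the two durability knobs BadgerDB exposes; the paths and option choices are illustrative, and the PR actually goes through the store's datastore wrapper rather than calling Badger directly:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Knob 1: fsync on every write. Maximum durability, slowest path.
	_ = badger.DefaultOptions("/tmp/ev-data").WithSyncWrites(true)

	// Knob 2 (roughly the PR's approach): leave SyncWrites off and force
	// an explicit flush at chosen points, e.g. after a block commit.
	db, err := badger.Open(badger.DefaultOptions("/tmp/ev-data"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// ... commit the block's write batch here ...

	// Force pending writes out of the OS page cache onto physical disk.
	if err := db.Sync(); err != nil {
		log.Fatal(err)
	}
}
```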
Potential Optimizations:
- Conditional Sync: Only sync every N blocks (e.g., every 10 blocks); see the sketch after this list
- Async Sync: Fire-and-forget sync in goroutine with monitoring
- Batched Sync: Accumulate multiple blocks, sync once
- Double-sign detection: Enhance restart logic to detect partial state in WAL
Reference: benchmark_test.go:1-71, executor.go:416
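A minimal sketch of the first option, conditional sync; `syncInterval`, `maybeSync`, and the trimmed-down `Store` interface are hypothetical names for illustration, not the PR's code:

```go
package executing

import (
	"context"
	"fmt"
)

// Store is the subset of the store interface this sketch needs.
type Store interface {
	Sync(ctx context.Context) error
}

// syncInterval is a hypothetical tuning knob: fsync every N blocks.
const syncInterval = 10

type Executor struct {
	store Store
}

// maybeSync pays the fsync cost only on every syncInterval-th block,
// relying on BadgerDB's WAL for crash recovery in between.
func (e *Executor) maybeSync(ctx context.Context, height uint64) error {
	if height%syncInterval != 0 {
		return nil
	}
	if err := e.store.Sync(ctx); err != nil {
		return fmt.Errorf("failed to sync store at height %d: %w", height, err)
	}
	return nil
}
```

The tradeoff: a crash can now lose up to syncInterval-1 blocks of explicitly synced state, so the restart logic must tolerate recovering those blocks from the WAL.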
5. Testing & Reliability ⚠️
✅ Positives:
- Excellent benchmark implementation covering empty/small/large blocks
- Uses table-driven tests
- Proper cleanup with `b.Cleanup()`
- Benchmarks use realistic block data via `types.GetRandomBlock()`
❌ Missing Coverage:
- No unit tests for the `Sync()` method itself, only benchmarks
- No error case testing: what happens when `Sync()` fails?
- No integration test: does block production actually halt on sync failure?
- No concurrency tests: Can multiple syncs happen simultaneously?
- Coverage at 0%: All 4 new lines are untested in CI
Recommended Test Additions:

```go
// Unit test for Sync success
func TestStore_Sync_Success(t *testing.T) { ... }

// Unit test for Sync failure
func TestStore_Sync_Error(t *testing.T) { ... }

// Integration test: block production halts on sync failure
func TestExecutor_SyncFailure_HaltsProduction(t *testing.T) { ... }
```

Reference: benchmark_test.go, codecov report showing 0% patch coverage
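As a sketch of the missing error-path coverage, assuming nothing about the repo's mock package; `failingSyncer` and `commitAndSync` are hypothetical stand-ins for the mockery mock and the executor's post-commit step:

```go
package store_test

import (
	"context"
	"errors"
	"fmt"
	"testing"
)

// Syncer and failingSyncer stand in for the Store interface and a
// fault-injecting mock; the names are illustrative only.
type Syncer interface {
	Sync(ctx context.Context) error
}

type failingSyncer struct{ err error }

func (f failingSyncer) Sync(ctx context.Context) error { return f.err }

// commitAndSync mimics the executor's post-commit step: sync the store
// and wrap any failure so block production can halt on it.
func commitAndSync(ctx context.Context, s Syncer) error {
	if err := s.Sync(ctx); err != nil {
		return fmt.Errorf("failed to sync store: %w", err)
	}
	return nil
}

func TestCommitAndSync_PropagatesError(t *testing.T) {
	wantErr := errors.New("disk full")
	err := commitAndSync(context.Background(), failingSyncer{err: wantErr})
	if !errors.Is(err, wantErr) {
		t.Fatalf("expected wrapped sync error, got %v", err)
	}
}
```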
6. Documentation & Developer Experience ⚠️
Missing Documentation:
- Only a one-line godoc comment on the interface method (types.go:48-49):

```go
// Sync flushes the store state to disk
Sync(ctx context.Context) error
```
Should explain:
- When to call this (after critical state changes)
- Performance implications
- Error handling expectations
- Whether it's safe to call concurrently
- No inline comment explaining the safety vs. performance tradeoff at the call site (executor.go:416)
- Benchmark results are in the PR description but not in code comments or docs
Recommendation:

```go
// Sync forces a flush of all pending writes to durable storage.
// This is a blocking operation that calls fsync() on the underlying
// database, ensuring crash consistency at the cost of significant
// performance overhead (see benchmark_test.go for measurements).
//
// Call this after critical state transitions where data loss would
// cause safety violations (e.g., double signing).
//
// It is safe to call concurrently, but performance will degrade.
Sync(ctx context.Context) error
```

🏗️ Architectural Observations
Consistency with ev-node Patterns:
- ✅ Interface-first design: properly added to the `Store` interface in pkg/store
- ✅ Dependency injection: the executor receives the `Store` interface, not a concrete type
- ✅ Error propagation: uses `fmt.Errorf` with `%w` for error wrapping
- ✅ Mock generation: mocks updated via mockery
- ⚠️ Context usage: should use the executor's context, not `context.Background()`
📊 Priority Recommendations
High Priority (Before Merge):
- Fix context usage (executor.go:416): use `e.ctx` or add a timeout to prevent indefinite hangs
- Add unit tests: test both success and failure paths for `Sync()`
- Document the interface method: explain when/why to call it and the performance implications

Medium Priority (Consider for this PR):
- Decide on API design: should this be part of `Commit()` instead of a separate method?
- Add integration test: verify block production halts on sync failure
- Benchmark on target hardware: ARM64 macOS results may not reflect production Linux performance

Low Priority (Future Work):
- Explore optimization strategies: async sync, batched sync, or conditional sync
- Add metrics: track sync duration and failures
- Consider configurability: let operators choose safety vs. performance
💭 Final Thoughts
This PR addresses a real safety concern (double signing after crash), and the implementation is clean and straightforward. However, the 2-5x performance penalty is substantial and warrants careful consideration.
Key Questions for the Team:
- Is the performance cost acceptable? What's the target block time and throughput?
- Does BadgerDB's WAL provide sufficient safety without explicit sync? (May need deeper analysis of BadgerDB crash recovery guarantees)
- Should sync be configurable? Allow operators to choose based on their risk tolerance?
- Can we detect uncommitted blocks on restart and avoid re-proposing rather than forcing sync? (Defensive approach at startup vs. expensive sync at runtime)
The code is production-ready from a correctness standpoint but needs:
- Context handling fix
- Test coverage
- Documentation
- Team decision on performance tradeoffs
Files Reviewed:
- block/internal/executing/executor.go:416-418 ⚠️ Context usage issue
- pkg/store/store.go:193-196 ✅ Clean implementation
- pkg/store/types.go:48-49 ⚠️ Needs documentation
- pkg/store/benchmark_test.go:1-71 ✅ Excellent benchmarks
- test/mocks/store.go ✅ Properly generated
Code Review
This pull request introduces a Sync operation after each block is committed to the store. This is a crucial change to ensure data is persisted to disk immediately, preventing potential state inconsistencies or double signing in case of a crash. The implementation adds a Sync method to the Store interface and calls it in the block production flow. My review focused on the correctness of this implementation and its implications. I have one comment regarding the use of context.Background() for the sync operation, highlighting a trade-off between data integrity on shutdown and the risk of hanging the shutdown process.
```go
if err := batch.Commit(); err != nil {
	return fmt.Errorf("failed to commit batch: %w", err)
}
if err := e.store.Sync(context.Background()); err != nil {
```
Using context.Background() here ensures the Sync operation completes even during a graceful shutdown, which is important for data integrity. However, this comes with a trade-off: if Sync were to hang, it would block the shutdown process indefinitely because it's not cancellable. While this might be an acceptable risk, a more robust solution could involve using a context with a timeout for this critical operation. This would prevent indefinite hangs while still giving the sync operation a grace period to complete during shutdown.
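A sketch of the timeout variant suggested here; the helper name and the 5-second grace period are illustrative assumptions, not the PR's code:

```go
package executing

import (
	"context"
	"fmt"
	"time"
)

// syncer is the one method this sketch needs from the store.
type syncer interface {
	Sync(ctx context.Context) error
}

// syncWithGracePeriod bounds the post-commit sync so a hung fsync cannot
// block a graceful shutdown indefinitely, while still detaching from the
// caller's (possibly already cancelled) context.
func syncWithGracePeriod(s syncer, grace time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	if err := s.Sync(ctx); err != nil {
		return fmt.Errorf("failed to sync store: %w", err)
	}
	return nil
}

// Hypothetical call site, replacing e.store.Sync(context.Background()):
//
//	if err := syncWithGracePeriod(e.store, 5*time.Second); err != nil { ... }
```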
Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage.
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2869      +/-   ##
==========================================
+ Coverage   64.94%   67.23%   +2.29%
==========================================
  Files          81       82       +1
  Lines        7262     8087     +825
==========================================
+ Hits         4716     5437     +721
- Misses       2001     2074      +73
- Partials      545      576      +31
```

☔ View full report in Codecov by Sentry.
```go
if err := batch.Commit(); err != nil {
	return fmt.Errorf("failed to commit batch: %w", err)
}
if err := e.store.Sync(context.Background()); err != nil {
```
how much does this cost in terms of overhead?
This is a fair question. I have not benchmarked this, as I can only provide random data which may be very different from what happens on a real chain.
But the question is more safety vs. speed. Without sync, there is a risk of double signing after a kill/restart.
We should benchmark this. If on restart we are losing the previous block and reproposing a new block, then I think it's an issue in how we handle the flow; taking a performance hit like this is something to be careful about. Before this change, we are already using the BadgerDB system with its WAL and memory store.
I have added a benchmark and updated the description with some results. You were right to ask for the benchmark. 👍
If the numbers are not acceptable for the use case, we could find some trade-off to balance safety against speed, with an async call to sync (😄) or a sync every n blocks, for example.
```go
	return data, nil
}

// Sync flushes the store state to disk
```
Shouldn't this be abstracted within Commit? We could keep this method out of the interface and just have Commit handle saving.
This is a good discussion point. I don't mind the additional method on the interface, but depending on the internal implementation, less data piles up at block production this way.
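To make the two API shapes under discussion concrete, a sketch; both interfaces are illustrative, not the actual pkg/store definitions:

```go
package store

import "context"

// Option A (the PR's approach): durability is a separate, explicit call,
// so callers decide when to pay the fsync cost.
type StoreWithSync interface {
	// ... existing methods ...
	Sync(ctx context.Context) error
}

// Option B (the suggestion above): fold durability into the commit path,
// keeping the interface smaller but making every commit pay the sync cost
// unless the implementation batches internally.
type StoreSyncOnCommit interface {
	// Commit persists the batch and fsyncs before returning.
	Commit(ctx context.Context) error
}
```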
Discussion in standup was we don't need this; can we close it?
Sorry, didn't mean to close.
Overview
After each block, the store is synced to ensure the state is persisted. This is key to preventing double signing on dirty state.
I would expect a cost to overall performance from this, but I don't have numbers.
Pulled from #2836
Some benchmark results on my box: