
Conversation

@alpe
Contributor

@alpe alpe commented Nov 20, 2025

Overview

After each block, the store is synced to ensure the state is persisted. This is key to preventing double signing after a restart with dirty state.
I would expect this to have some cost on overall performance, but I don't have numbers.

Pulled from #2836

Some benchmark results on my box:

goos: darwin
goarch: arm64
pkg: github.com/evstack/ev-node/pkg/store

BenchmarkStoreSync/WithoutSync_empty-12         	   96385	     12023 ns/op
BenchmarkStoreSync/WithSync_empty-12            	   33241	     42937 ns/op
BenchmarkStoreSync/WithoutSync_small-12         	   95199	     14016 ns/op
BenchmarkStoreSync/WithSync_small-12            	   25992	     66435 ns/op
BenchmarkStoreSync/WithoutSync_large-12         	   10000	    109723 ns/op
BenchmarkStoreSync/WithSync_large-12            	    7569	    272166 ns/op
PASS
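
For context on what the WithSync numbers are paying for: each synced commit adds an extra fsync to disk. A minimal standalone sketch of that comparison, written directly against BadgerDB rather than the repository's actual benchmark_test.go (package name, keys, and payload are illustrative only), could look like this:

package store_bench

import (
    "fmt"
    "testing"

    badger "github.com/dgraph-io/badger/v4"
)

// benchmarkCommit writes one small entry per iteration and optionally
// forces an fsync afterwards, mirroring the WithSync/WithoutSync split above.
func benchmarkCommit(b *testing.B, sync bool) {
    db, err := badger.Open(badger.DefaultOptions(b.TempDir()).WithLogger(nil))
    if err != nil {
        b.Fatal(err)
    }
    b.Cleanup(func() { _ = db.Close() })

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        werr := db.Update(func(txn *badger.Txn) error {
            return txn.Set([]byte(fmt.Sprintf("block/%d", i)), []byte("payload"))
        })
        if werr != nil {
            b.Fatal(werr)
        }
        if sync {
            // The explicit flush is the part this PR adds after each block.
            if serr := db.Sync(); serr != nil {
                b.Fatal(serr)
            }
        }
    }
}

func BenchmarkCommitWithoutSync(b *testing.B) { benchmarkCommit(b, false) }
func BenchmarkCommitWithSync(b *testing.B)    { benchmarkCommit(b, true) }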

@gemini-code-assist
Contributor

Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical change to enhance data persistence by ensuring that the store's state is explicitly synced to disk after every block is processed and committed. This measure is designed to prevent potential data inconsistencies and issues like double signing that could arise from unpersisted state, with the understanding that there might be an impact on overall performance.

Highlights

  • Store Interface Update: A new Sync method has been added to the Store interface, allowing explicit flushing of the store's state to disk (a rough sketch follows this list).
  • DefaultStore Implementation: The DefaultStore now includes an implementation for the Sync method, which internally calls the underlying database's Sync function to persist data.
  • Block Execution Persistence: After each block is produced and its batch committed, the Executor now explicitly calls store.Sync to ensure the state is immediately persisted to disk.
  • Mocking Support: The MockStore has been updated with a mock implementation for the new Sync method, facilitating testing.
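
As a hedged sketch of the shape of these changes, with the underlying database represented by a hypothetical syncer interface (the real definitions live in pkg/store/types.go and pkg/store/store.go, and the concrete datastore type may differ):

package store

import "context"

// Store is the interface the executor depends on (all other methods elided).
type Store interface {
    // Sync flushes any buffered writes to durable storage.
    Sync(ctx context.Context) error
}

// syncer stands in for whatever database handle DefaultStore wraps.
type syncer interface {
    Sync(ctx context.Context) error
}

// DefaultStore delegates Sync straight to the underlying database.
type DefaultStore struct {
    db syncer
}

func (s *DefaultStore) Sync(ctx context.Context) error {
    return s.db.Sync(ctx)
}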

@github-actions
Contributor

github-actions bot commented Nov 20, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build     | Format     | Lint      | Breaking  | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Nov 24, 2025, 2:34 PM

@claude
Contributor

claude bot commented Nov 20, 2025

Claude finished @alpe's task.


Comprehensive Code Review

I've completed a thorough review of the store sync implementation. This PR adds a critical safety feature to prevent double signing by ensuring state is persisted to disk after each block. Below is my detailed analysis:


🎯 Overview Assessment

Purpose: Add explicit Sync() calls after block commits to ensure durability and prevent double signing on crash/restart scenarios.

Performance Impact: Benchmarks show a roughly 2.5-4.7x slowdown (empty: 12µs → 43µs, small: 14µs → 66µs, large: 110µs → 272µs). This is a significant tradeoff between safety and throughput.


🔍 Detailed Findings

1. Code Quality & Architecture ⚠️

✅ Positives:

  • Clean API design with Sync() added to the Store interface (pkg/store/types.go:48-49)
  • Follows zero-dependency core package pattern
  • Simple, focused implementation delegates to underlying datastore
  • Proper mock generation included

⚠️ Concerns:

API Design Question (relates to @julienrbrt's comment):
The separate Sync() method exposes an implementation detail to callers. Consider whether this belongs in the interface or should be encapsulated:

  • Option A (current): Explicit Sync() gives callers fine-grained control over durability/performance tradeoffs
  • Option B: Integrate into Batch.Commit() with a flag/option (e.g., CommitOptions{Sync: bool}); a sketch follows below
  • Option C: Async sync strategy (sync every N blocks or on a timer)

The current approach is valid but forces all callers to understand durability semantics. A more idiomatic Go approach might hide this complexity unless explicit control is needed.

Reference: store.go:193-196, types.go:48-49
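
To make Option B concrete, a hypothetical functional-options variant of Commit could look like the following; CommitOptions, WithSync, and the trimmed Batch interface are illustrative, not the existing pkg/store API:

package store

import "context"

// CommitOptions collects per-commit knobs; only durability is modelled here.
type CommitOptions struct {
    Sync bool // force an fsync after the batch is written
}

// CommitOption mutates CommitOptions in the usual functional-options style.
type CommitOption func(*CommitOptions)

// WithSync requests a durable commit.
func WithSync() CommitOption {
    return func(o *CommitOptions) { o.Sync = true }
}

// Batch is a trimmed-down batch that can commit durably on request.
type Batch interface {
    Commit(ctx context.Context, opts ...CommitOption) error
}

A caller that needs durability would then write batch.Commit(ctx, store.WithSync()), while every other call keeps the cheaper default; the durability decision stays at the call site without a second interface method.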


2. Error Handling & Resilience ⚠️

Context.Background() Usage (block/internal/executing/executor.go:416):

if err := e.store.Sync(context.Background()); err != nil {
    return fmt.Errorf("failed to sync store: %w", err)
}

Issues:

  1. Ignores cancellation signals: Using context.Background() means that if e.ctx is cancelled during shutdown, the sync is not cancelled and keeps blocking until it completes
  2. No timeout protection: If BadgerDB's Sync hangs (disk I/O issues, filesystem problems), the executor halts permanently
  3. Inconsistent with codebase patterns: Other operations use e.ctx for proper lifecycle management

Recommendations:

// Option 1: Use the executor context so a shutdown cancels the sync
syncCtx := e.ctx
if err := e.store.Sync(syncCtx); err != nil {
    return fmt.Errorf("failed to sync store: %w", err)
}

// Option 2: Add timeout for safety (if sync must complete even on shutdown)
syncCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
if err := e.store.Sync(syncCtx); err != nil {
    return fmt.Errorf("failed to sync store: %w", err)
}

Error Propagation: Good - errors are properly wrapped with context and returned to halt block production on sync failure.


3. Security

No security concerns identified:

  • No credential handling in this change
  • No user input validation needed (internal operation)
  • Proper error handling prevents silent failures
  • The feature itself improves security (prevents double signing)

4. Performance & Resource Efficiency ⚠️

Critical Performance Impact:

The benchmarks reveal significant overhead:

Scenario         | Without Sync | With Sync | Overhead
Empty            | 12µs         | 43µs      | 3.6x slower
Small (10 txs)   | 14µs         | 66µs      | 4.7x slower
Large (1000 txs) | 110µs        | 272µs     | 2.5x slower

Analysis:

  • BadgerDB uses a Write-Ahead Log (WAL) that provides crash recovery by default
  • The Sync() call forces an fsync() system call, flushing OS page cache to physical disk
  • This is necessary for true durability but comes at a steep cost

Tradeoff Considerations (addressing @tac0turtle's concern):

The current BadgerDB setup already has:

  1. ✅ WAL for crash recovery
  2. ✅ Memory table (MemTable) for buffering
  3. ❌ No explicit sync after commits (relies on WAL + background flush)

Without explicit Sync: a crash could lose the last committed block; on restart, the node may double-sign if it does not detect the unflushed block in the WAL

With explicit Sync: Guarantees durability but reduces throughput by 60-80%

Potential Optimizations:

  1. Conditional Sync: Only sync every N blocks (e.g., every 10 blocks; see the sketch below)
  2. Async Sync: Fire-and-forget sync in goroutine with monitoring
  3. Batched Sync: Accumulate multiple blocks, sync once
  4. Double-sign detection: Enhance restart logic to detect partial state in WAL

Reference: benchmark_test.go:1-71, executor.go:416
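
As an illustration of the first option, the executor could gate the flush on block height. Everything below (syncEvery, the trimmed executor struct, maybeSync) is an assumption for the sketch, not existing code:

package executing

import (
    "context"
    "fmt"
)

// syncEvery bounds the durability window: a crash may lose up to
// syncEvery-1 blocks of unflushed state in exchange for fewer fsyncs.
const syncEvery = 10

// Store mirrors the single method the sketch needs from pkg/store.
type Store interface {
    Sync(ctx context.Context) error
}

type executor struct {
    ctx   context.Context
    store Store
}

// maybeSync flushes the store only on every syncEvery-th block.
func (e *executor) maybeSync(height uint64) error {
    if height%syncEvery != 0 {
        return nil
    }
    if err := e.store.Sync(e.ctx); err != nil {
        return fmt.Errorf("failed to sync store: %w", err)
    }
    return nil
}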


5. Testing & Reliability ⚠️

✅ Positives:

  • Excellent benchmark implementation covering empty/small/large blocks
  • Uses table-driven tests
  • Proper cleanup with b.Cleanup()
  • Benchmarks use realistic block data via types.GetRandomBlock()

❌ Missing Coverage:

  1. No unit tests for Sync() method itself - only benchmarks
  2. No error case testing: What happens when Sync() fails?
  3. No integration test: Does block production actually halt on sync failure?
  4. No concurrency tests: Can multiple syncs happen simultaneously?
  5. Coverage at 0%: All 4 new lines are untested in CI

Recommended Test Additions:

// Unit test for Sync success
func TestStore_Sync_Success(t *testing.T) { ... }

// Unit test for Sync failure  
func TestStore_Sync_Error(t *testing.T) { ... }

// Integration test: block production halts on sync failure
func TestExecutor_SyncFailure_HaltsProduction(t *testing.T) { ... }

Reference: benchmark_test.go, codecov report showing 0% patch coverage


6. Documentation & Developer Experience ⚠️

Missing Documentation:

  1. No godoc comment on the interface method (types.go:48-49):

    // Sync flushes the store state to disk
    Sync(ctx context.Context) error

    Should explain:

    • When to call this (after critical state changes)
    • Performance implications
    • Error handling expectations
    • Whether it's safe to call concurrently
  2. No inline comment explaining the safety vs. performance tradeoff at the call site (executor.go:416)

  3. Benchmark results are in PR description but not in code comments or docs

Recommendation:

// Sync forces a flush of all pending writes to durable storage.
// This is a blocking operation that calls fsync() on the underlying
// database, ensuring crash consistency at the cost of significant
// performance overhead (see benchmark_test.go for measurements).
//
// Call this after critical state transitions where data loss would
// cause safety violations (e.g., double signing).
//
// It is safe to call concurrently, but performance will degrade.
Sync(ctx context.Context) error

🏗️ Architectural Observations

Consistency with ev-node Patterns:

  1. Interface-first design: Properly added to Store interface in pkg/store
  2. Dependency injection: Executor receives Store interface, not concrete type
  3. Error propagation: Uses fmt.Errorf with %w for error wrapping
  4. Mock generation: Updated mocks via mockery
  5. ⚠️ Context usage: Should use executor's context, not context.Background()

📊 Priority Recommendations

High Priority (Before Merge):

  1. Fix context usage (executor.go:416): Use e.ctx or add timeout to prevent indefinite hangs
  2. Add unit tests: Test both success and failure paths for Sync()
  3. Document the interface method: Explain when/why to call and performance implications

Medium Priority (Consider for this PR):

  1. Decide on API design: Should this be part of Commit() instead of separate?
  2. Add integration test: Verify block production halts on sync failure
  3. Benchmark on target hardware: ARM64 macOS results may not reflect production Linux performance

Low Priority (Future Work):

  1. Explore optimization strategies: Async sync, batched sync, or conditional sync
  2. Add metrics: Track sync duration and failures
  3. Consider configurability: Let operators choose safety vs. performance

💭 Final Thoughts

This PR addresses a real safety concern (double signing after crash), and the implementation is clean and straightforward. However, the 2-5x performance penalty is substantial and warrants careful consideration.

Key Questions for the Team:

  1. Is the performance cost acceptable? What's the target block time and throughput?
  2. Does BadgerDB's WAL provide sufficient safety without explicit sync? (May need deeper analysis of BadgerDB crash recovery guarantees)
  3. Should sync be configurable? Allow operators to choose based on their risk tolerance?
  4. Can we detect uncommitted blocks on restart and avoid re-proposing rather than forcing sync? (Defensive approach at startup vs. expensive sync at runtime)

The code is production-ready from a correctness standpoint but needs:

  • Context handling fix
  • Test coverage
  • Documentation
  • Team decision on performance tradeoffs

Files Reviewed:

  • block/internal/executing/executor.go:416-418 ⚠️ Context usage issue
  • pkg/store/store.go:193-196 ✅ Clean implementation
  • pkg/store/types.go:48-49 ⚠️ Needs documentation
  • pkg/store/benchmark_test.go:1-71 ✅ Excellent benchmarks
  • test/mocks/store.go ✅ Properly generated

@alpe alpe changed the title from "Sync store on new block" to "chore: sync store on new block" on Nov 20, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a Sync operation after each block is committed to the store. This is a crucial change to ensure data is persisted to disk immediately, preventing potential state inconsistencies or double signing in case of a crash. The implementation adds a Sync method to the Store interface and calls it in the block production flow. My review focused on the correctness of this implementation and its implications. I have one comment regarding the use of context.Background() for the sync operation, highlighting a trade-off between data integrity on shutdown and the risk of hanging the shutdown process.

if err := batch.Commit(); err != nil {
return fmt.Errorf("failed to commit batch: %w", err)
}
if err := e.store.Sync(context.Background()); err != nil {
Contributor


Severity: medium

Using context.Background() here ensures the Sync operation completes even during a graceful shutdown, which is important for data integrity. However, this comes with a trade-off: if Sync were to hang, it would block the shutdown process indefinitely because it's not cancellable. While this might be an acceptable risk, a more robust solution could involve using a context with a timeout for this critical operation. This would prevent indefinite hangs while still giving the sync operation a grace period to complete during shutdown.

@codecov

codecov bot commented Nov 20, 2025

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.23%. Comparing base (faabb32) to head (6efb020).
⚠️ Report is 12 commits behind head on main.

Files with missing lines             | Patch % | Lines
block/internal/executing/executor.go | 0.00%   | 1 Missing and 1 partial ⚠️
pkg/store/store.go                   | 0.00%   | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2869      +/-   ##
==========================================
+ Coverage   64.94%   67.23%   +2.29%     
==========================================
  Files          81       82       +1     
  Lines        7262     8087     +825     
==========================================
+ Hits         4716     5437     +721     
- Misses       2001     2074      +73     
- Partials      545      576      +31     
Flag     | Coverage Δ
combined | 67.23% <0.00%> (+2.29%) ⬆️

Flags with carried forward coverage won't be shown.


if err := batch.Commit(); err != nil {
return fmt.Errorf("failed to commit batch: %w", err)
}
if err := e.store.Sync(context.Background()); err != nil {
Contributor


how much does this cost in terms of overhead?

Contributor Author


This is a fair question. I have not benchmarked this, as I can only benchmark with random data, which may be very different from what happens on a real chain.
But the question is more about safety vs. speed: without sync, there is a risk of double signing after a kill/restart.

Contributor


We should benchmark this. If on restart we are losing the previous block and re-proposing a new block, then I think it's an issue in how we handle the flow; taking a performance hit like this is something to be careful about. Currently, before this change, we are using the BadgerDB system with its WAL and memory store.

Contributor Author

@alpe alpe Nov 21, 2025


I have added a benchmark and updated the description with some results. You were right to ask for the benchmark. 👍
If the numbers are not acceptable for the use case, we could find some trade-off to balance safety against speed, for example an async call to sync (😄) or a sync every n blocks.

return data, nil
}

// Sync flushes the store state to disk
Member


Shouldn't this be abstracted within Commit, keeping this method out of the interface and just having Commit handle saving?

Contributor Author


This is a good discussion point. I don't mind the additional method on the interface, but depending on the internal implementation, less data piles up at block production.

@tac0turtle
Contributor

Discussion in standup was that we don't need this; can we close it?

@tac0turtle tac0turtle closed this Nov 24, 2025
@tac0turtle tac0turtle reopened this Nov 24, 2025
@tac0turtle
Contributor

Sorry, didn't mean to close.
