chore: sync store on new block #2869
Conversation
Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a critical change to enhance data persistence by ensuring that the store's state is explicitly synced to disk after every block is processed and committed. This measure is designed to prevent potential data inconsistencies and issues like double signing that could arise from unpersisted state, with the understanding that there might be an impact on overall performance.
Claude finished @alpe's task

Comprehensive Code Review

I've completed a thorough review of the store sync implementation. This PR adds a critical safety feature to prevent double signing by ensuring state is persisted to disk after each block. Below is my detailed analysis:

🎯 Overview Assessment

Purpose: Add an explicit `Sync()` call after each block is committed so the store's state is durably persisted.

Performance Impact: Benchmarks show a roughly 2.5-4.7x slowdown (empty: 12µs → 43µs, small: 14µs → 66µs, large: 110µs → 272µs). This is a significant tradeoff between safety and throughput.

🔍 Detailed Findings

1. Code Quality & Architecture
| Scenario | Without Sync | With Sync | Overhead |
|---|---|---|---|
| Empty | 12µs | 43µs | 3.6x slower |
| Small (10 txs) | 14µs | 66µs | 4.7x slower |
| Large (1000 txs) | 110µs | 272µs | 2.5x slower |
Analysis:
- BadgerDB uses a Write-Ahead Log (WAL) that provides crash recovery by default
- The `Sync()` call forces an `fsync()` system call, flushing the OS page cache to physical disk
- This is necessary for true durability but comes at a steep cost
Tradeoff Considerations (addressing @tac0turtle's concern):
The current BadgerDB setup already has:
- ✅ WAL for crash recovery
- ✅ Memory table (MemTable) for buffering
- ❌ No explicit sync after commits (relies on WAL + background flush)
Without explicit Sync: Crash could lose the last committed block → node restarts, may double-sign if it doesn't detect the uncommitted block in WAL
With explicit Sync: Guarantees durability but reduces throughput by 60-80%
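For concreteness, here is a sketch of the two durability knobs BadgerDB exposes; the paths and option choices are illustrative, and the PR actually goes through the store's datastore wrapper rather than calling Badger directly:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Knob 1: fsync on every write. Maximum durability, slowest path.
	_ = badger.DefaultOptions("/tmp/ev-data").WithSyncWrites(true)

	// Knob 2 (roughly the PR's approach): leave SyncWrites off and force
	// an explicit flush at chosen points, e.g. after a block commit.
	db, err := badger.Open(badger.DefaultOptions("/tmp/ev-data"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// ... commit the block's write batch here ...

	// Force pending writes out of the OS page cache onto physical disk.
	if err := db.Sync(); err != nil {
		log.Fatal(err)
	}
}
```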
Potential Optimizations:
- Conditional Sync: Only sync every N blocks (e.g., every 10 blocks); see the sketch after this list
- Async Sync: Fire-and-forget sync in goroutine with monitoring
- Batched Sync: Accumulate multiple blocks, sync once
- Double-sign detection: Enhance restart logic to detect partial state in WAL
Reference: benchmark_test.go:1-71, executor.go:416
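A minimal sketch of the first option, conditional sync; `syncInterval`, `maybeSync`, and the trimmed-down `Store` interface are hypothetical names for illustration, not the PR's code:

```go
package executing

import (
	"context"
	"fmt"
)

// Store is the subset of the store interface this sketch needs.
type Store interface {
	Sync(ctx context.Context) error
}

// syncInterval is a hypothetical tuning knob: fsync every N blocks.
const syncInterval = 10

type Executor struct {
	store Store
}

// maybeSync pays the fsync cost only on every syncInterval-th block,
// relying on BadgerDB's WAL for crash recovery in between.
func (e *Executor) maybeSync(ctx context.Context, height uint64) error {
	if height%syncInterval != 0 {
		return nil
	}
	if err := e.store.Sync(ctx); err != nil {
		return fmt.Errorf("failed to sync store at height %d: %w", height, err)
	}
	return nil
}
```

The tradeoff: a crash can now lose up to syncInterval-1 blocks of explicitly synced state, so the restart logic must tolerate recovering those blocks from the WAL.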
5. Testing & Reliability ⚠️
✅ Positives:
- Excellent benchmark implementation covering empty/small/large blocks
- Uses table-driven tests
- Proper cleanup with `b.Cleanup()`
- Benchmarks use realistic block data via `types.GetRandomBlock()`
❌ Missing Coverage:
- No unit tests for the `Sync()` method itself, only benchmarks
- No error case testing: what happens when `Sync()` fails?
- No integration test: does block production actually halt on sync failure?
- No concurrency tests: Can multiple syncs happen simultaneously?
- Coverage at 0%: All 4 new lines are untested in CI
Recommended Test Additions:

```go
// Unit test for Sync success
func TestStore_Sync_Success(t *testing.T) { ... }

// Unit test for Sync failure
func TestStore_Sync_Error(t *testing.T) { ... }

// Integration test: block production halts on sync failure
func TestExecutor_SyncFailure_HaltsProduction(t *testing.T) { ... }
```

Reference: benchmark_test.go, codecov report showing 0% patch coverage
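As a sketch of the missing error-path coverage, assuming nothing about the repo's mock package; `failingSyncer` and `commitAndSync` are hypothetical stand-ins for the mockery mock and the executor's post-commit step:

```go
package store_test

import (
	"context"
	"errors"
	"fmt"
	"testing"
)

// Syncer and failingSyncer stand in for the Store interface and a
// fault-injecting mock; the names are illustrative only.
type Syncer interface {
	Sync(ctx context.Context) error
}

type failingSyncer struct{ err error }

func (f failingSyncer) Sync(ctx context.Context) error { return f.err }

// commitAndSync mimics the executor's post-commit step: sync the store
// and wrap any failure so block production can halt on it.
func commitAndSync(ctx context.Context, s Syncer) error {
	if err := s.Sync(ctx); err != nil {
		return fmt.Errorf("failed to sync store: %w", err)
	}
	return nil
}

func TestCommitAndSync_PropagatesError(t *testing.T) {
	wantErr := errors.New("disk full")
	err := commitAndSync(context.Background(), failingSyncer{err: wantErr})
	if !errors.Is(err, wantErr) {
		t.Fatalf("expected wrapped sync error, got %v", err)
	}
}
```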
6. Documentation & Developer Experience ⚠️
Missing Documentation:
- Only a one-line godoc comment on the interface method (types.go:48-49):

```go
// Sync flushes the store state to disk
Sync(ctx context.Context) error
```
Should explain:
- When to call this (after critical state changes)
- Performance implications
- Error handling expectations
- Whether it's safe to call concurrently
- No inline comment explaining the safety vs. performance tradeoff at the call site (executor.go:416)
- Benchmark results are in the PR description but not in code comments or docs
Recommendation:

```go
// Sync forces a flush of all pending writes to durable storage.
// This is a blocking operation that calls fsync() on the underlying
// database, ensuring crash consistency at the cost of significant
// performance overhead (see benchmark_test.go for measurements).
//
// Call this after critical state transitions where data loss would
// cause safety violations (e.g., double signing).
//
// It is safe to call concurrently, but performance will degrade.
Sync(ctx context.Context) error
```

🏗️ Architectural Observations
Consistency with ev-node Patterns:
- ✅ Interface-first design: properly added to the `Store` interface in pkg/store
- ✅ Dependency injection: the executor receives the `Store` interface, not a concrete type
- ✅ Error propagation: uses `fmt.Errorf` with `%w` for error wrapping
- ✅ Mock generation: mocks updated via mockery
- ⚠️ Context usage: should use the executor's context, not `context.Background()`
📊 Priority Recommendations
High Priority (Before Merge):
- Fix context usage (executor.go:416): use `e.ctx` or add a timeout to prevent indefinite hangs
- Add unit tests: test both success and failure paths for `Sync()`
- Document the interface method: explain when/why to call it and the performance implications

Medium Priority (Consider for this PR):
- Decide on API design: should this be part of `Commit()` instead of a separate method?
- Add integration test: verify block production halts on sync failure
- Benchmark on target hardware: ARM64 macOS results may not reflect production Linux performance

Low Priority (Future Work):
- Explore optimization strategies: async sync, batched sync, or conditional sync
- Add metrics: track sync duration and failures
- Consider configurability: let operators choose safety vs. performance
💭 Final Thoughts
This PR addresses a real safety concern (double signing after crash), and the implementation is clean and straightforward. However, the 2-5x performance penalty is substantial and warrants careful consideration.
Key Questions for the Team:
- Is the performance cost acceptable? What's the target block time and throughput?
- Does BadgerDB's WAL provide sufficient safety without explicit sync? (May need deeper analysis of BadgerDB crash recovery guarantees)
- Should sync be configurable? Allow operators to choose based on their risk tolerance?
- Can we detect uncommitted blocks on restart and avoid re-proposing rather than forcing sync? (Defensive approach at startup vs. expensive sync at runtime)
The code is production-ready from a correctness standpoint but needs:
- Context handling fix
- Test coverage
- Documentation
- Team decision on performance tradeoffs
Files Reviewed:
- block/internal/executing/executor.go:416-418 ⚠️ Context usage issue
- pkg/store/store.go:193-196 ✅ Clean implementation
- pkg/store/types.go:48-49 ⚠️ Needs documentation
- pkg/store/benchmark_test.go:1-71 ✅ Excellent benchmarks
- test/mocks/store.go ✅ Properly generated
Code Review
This pull request introduces a Sync operation after each block is committed to the store. This is a crucial change to ensure data is persisted to disk immediately, preventing potential state inconsistencies or double signing in case of a crash. The implementation adds a Sync method to the Store interface and calls it in the block production flow. My review focused on the correctness of this implementation and its implications. I have one comment regarding the use of context.Background() for the sync operation, highlighting a trade-off between data integrity on shutdown and the risk of hanging the shutdown process.
```go
if err := batch.Commit(); err != nil {
	return fmt.Errorf("failed to commit batch: %w", err)
}
if err := e.store.Sync(context.Background()); err != nil {
```
Using context.Background() here ensures the Sync operation completes even during a graceful shutdown, which is important for data integrity. However, this comes with a trade-off: if Sync were to hang, it would block the shutdown process indefinitely because it's not cancellable. While this might be an acceptable risk, a more robust solution could involve using a context with a timeout for this critical operation. This would prevent indefinite hangs while still giving the sync operation a grace period to complete during shutdown.
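A sketch of the timeout variant suggested here; the helper name and the 5-second grace period are illustrative assumptions, not the PR's code:

```go
package executing

import (
	"context"
	"fmt"
	"time"
)

// syncer is the one method this sketch needs from the store.
type syncer interface {
	Sync(ctx context.Context) error
}

// syncWithGracePeriod bounds the post-commit sync so a hung fsync cannot
// block a graceful shutdown indefinitely, while still detaching from the
// caller's (possibly already cancelled) context.
func syncWithGracePeriod(s syncer, grace time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	if err := s.Sync(ctx); err != nil {
		return fmt.Errorf("failed to sync store: %w", err)
	}
	return nil
}

// Hypothetical call site, replacing e.store.Sync(context.Background()):
//
//	if err := syncWithGracePeriod(e.store, 5*time.Second); err != nil { ... }
```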
Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage.
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2869      +/-   ##
==========================================
+ Coverage   64.94%   67.23%   +2.29%
==========================================
  Files          81       82       +1
  Lines        7262     8087     +825
==========================================
+ Hits         4716     5437     +721
- Misses       2001     2074      +73
- Partials      545      576      +31
```

☔ View full report in Codecov by Sentry.
```go
if err := batch.Commit(); err != nil {
	return fmt.Errorf("failed to commit batch: %w", err)
}
if err := e.store.Sync(context.Background()); err != nil {
```
how much does this cost in terms of overhead?
This is a fair question. I have not benchmarked this, as I can only provide random data which may be very different from what happens on a real chain.
But the question is more safety vs. speed. Without sync, there is a risk of double signing after a kill/restart.
We should benchmark this. If on restart we are losing the previous block and reproposing a new block, then I think it's an issue in how we handle the flow; taking a performance hit like this is something to be careful about. Before this change, we are already using the BadgerDB system with its WAL and memory store.
I have added a benchmark and updated the description with some results. You were right to ask for the benchmark. 👍
If the numbers are not acceptable for the use case, we could find some trade-off to balance safety against speed, with an async call to sync (😄) or a sync every n blocks, for example.
```go
	return data, nil
}

// Sync flushes the store state to disk
```
Shouldn't this be abstracted within Commit? We could keep this method out of the interface and just have Commit handle saving.
This is a good discussion point. I don't mind the additional method on the interface, but depending on the internal implementation, less data piles up at block production this way.
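To make the two API shapes under discussion concrete, a sketch; both interfaces are illustrative, not the actual pkg/store definitions:

```go
package store

import "context"

// Option A (the PR's approach): durability is a separate, explicit call,
// so callers decide when to pay the fsync cost.
type StoreWithSync interface {
	// ... existing methods ...
	Sync(ctx context.Context) error
}

// Option B (the suggestion above): fold durability into the commit path,
// keeping the interface smaller but making every commit pay the sync cost
// unless the implementation batches internally.
type StoreSyncOnCommit interface {
	// Commit persists the batch and fsyncs before returning.
	Commit(ctx context.Context) error
}
```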
Discussion in standup was we don't need this; can we close it?
Sorry, didn't mean to close.
Overview
After each block, the store is synced to ensure the state is persisted. This is key to preventing double signing on dirty state.
I would expect a cost to overall performance from this, but I don't have numbers.
Pulled from #2836
Some benchmark results on my box: