Skip to content

Conversation

@tac0turtle
Copy link
Contributor

This PR increases the default ReadinessMaxBlocksBehind value from 3 to 30 blocks to avoid false-negatives during normal p2p sync operations.

Problem

The current default of 3 blocks marks most healthy nodes as "not ready" because they are routinely >3 blocks behind best-head while the p2p sync executes blocks in batches. Observability tools that scrape /health/ready every 5s therefore report the node as permanently offline.

Solution

Raised the default to 30 blocks (≈ 15s worth of blocks with 0.5s block time). This keeps the safety property ("stuck node is detected") while removing the false-positive during normal batch-sync.

Changes

  • Updated pkg/config/defaults.go to change ReadinessMaxBlocksBehind from 3 to 30

Fixes #2778

Generated with Claude Code

The previous default of 3 blocks caused false-negatives during normal
p2p sync operations, where nodes routinely lag by 6-7 seconds (12-14 blocks
at 500ms block time) while executing blocks in batches.

This change raises the default to 30 blocks (15s worth of blocks with 0.5s
block time), which maintains the ability to detect stuck nodes while
eliminating false-positives during normal batch-sync operations.

Fixes #2778

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Marko <tac0turtle@users.noreply.github.com>
@claude
Copy link
Contributor

claude bot commented Oct 23, 2025

Claude finished @tac0turtle's task —— View job


Code Review Complete ✅

Summary

This PR increases the default ReadinessMaxBlocksBehind from 3 to 30 blocks to address false-negative health checks during normal p2p sync operations. The change is well-justified and correct, but requires one test update to maintain test coverage.


Detailed Review

1. Code Quality & Architecture ✅

Strengths:

  • Change is minimal, focused, and follows idiomatic Go patterns
  • Located in the appropriate defaults configuration file
  • No cyclic dependencies or architectural concerns introduced
  • The value is properly typed as uint64, consistent with the field definition in config.go:201

Observations:

  • The configuration is well-structured with clear separation between defaults and runtime configuration
  • The field has proper YAML mapping and helpful inline documentation in config.go:201

2. Correctness & Implementation ✅

Change Analysis:

// Before: ReadinessMaxBlocksBehind: 3,
// After:  ReadinessMaxBlocksBehind: 30,

Rationale Assessment:

  • Problem identified: 3-block threshold causes false-negatives during batch sync (6-7s lag = 12-14 blocks at 500ms block time)
  • Solution: 30 blocks ≈ 15s worth of blocks (at 0.5s block time)
  • Safety preserved: Still detects stuck nodes while eliminating false-positives

Usage in codebase (pkg/rpc/server/http.go:66-77):

allowedBlocksBehind := cfg.Node.ReadinessMaxBlocksBehind
if bestKnownHeight-localHeight > allowedBlocksBehind {
    http.Error(w, "UNREADY: behind best-known head", http.StatusServiceUnavailable)
    return
}

The implementation correctly uses this value to determine node readiness. ✅

3. Testing ⚠️ ACTION REQUIRED

Issue: Test case needs updating

In pkg/rpc/server/server_test.go:413, there's a test case that will now pass when it should fail:

{name: "just_over_3_blocks", local: 96, bestKnown: 100, peers: 1, expectedCode: http.StatusServiceUnavailable},

Impact: This test expects a node 4 blocks behind to be marked as UNREADY. With the new default of 30, this node would be marked as READY, causing the test to fail.

Recommendation:
Update the test case to reflect the new default:

{name: "just_over_30_blocks", local: 69, bestKnown: 100, peers: 1, expectedCode: http.StatusServiceUnavailable},

Also consider updating the "within_3_blocks" test name:

{name: "within_30_blocks", local: 70, bestKnown: 100, peers: 1, expectedCode: http.StatusOK},

File reference: pkg/rpc/server/server_test.go:402-418

4. Security ✅

No security concerns:

  • Configuration value is bounds-checked at usage site
  • No credentials or sensitive data involved
  • No new attack surface introduced

5. Performance & Resource Efficiency ✅

Positive impact:

  • Reduces false alarms in monitoring systems
  • Prevents unnecessary alerts to on-call teams
  • No runtime performance impact - simple integer comparison remains unchanged

Operational benefits:

  • Health check endpoints (/health/ready) will report more accurate node state
  • Monitoring tools scraping every 5s will see nodes as healthy during normal batch sync

6. Documentation ✅

Current state:

  • Field is well-documented in config.go:201 with clear description
  • CLI flag has helpful description in config.go:317
  • Comment explains the configuration purpose

Recommendations:

  • ✅ No additional documentation needed
  • The PR description provides excellent context for the change
  • The inline comment in config.go:201 remains accurate

Recommendations

Required Changes

  1. Update test case in pkg/rpc/server/server_test.go:413:

    -   {name: "just_over_3_blocks", local: 96, bestKnown: 100, peers: 1, expectedCode: http.StatusServiceUnavailable},
    +   {name: "just_over_30_blocks", local: 69, bestKnown: 100, peers: 1, expectedCode: http.StatusServiceUnavailable},
  2. Update test case in pkg/rpc/server/server_test.go:412 (optional but recommended for clarity):

    -   {name: "within_3_blocks", local: 97, bestKnown: 100, peers: 1, expectedCode: http.StatusOK},
    +   {name: "within_30_blocks", local: 70, bestKnown: 100, peers: 1, expectedCode: http.StatusOK},

Optional Enhancements

  1. Add test case for edge case at exactly 30 blocks behind:

    {name: "exactly_30_blocks_behind", local: 70, bestKnown: 100, peers: 1, expectedCode: http.StatusOK},
  2. Consider adding a comment in defaults.go explaining the rationale:

    // 30 blocks ≈ 15s at 500ms block time; allows for normal batch-sync latency
    ReadinessMaxBlocksBehind: 30,

Configuration Impact Analysis

Affected users:

  • Existing deployments using the default value will automatically benefit from reduced false-negatives
  • Users who explicitly configured ReadinessMaxBlocksBehind via CLI flag or config file are not affected
  • This is a backwards-compatible change that improves default behavior

Migration path:

  • No migration needed
  • Operators who prefer the stricter 3-block threshold can override via:
    • CLI: --evnode.node.readiness_max_blocks_behind=3
    • Config: node.readiness_max_blocks_behind: 3

Conclusion

Approval Status:Approve with minor test update required

This is a well-thought-out change that addresses a real operational problem. The single-line change is correct and properly justified. Once the test case is updated, this PR will be ready to merge.

Estimated effort to address feedback: 2-5 minutes


Files Reviewed

  • pkg/config/defaults.go (modified)
  • pkg/config/config.go (reviewed for context)
  • pkg/rpc/server/http.go (reviewed usage)
  • ⚠️ pkg/rpc/server/server_test.go (needs update)
  • pkg/config/config_test.go (reviewed - no changes needed)

@github-actions
Copy link
Contributor

github-actions bot commented Oct 23, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedOct 23, 2025, 12:08 PM

@tac0turtle tac0turtle marked this pull request as ready for review October 23, 2025 11:31
@auricom
Copy link
Contributor

auricom commented Oct 23, 2025

@claude instead of hardcoding a value, please use a dynamic one based of config.blocktime
result should be the equivalent of 15seconds

@claude
Copy link
Contributor

claude bot commented Oct 23, 2025

Claude finished @auricom's task —— View job


Implementation Complete ✅

Successfully made ReadinessMaxBlocksBehind dynamic based on config.blocktime to represent 15 seconds worth of blocks.

Task List

  • Read config files to understand blocktime configuration
  • Determine the appropriate approach for dynamic calculation
  • Implement the calculation (15s / blocktime)
  • Update tests to reflect the change
  • Commit and push changes

Implementation Details

Dynamic Calculation:

  • Created calculateReadinessMaxBlocksBehind() helper function in pkg/config/defaults.go
  • Formula: 15 seconds / blockTime = blocks behind threshold
  • Includes fallback to 30 if blockTime is 0 (edge case protection)

Scaling Behavior:
The threshold now scales automatically with different block times:

  • 0.5s block time → 30 blocks (same as before for testnet scenarios)
  • 1.0s block time → 15 blocks (new default)
  • 2.0s block time → 7 blocks (scales appropriately)

This ensures the 15-second window for normal batch-sync latency is maintained regardless of the configured block time, while still detecting stuck nodes.

Test Updates:
Updated pkg/rpc/server/server_test.go:

  • within_3_blockswithin_15_blocks (local: 85, bestKnown: 100)
  • just_over_3_blocksjust_over_15_blocks (local: 84, bestKnown: 100)

Tests now reflect the new default of 15 blocks for the 1-second block time.

Changed Files

  • pkg/config/defaults.go: Added calculation function and updated default config
  • pkg/rpc/server/server_test.go: Updated test cases to match new default

View Job • Branch: claude/issue-2778-20251023-1120

Instead of hardcoding ReadinessMaxBlocksBehind to 30, calculate it
dynamically as 15 seconds worth of blocks based on the configured
block time. This ensures the threshold scales appropriately:

- 0.5s block time → 30 blocks (same as before for testnet)
- 1.0s block time → 15 blocks (new default)
- 2.0s block time → 7 blocks (scales down appropriately)

The 15-second window allows for normal batch-sync latency while
still detecting stuck nodes, regardless of block time configuration.

Changes:
- Added calculateReadinessMaxBlocksBehind() helper function
- Updated DefaultConfig() to use calculated value
- Updated tests to reflect new default (15 blocks for 1s block time)
- Includes fallback to 30 if block time is 0

Co-authored-by: auricom <auricom@users.noreply.github.com>
@codecov
Copy link

codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 57.14286% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.55%. Comparing base (c6dd42c) to head (8394b47).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pkg/config/defaults.go 53.84% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2779      +/-   ##
==========================================
- Coverage   61.57%   61.55%   -0.02%     
==========================================
  Files          82       82              
  Lines        8715     8727      +12     
==========================================
+ Hits         5366     5372       +6     
- Misses       2845     2849       +4     
- Partials      504      506       +2     
Flag Coverage Δ
combined 61.55% <57.14%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@auricom
Copy link
Contributor

auricom commented Oct 23, 2025

@claude do not hardcode 15seconds, make this value part of config so that a user can override our default value

@claude
Copy link
Contributor

claude bot commented Oct 23, 2025

Claude finished @auricom's task —— View job


Implementation Complete ✅

Successfully made the readiness window configurable instead of hardcoded to 15 seconds.

Task List

  • Read current implementation to understand structure
  • Add new config field for readiness window duration
  • Update defaults to include the new field (15s default)
  • Modify calculation to use configurable value
  • Update tests to verify changes
  • Commit and push changes

Implementation Details

New Configuration Field:

  • Added ReadinessWindowSeconds to NodeConfig (default: 15 seconds)
  • Users can now override this value via CLI flag or YAML config:
    • CLI: --evnode.node.readiness_window_seconds=30
    • YAML: node.readiness_window_seconds: 30

Dynamic Calculation:
The calculateReadinessMaxBlocksBehind() function now accepts both blockTime and windowSeconds parameters:

  • Formula: windowSeconds / blockTime = blocks behind threshold
  • Includes fallbacks for edge cases (0 values)

Behavior Examples:
With default 15-second window:

  • 0.5s block time → 30 blocks
  • 1.0s block time → 15 blocks
  • 2.0s block time → 7 blocks

With custom 30-second window:

  • 0.5s block time → 60 blocks
  • 1.0s block time → 30 blocks
  • 2.0s block time → 15 blocks

Test Updates:

  • Added test assertion for the new flag
  • Updated expected flag count from 38 to 39

Changed Files

  • pkg/config/config.go: Added field definition, CLI flag constant, and flag registration
  • pkg/config/defaults.go: Updated calculation function and default config
  • pkg/config/config_test.go: Added test for new flag

View Job • Branch: claude/issue-2778-20251023-1120

Added ReadinessWindowSeconds config field (default: 15s) to allow users
to override the time window used for calculating ReadinessMaxBlocksBehind
based on block time. This provides flexibility while maintaining the
dynamic calculation based on block time.

Changes:
- Added ReadinessWindowSeconds field to NodeConfig
- Updated calculateReadinessMaxBlocksBehind to accept window parameter
- Added CLI flag and default value (15 seconds)
- Updated tests to verify new configuration

Co-authored-by: auricom <auricom@users.noreply.github.com>
@tac0turtle tac0turtle requested a review from auricom October 23, 2025 13:33
@auricom
Copy link
Contributor

auricom commented Oct 24, 2025

I tried to deploy it locally to check if that works. but i cannot sync the local node as main is broken, so readiness is always offline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[FEATURE] increase default ReadinessMaxBlocksBehind to avoid false-negatives

2 participants