metrics: fix ResettingSample Prometheus _count monotonicity #2174

Merged
manav2401 merged 4 commits into 0xPolygon:develop from lake-dunamu:fix/prometheus-counter-monotonicity
Apr 6, 2026


@lake-dunamu lake-dunamu commented Mar 31, 2026

Description

ResettingSample.Snapshot() calls Clear(), which resets count and sum to 0 on every Prometheus scrape. As a result, _count metrics (declared as the Prometheus counter type) can decrease between scrapes, which violates counter monotonicity and breaks rate(), increase(), and average latency calculations.

Fixed by maintaining cumulative count and sum in resettingSample, while keeping the resetting behavior for sample values used in percentile calculations.
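The behavior described above can be sketched with a toy type. This is only an illustration, not the code in this PR: toySample, its fields, and its methods are hypothetical names. It shows why an interval-scoped count drops between scrapes while a cumulative count stays monotonic.

```go
package main

import (
	"fmt"
	"sync"
)

// toySample is a hypothetical stand-in for a resetting sample.
// It is NOT the type from the metrics package.
type toySample struct {
	mu     sync.Mutex
	values []int64 // observations since the last snapshot
	total  int64   // observations since process start
}

// Update records one observation in the current interval.
func (s *toySample) Update(v int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values = append(s.values, v)
}

// Snapshot returns the interval count and the cumulative count,
// then clears the interval values (the "resetting" part).
func (s *toySample) Snapshot() (intervalCount, cumulativeCount int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	intervalCount = int64(len(s.values))
	s.total += intervalCount
	s.values = s.values[:0]
	return intervalCount, s.total
}

func main() {
	s := &toySample{}
	for _, v := range []int64{10, 20, 30} {
		s.Update(v)
	}
	i1, c1 := s.Snapshot() // first scrape
	s.Update(40)
	i2, c2 := s.Snapshot() // second scrape

	fmt.Println(i1, c1) // 3 3
	fmt.Println(i2, c2) // 1 4
}
```

Exporting the interval count as a counter would show 3 and then 1, which a Prometheus server treats as a counter reset. The cumulative count goes from 3 to 4 and never decreases.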

fixes #2173

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

N/A

Nodes audience

N/A

Checklist

  • I have added at least 2 reviewers or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

  • This PR requires changes to heimdall
  • This PR requires changes to matic-cli

Testing

  • I have added unit tests
  • I have tested this code manually on local environment
  • I have added tests to CI
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests into express-cli

Additional comments

Affected metrics: all histograms using ResettingSample, including rpc_duration_* (RPC latency), P2P tracking, and protocol handler metrics.

Copilot AI review requested due to automatic review settings March 31, 2026 12:56

@claude claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


Copilot AI left a comment

Pull request overview

This PR aims to fix Prometheus counter monotonicity issues caused by ResettingSample.Snapshot() clearing the underlying sample on each scrape, by introducing cumulative count and sum tracking in resettingSample while keeping interval-only sample values for percentile calculations.

Changes:

  • Add cumulative count and sum fields to resettingSample.
  • Update Snapshot() to accumulate count/sum across scrapes and return a snapshot using these cumulative totals.
  • Keep percentile inputs (Values(), Min(), Max()) interval-scoped by clearing the wrapped sample after snapshotting.

@manav2401
Member

@lake-dunamu thanks for the PR. A few comments:

  1. sum isn't tracked in the histogram; only count is, so it's fine to remove sum from the resetting sample.
  2. Please use atomic.Int64 for count to avoid race conditions.
  3. newSampleSnapshotPrecalculated is called with rs.sum, which is the cumulative sum. This skews the mean, because the total sum is used instead of the sum of the current sample. Replace it with s.Sum().
  4. Please add some basic tests to validate this.
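Point 3 can be checked with a small arithmetic sketch. It assumes the snapshot mean is computed as the sum divided by the number of values in the current interval; the mean and sumOf helpers below are hypothetical and only illustrate the skew.

```go
package main

import "fmt"

// mean divides a sum by a number of values. Hypothetical helper that
// mirrors the assumed "sum / number of interval values" calculation.
func mean(sum int64, n int) float64 {
	if n == 0 {
		return 0
	}
	return float64(sum) / float64(n)
}

// sumOf adds up all values in a slice.
func sumOf(vs []int64) int64 {
	var total int64
	for _, v := range vs {
		total += v
	}
	return total
}

func main() {
	first := []int64{100, 200, 300} // values seen before the first scrape
	second := []int64{10, 20}       // values seen before the second scrape

	// Cumulative sum (600 + 30) divided by the interval length (2).
	skewed := mean(sumOf(first)+sumOf(second), len(second))
	// Interval sum (30) divided by the interval length (2).
	correct := mean(sumOf(second), len(second))

	fmt.Println(skewed, correct) // 315 15
}
```

With the cumulative sum, the second interval reports a mean of 315 even though its values are 10 and 20. Using the interval sum gives 15.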

Let me know if you need help or have any concerns. Thanks!

@lake-dunamu
Contributor Author

@manav2401

Thanks for the review.

Addressed:

  1. Removed the sum field; it is not needed in the histogram.
  2. count now uses atomic.Int64.
  3. Passing s.Sum() instead of rs.sum to fix the mean calculation.
  4. Added basic tests.
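As a rough illustration of point 2, a cumulative counter backed by atomic.Int64 (standard library, Go 1.19 and later) can be updated from several goroutines without a data race. The scrapeCounter type and addInterval method are hypothetical and are not taken from the merged change.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// scrapeCounter is a hypothetical holder for a cumulative count.
type scrapeCounter struct {
	count atomic.Int64
}

// addInterval folds one interval's count into the cumulative total
// and returns the new total.
func (c *scrapeCounter) addInterval(n int64) int64 {
	return c.count.Add(n)
}

func main() {
	var c scrapeCounter
	var wg sync.WaitGroup

	// Eight goroutines stand in for concurrent snapshot callers.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.addInterval(1)
			}
		}()
	}
	wg.Wait()

	fmt.Println(c.count.Load()) // 8000
}
```

A plain int64 incremented the same way would be flagged by `go run -race` and could lose updates; the atomic type avoids both without taking a lock.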

@lake-dunamu
Contributor Author

@manav2401

could you PTAL?

Copilot AI review requested due to automatic review settings April 3, 2026 07:00

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

metrics/resetting_sample.go:8

  • Doc comment has typos/grammar issues (e.g., “this ensure”, “skew th charts”). Please correct wording to improve clarity of exported API docs.
// ResettingSample converts an ordinary sample into one that resets whenever its
// snapshot is retrieved. This will break for multi-monitor systems, but when only
// a single metric is being pushed out, this ensure that low-frequency events don't
// skew th charts indefinitely.

sonarqubecloud bot commented Apr 3, 2026

claude bot commented Apr 4, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@manav2401 manav2401 merged commit 56cd8ad into 0xPolygon:develop Apr 6, 2026
14 of 17 checks passed

Development

Successfully merging this pull request may close these issues.

Prometheus metrics violate counter monotonicity convention: _count and _sum reset on every scrape due to ResettingSample

5 participants