metrics: fix ResettingSample Prometheus _count monotonicity#2174
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to fix Prometheus counter monotonicity issues caused by ResettingSample.Snapshot() clearing the underlying sample on each scrape, by introducing cumulative count and sum tracking in resettingSample while keeping interval-only sample values for percentile calculations.
Changes:
- Add cumulative
countandsumfields toresettingSample. - Update
Snapshot()to accumulate count/sum across scrapes and return a snapshot using these cumulative totals. - Keep percentile inputs (
Values(),Min(),Max()) interval-scoped by clearing the wrapped sample after snapshotting.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@lake-dunamu thanks for the PR. Few comments.
Let me know if you need help or have any concerns. Thanks! |
|
Thanks for the review. addressed:
|
|
could you PTAL? |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
metrics/resetting_sample.go:8
- Doc comment has typos/grammar issues (e.g., “this ensure”, “skew th charts”). Please correct wording to improve clarity of exported API docs.
// ResettingSample converts an ordinary sample into one that resets whenever its
// snapshot is retrieved. This will break for multi-monitor systems, but when only
// a single metric is being pushed out, this ensure that low-frequency events don't
// skew th charts indefinitely.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |



Description
ResettingSample.Snapshot()callsClear()which resets count and sum to 0 on every Prometheus scrape. This causes_countmetrics (declared as Prometheuscountertype) to decrease, violating counter monotonicity and breakingrate(),increase(), and average latency calculations.Fixed by maintaining cumulative count and sum in
resettingSample, while keeping the resetting behavior for sample values used in percentile calculations.fixes #2173
Changes
Breaking changes
N/A
Nodes audience
N/A
Checklist
Cross repository changes
Testing
Additional comments
Affected metrics: all histograms using
ResettingSample, includingrpc_duration_*(RPC latency), P2P tracking, and protocol handler metrics.