add alerts for stuck at catchup and bootstrap #14204
!ci-build-me
@@ -255,6 +255,24 @@ groups:
description: "{{ $value }} blocks have been validated on network {{ $labels.testnet }} in the last hour (according to some node)."
runbook: "https://www.notion.so/minaprotocol/FewBlocksPerHour-47a6356f093242d988b0d9527ce23478"

- alert: StuckInBootstrap
expr: count by (testnet) (increase(Coda_Runtime_process_uptime_ms_total{syncStatus = "BOOTSTRAP"}[2h]) >= 7200000) > 0
Because `increase` only computes against the latest monotonic sequence of samples within the specified range, this alert will not trigger if a node is repeatedly crashing and restarting into bootstrap. That's fine so long as we have another alert that captures that case. Do you know of an alert that would trigger if nodes are repeatedly restarting into bootstrap?
Nope, there's no such alert for now. Once the daemon is restarted, I believe it starts as a new node/instance in gcloud (this is just my experience with the Grafana charts). So to capture data across multiple restarts, we would need to use the `app` label to identify a specific pod. For now I don't know a good way to automatically combine the data from different instances of one pod. I wish there were a `group by` in the Prometheus query language. I'll do some research on that.
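For what it's worth, PromQL aggregation operators do accept a `by (...)` clause, as the `count by (testnet)` in the rule above already does, so one possible direction is to sum the per-instance increases over the `app` label. The following is only a rough sketch, not a tested rule: the alert name, the 90-minute threshold, and the assumption that every restarted instance of a pod keeps the same `app` label value are all hypothetical.

```yaml
# Hypothetical sketch: alert name, threshold, and label behavior are assumptions.
- alert: RepeatedlyRestartingIntoBootstrap
  # Sum bootstrap uptime across all instances that share an `app` label, so
  # time spent in bootstrap before and after a restart is counted together.
  # The threshold (5400000 ms = 90 min) is deliberately below the 2h window,
  # because downtime between restarts is not counted as uptime.
  expr: sum by (testnet, app) (increase(Coda_Runtime_process_uptime_ms_total{syncStatus = "BOOTSTRAP"}[2h])) >= 5400000
  annotations:
    description: "{{ $labels.app }} on network {{ $labels.testnet }} has spent most of the last 2 hours in bootstrap, possibly across restarts."
```

Whether this catches the crash-loop case depends on how the restarted instances are labeled in practice, so it would need checking against real series before being relied on.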
!approved-for-mainnet
Explain your changes:
This PR adds alerts for nodes stuck in bootstrap or catchup.
Explain how you tested your changes:
*
Checklist: