add alerts for stuck at catchup and bootstrap #14204

Merged
merged 10 commits into from
Oct 18, 2023
@@ -255,6 +255,24 @@ groups:
      description: "{{ $value }} blocks have been validated on network {{ $labels.testnet }} in the last hour (according to some node)."
      runbook: "https://www.notion.so/minaprotocol/FewBlocksPerHour-47a6356f093242d988b0d9527ce23478"

  - alert: StuckInBootstrap
    expr: count by (testnet) (increase(Coda_Runtime_process_uptime_ms_total{syncStatus = "BOOTSTRAP"}[2h]) >= 7200000) > 0
Member

Because `increase` only computes against the latest monotonic sequence of samples within the specified range, this alert will not trigger if a node is repeatedly crashing and restarting into bootstrap. That's fine so long as we have another alert that captures that case. Do you know of an alert that would trigger if nodes are repeatedly restarting into bootstrap?

Contributor Author

Nope, there's no such alert for now. Once the daemon is restarted, I believe it starts as a new node/instance in gcloud (this is just my experience with the Grafana charts). So if we want to capture data across multiple restarts, we need to use the `app` label to identify a specific pod. For now I don't know a good way to automatically combine the data from different instances of one pod. I wish there were a group-by in the Prometheus query language. I will do some research on that.
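
One possible direction for the restart case, as a rough sketch rather than anything in this PR: the `by` clause the new rules already use (`count by (testnet)`) is PromQL's grouping mechanism, so it could in principle group on `app` as well. The rule below is hypothetical. It assumes the `app` label is present on `Coda_Runtime_process_uptime_ms_total` and stays the same across restarts of one pod, and the alert name and the 80% threshold are made up for illustration.

```yaml
# Hypothetical rule, not part of this PR.
# Sums the bootstrap uptime of every series that shares an `app` label, so the
# separate instances created by restarts of one pod are counted together.
# Assumption: `app` exists on this metric and is stable across restarts.
# The threshold is 80% of 2h in milliseconds (5760000) and is an arbitrary
# choice that leaves room for downtime between restarts.
- alert: RestartingIntoBootstrap
  expr: sum by (testnet, app) (increase(Coda_Runtime_process_uptime_ms_total{syncStatus = "BOOTSTRAP"}[2h])) >= 5760000
  for: ${alert_evaluation_duration}
  labels:
    testnet: "{{ $labels.testnet }}"
    severity: critical
  annotations:
    summary: "{{ $labels.app }} on {{ $labels.testnet }} spent most of the last 2 hours in bootstrap, possibly across restarts"
```

A caveat with this sketch: `increase` needs at least two samples of a series inside the range, so an instance that crashes before its second scrape would contribute nothing to the sum.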

    for: ${alert_evaluation_duration}
    labels:
      testnet: "{{ $labels.testnet }}"
      severity: critical
    annotations:
      summary: "One or more {{ $labels.testnet }} nodes are stuck at bootstrap for more than 2 hours"

  - alert: StuckInCatchup
    expr: count by (testnet) (increase(Coda_Runtime_process_uptime_ms_total{syncStatus = "CATCHUP"}[2h]) >= 7200000) > 0
    for: ${alert_evaluation_duration}
    labels:
      testnet: "{{ $labels.testnet }}"
      severity: critical
    annotations:
      summary: "One or more {{ $labels.testnet }} nodes are stuck at catchup for more than 2 hours"


- name: Warnings
  rules: