add alerts for stuck at catchup and bootstrap #14204

Merged 10 commits on Oct 18, 2023

ghost-not-in-the-shell
Contributor

Explain your changes:
This PR adds alerts for nodes stuck at bootstrap and catchup

Explain how you tested your changes:
*

Checklist:

  • Dependency versions are unchanged
    • Notify Velocity team if dependencies must change in CI
  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? List them
  • Closes #0000

@ghost-not-in-the-shell
Contributor Author

!ci-build-me

@@ -255,6 +255,24 @@ groups:
      description: "{{ $value }} blocks have been validated on network {{ $labels.testnet }} in the last hour (according to some node)."
      runbook: "https://www.notion.so/minaprotocol/FewBlocksPerHour-47a6356f093242d988b0d9527ce23478"

  - alert: StuckInBootstrap
    expr: count by (testnet) (increase(Coda_Runtime_process_uptime_ms_total{syncStatus = "BOOTSTRAP"}[2h]) >= 7200000) > 0
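
For readers less familiar with PromQL, the expression above can be read in three steps: keep only the series whose 2h uptime increase is at least 7200000 ms (2 h expressed in milliseconds), count the surviving series per testnet, and keep the testnets where that count is above zero. A rough Python sketch of that evaluation follows. The label values and sample numbers are made up for illustration, and the sketch takes each per-series increase as given rather than computing it the way Prometheus does.

```python
from collections import Counter

# 2h expressed in milliseconds, matching the 7200000 threshold in the rule
WINDOW_MS = 2 * 60 * 60 * 1000


def stuck_in_bootstrap(increases):
    """increases: list of (labels, increase_ms) pairs, one per series
    that carries syncStatus="BOOTSTRAP" (hypothetical input shape)."""
    # `... >= 7200000` drops every series below the threshold
    kept = [labels for labels, inc in increases if inc >= WINDOW_MS]
    # `count by (testnet)` counts the surviving series per testnet
    counts = Counter(labels["testnet"] for labels in kept)
    # `> 0` keeps testnets with at least one such series
    return {testnet: n for testnet, n in counts.items() if n > 0}


# Hypothetical data: only node-1 spent the whole window in bootstrap
sample = [
    ({"testnet": "net-a", "app": "node-1"}, 7_200_000),
    ({"testnet": "net-a", "app": "node-2"}, 1_800_000),
    ({"testnet": "net-b", "app": "node-3"}, 600_000),
]
print(stuck_in_bootstrap(sample))
```

Because the threshold equals the full window length, a series only passes the filter if it reported BOOTSTRAP uptime for essentially the entire 2h.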
Member

Because increase only computes against the latest monotonic sequence of samples within the specified range, this alert will not trigger if a node is repeatedly crashing and restarting into bootstrap. That's fine so long as we have another alert that captures that case. Do you know of an alert that would trigger if nodes are repeatedly restarting into bootstrap?
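
The concern in this comment can be illustrated with a simplified model. The sketch below is not Prometheus's actual implementation of increase (which also extrapolates to the window boundaries and has its own counter-reset handling); it only models the behavior as described here, taking the difference over the trailing non-decreasing run of samples. The 60 s scrape interval and the function name `increase_latest_run` are assumptions made for the example.

```python
THRESHOLD_MS = 7_200_000   # threshold used by the StuckInBootstrap rule
SCRAPE_MS = 60_000         # assumed scrape interval of 60 s


def increase_latest_run(samples):
    """Difference between the last sample and the first sample of the
    trailing non-decreasing run (simplified model, no extrapolation)."""
    if not samples:
        return 0
    start = len(samples) - 1
    # Walk backwards while the counter has not decreased
    while start > 0 and samples[start - 1] <= samples[start]:
        start -= 1
    return samples[-1] - samples[start]


# Node that stays in bootstrap for the whole 2h window (121 samples)
steady = [i * SCRAPE_MS for i in range(121)]

# Node that crashes halfway through and restarts into bootstrap:
# the uptime counter starts again from zero
restarted = [i * SCRAPE_MS for i in range(61)] + [i * SCRAPE_MS for i in range(60)]

for name, samples in (("steady", steady), ("restarted", restarted)):
    inc = increase_latest_run(samples)
    print(name, inc, inc >= THRESHOLD_MS)
```

Under this model the steady node reaches the threshold while the restarting node does not, even though both spent the window in bootstrap, which is why a separate alert for repeated restarts would be needed.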

Contributor Author

Nope, there's no such alert for now. Once the daemon is restarted, I believe it comes back as a new node/instance in gcloud (this is just my experience with Grafana charts). So if we want to capture data across multiple restarts, we need to use the app label to select a specific pod. For now I don't know a good way to automatically combine the data from different instances of one pod. I wish there were a group-by in the Prometheus query language. I'll do some research on that.
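
On the group-by point: PromQL aggregation operators such as sum and count accept a by (...) clause, and the expression in this PR already uses count by (testnet). As a rough illustration of what combining instances of one pod could look like, the Python sketch below sums per-instance values that share the same app label. The instance label, its values, and the numbers are hypothetical, and whether aggregating by app would give a usable alert here is not established in this thread.

```python
from collections import defaultdict


def sum_by_label(series, label):
    """Sum values of all series that share the same value for `label`,
    similar in spirit to `sum by (<label>) (...)` in PromQL."""
    totals = defaultdict(int)
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)


# Hypothetical per-instance uptime increases (ms) for two pods;
# node-1 restarted once, so it shows up as two instances
series = [
    ({"app": "node-1", "instance": "10.0.0.1"}, 3_600_000),
    ({"app": "node-1", "instance": "10.0.0.2"}, 3_540_000),
    ({"app": "node-2", "instance": "10.0.0.3"}, 900_000),
]
print(sum_by_label(series, "app"))
```

Note that even the combined value for a restarted pod stays below a threshold equal to the full window, because the downtime between restarts is never counted, so a lower threshold or a different signal would be needed.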

@ghost-not-in-the-shell
Contributor Author

!ci-build-me

@deepthiskumar
Member

!approved-for-mainnet

@ghost-not-in-the-shell ghost-not-in-the-shell merged commit 149864c into berkeley Oct 18, 2023
@ghost-not-in-the-shell ghost-not-in-the-shell deleted the alert/catchup-and-bootstrap-got-stuck branch October 18, 2023 12:05