Improve watchdog alert #2467
Conversation
This expression only fires the Watchdog alert if the TSDB is up to date, and therefore checks the functionality of the full stack. With the former `vector(1)`, only Alertmanager needs to be functional for it to fire. So in the case of a full TSDB storage, the Watchdog still fires and the lack of new metrics goes unnoticed.
Hm, I honestly don't know why the Ubuntu pipeline exited with 1. Only the rule expression changed. Maybe it's an issue with the pipeline itself?
You need to run …
The initial purpose of the Watchdog alert was to ensure that the general ability to send alerts is operational. For this reason, … As for testing whether Prometheus can generate alerts, I think this should be part of the prometheus mixin, and if I am correct, one of …
Hey @paulfantom, thanks a lot for the feedback! In my years of dealing with "dead" Prometheus instances, I never saw one of the alerts you mentioned firing. I'm gonna check on them. And I agree with you: keeping the Watchdog alert as it is (only reporting on a down Alertmanager) and having another alert that tells you there is an issue with rule evaluation would be a nicer solution. Thanks again!
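For reference, a rough sketch of the kind of separate alert discussed above, i.e. leaving Watchdog untouched and alerting on rule-evaluation problems instead. The group name, alert name, `job="prometheus-k8s"` selector, and `for` duration below are illustrative assumptions; the prometheus mixin's own alerts (such as PrometheusRuleFailures) may differ in detail:

```yaml
# Sketch only: alert when Prometheus fails to evaluate recording/alerting rules.
# Names, selectors and thresholds are assumptions, not the exact rules shipped
# by the prometheus mixin.
groups:
  - name: prometheus-rule-evaluation.sketch
    rules:
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s"}[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          description: Prometheus {{ $labels.instance }} failed to evaluate rule groups in the last 5 minutes.
```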
Description
This expression fires the Watchdog alert only if the TSDB is up to date and therefore checks the functionality of the full stack.
According to the Runbook, the intention of the Watchdog alert is, quote:
> This is an alert meant to ensure that the entire alerting pipeline is functional.
`vector(1)` fires when the Alertmanager and Prometheus Pods are up and running. When either of them is down, the Watchdog (with this expression) serves its purpose as intended. However, when TSDB storage runs full, Watchdog with `vector(1)` will still fire, but all alerts with expressions that depend on a metric will not trigger, because said metrics are now missing. In short, a dysfunctional Prometheus stack goes completely unnoticed.

This may also happen in the (albeit unlikely) case that the storage runs full faster than Prometheus can trigger a `KubePersistentVolumeFillingUp` alert.
Type of change

What type of changes does your code introduce to kube-prometheus? Put an `x` in the box that applies.

- [ ] `CHANGE` (fix or feature that would cause existing functionality to not work as expected)
- [ ] `FEATURE` (non-breaking change which adds functionality)
- [ ] `BUGFIX` (non-breaking change which fixes an issue)
- [ ] `ENHANCEMENT` (non-breaking change which improves existing functionality)
- [ ] `NONE` (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry
Please put a one-line changelog entry below. Later this will be copied to the changelog file.