
Mixin: Add and update alerts #2644

Merged · 19 commits · Jul 4, 2023
Conversation

@v-zhuravlev (Contributor) commented Mar 27, 2023:

  • e1f3b37 Add %(nodeExporterSelector)s to Network and conntrack alerts
  • af23597 Add NodeDiskIOSaturation alert
  • e2e7864 Set 'at' everywhere as preposition for instance
  • 282c129 Decrease NodeNetwork*Errs pending period
  • 94535a9 Add failed systemd service alert
  • f685865 Add CPU and memory alerts
  • 4d19e84 Decrease NodeFilesystem pending time to 15m
  • 0336d13 Add mountpoint to NodeFilesystem alerts
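
For context, the failed systemd service alert from the list above could look roughly like this jsonnet sketch in the mixin's usual style; the 'for' duration and annotation wording here are assumptions, not necessarily the merged values:

  {
    alert: 'NodeSystemdServiceFailed',
    // node_systemd_unit_state exposes one series per unit and state;
    // a failed unit reports state="failed" with value 1.
    expr: |||
      node_systemd_unit_state{%(nodeExporterSelector)s, state="failed"} == 1
    ||| % $._config,
    'for': '5m',  // assumed; gives flapping units a chance to recover
    labels: {
      severity: 'warning',
    },
    annotations: {
      summary: 'Systemd service has entered failed state.',
      description: 'Systemd service {{ $labels.name }} has entered failed state at {{ $labels.instance }}.',
    },
  },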

@v-zhuravlev changed the title from "Mixin: Alerts: Add mountpoint to NodeFilesystem alerts" to "Mixin: Add and update alerts" on Mar 27, 2023
@v-zhuravlev force-pushed the mixin_alerts branch 7 times, most recently from 7b78643 to 4a83ed2 on March 27, 2023 22:58
@@ -309,6 +309,102 @@
description: 'File descriptors limit at {{ $labels.instance }} is currently at {{ printf "%.2f" $value }}%.',
},
},
{
alert: 'NodeCPUHighUsage',
Member:
High CPU usage is not a problem and can just be an indicator of properly utilizing your machine, so I'd remove these.

Contributor Author:

Perhaps, as long as we can alert on high system load (saturation).

Member:

Yeah, CPU usage is good. :) I mean, this would be a case for the "info" level alerts that I like to promote, but I don't think we have them here in the mixin.

(Info level alerts notify nobody, but you could look at the alerts page while troubleshooting. They point to things that are not problems per se and might be OK, but which you might be interested in while an actual incident is happening.)

Member:

Yeah, I'd be fine with an 'info' level severity. No reason not to just introduce it now that we're at it.

Contributor Author:

I'll make it an info according to this guideline:

info for alerts that do not require any action by itself but mark something as “out of the ordinary”. Those alerts aren’t usually routed anywhere, but can be inspected during troubleshooting.
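
For illustration, an info-level alert in this mixin's style would look roughly like the sketch below. The expression, the cpuHighUsageThreshold parameter, and the 'for' duration follow the mixin's existing conventions but are assumptions here, not necessarily the exact merged values:

  {
    alert: 'NodeCPUHighUsage',
    // Average non-idle CPU time across all cores, as a percentage.
    // %(nodeExporterSelector)s and %(cpuHighUsageThreshold)d are assumed
    // _config parameters, in the style of the existing alerts.
    expr: |||
      sum without(mode) (avg without(cpu) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode!="idle"}[2m]))) * 100 > %(cpuHighUsageThreshold)d
    ||| % $._config,
    'for': '15m',
    labels: {
      // info alerts are not routed anywhere; they only provide context
      // on the alerts page during troubleshooting.
      severity: 'info',
    },
    annotations: {
      summary: 'High CPU usage.',
      description: 'CPU usage at {{ $labels.instance }} has been above %(cpuHighUsageThreshold)d%% for the last 15 minutes, and is currently at {{ printf "%%.2f" $value }}%%.' % $._config,
    },
  },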

@jcpunk commented May 5, 2023:

Perhaps a warning if the usage stays above 98% for 1h would be viable? That would be a case where the host is at capacity and scheduling more tasks there would result in performance degradation. It is a risk folks can accept, but something that should be considered as part of capacity planning.

},
},
{
alert: 'NodeSystemSaturation',
Member:

Not sure about this, this miiight make sense, but I'm also leaning towards not doing it. @SuperQ wdyt?

Contributor Author:

I think this is really helpful for detecting system performance degradation: https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

Member:

The question is whether, in pragmatic reality, there are scenarios where a high load is nothing to warn about. And the article explains how other, non-CPU-related saturation will also increase this. (But thanks for the link, super interesting to see the actual patch that introduced this confusion :))

@v-zhuravlev (Contributor Author) commented Mar 28, 2023:

This alert triggers on load per core. There are always exceptions; in such situations a silence could also help.

Contributor Author:

> The question is whether, in pragmatic reality, there are scenarios where a high load is nothing to warn about. And the article explains how other, non-CPU-related saturation will also increase this.

That's why this alert is called NodeSystemSaturation, not CPUSaturation, btw :)

Member:

I totally remember scenarios where the load metric was high in legitimate use cases without being a problem, but that was long ago, and a lot has changed in the kernel since then. I've heard opinions that load average is now somewhat useful as a metric; others state the opposite, and I don't feel qualified to make the call.

Member:

Maybe also info severity? Alerting/paging on this kinda goes against alerting on actual impact (as opposed to alerting on response times of services running on the overloaded node).

Member:

If the load metric here actually tells us about actual saturation, then I would say "warning" is fine. It is an actionable alert then, at least for the many scenarios where you don't want to run your systems over-saturated all the time. It's just not urgent enough to wake someone up.

IMHO "info" level alerts are for conditions that are completely fine on their own but could be hints towards a possible cause while an incident is happening.

Member:

Yeah, let's convert it to warning.

Contributor Author:

OK, decreased to warning, agreed.
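
For reference, the resulting warning-level rule might look roughly like this jsonnet sketch; the selector macro, the systemSaturationPerCoreThreshold parameter name, and the 'for' duration follow the mixin's conventions but are assumptions here:

  {
    alert: 'NodeSystemSaturation',
    // 1-minute load average divided by the number of CPUs. On Linux this
    // counts runnable plus uninterruptibly-waiting tasks, so a sustained
    // value well above 1 per core indicates saturation, not just CPU use.
    expr: |||
      node_load1{%(nodeExporterSelector)s}
      / count without (cpu, mode) (node_cpu_seconds_total{%(nodeExporterSelector)s, mode="idle"})
      > %(systemSaturationPerCoreThreshold)d
    ||| % $._config,
    'for': '15m',
    labels: {
      severity: 'warning',
    },
    annotations: {
      summary: 'System saturated, load per core is very high.',
      description: 'System load per core at {{ $labels.instance }} has been above %(systemSaturationPerCoreThreshold)d for the last 15 minutes, and is currently at {{ printf "%%.2f" $value }}.' % $._config,
    },
  },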

@discordianfish (Member):

Thanks! I think these need a bit of discussion. @SuperQ @pgier @beorn7 wdyt about these?

@beorn7 (Member) left a comment:

Good stuff. But for some alerts, I just don't feel qualified to judge, see comments.

I have a few punctuation nits, and, more importantly, by making some of the new parameters configurable we can avoid any trouble for those who would like to keep the old ones or don't like the new alerts.

(Two resolved review threads on docs/node-mixin/alerts/alerts.libsonnet, outdated.)
@@ -53,13 +53,13 @@
node_filesystem_readonly{%(nodeExporterSelector)s,%(fsSelector)s,%(fsMountpointSelector)s} == 0
)
||| % $._config,
- 'for': '30m',
+ 'for': '15m',
Member:

I don't feel strongly about the exact time here. Maybe it's worth making it configurable, similar to fsSpaceAvailableWarningThreshold? In a way, the one depends on the other: if your fsSpaceAvailableWarningThreshold is conservative, you might be more relaxed about the 'for' time, and vice versa.

Contributor Author:

Unsure if it is really necessary to make the 'for' timings tunable as well; if you really need to change them, you can patch the mixin with jsonnet (see the sketch below). So I would keep them static.

AFAIK, the idea of 'for' is to be sure that the problem is persistent and won't resolve itself. At the same time, if it is a real problem, 30 minutes feels like too long to wait, especially for actually actionable alerts like 'Filesystem is at critical level and could fill up any minute'. Plus, if the disk is rather small and filling up fast, it may reach 100% in less than 30 minutes.
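
A minimal sketch of such a jsonnet overlay, assuming the standard monitoring-mixins layout; the import path, the alert picked out, and the 45m value are all illustrative:

  // Overlay that overrides the 'for' duration of one alert without forking the mixin.
  local mixin = import 'node-mixin/mixin.libsonnet';  // adjust to your vendored path

  mixin {
    prometheusAlerts+:: {
      groups: [
        group {
          rules: [
            // Pick the alert out by name and patch its 'for' field.
            if std.objectHas(rule, 'alert') && rule.alert == 'NodeFilesystemSpaceFillingUp'
            then rule { 'for': '45m' }  // illustrative value
            else rule
            for rule in super.rules
          ],
        }
        for group in super.groups
      ],
    },
  }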

Member:

Sure, I agree with your reasoning, but you could use the same argument to set it to 5m or to an hour… The idea here is to find a good middle ground between noisy alerts and alerting too late. I don't know if the original value was based on a lot of research; I was hoping you could present numbers like, e.g., "With 15m in our prod setup, we got just 5% more false positives but 40% fewer false negatives".

@@ -71,13 +71,13 @@
node_filesystem_readonly{%(nodeExporterSelector)s,%(fsSelector)s,%(fsMountpointSelector)s} == 0
)
||| % $._config,
- 'for': '30m',
+ 'for': '15m',
Member:

See above.

@@ -129,13 +129,13 @@
node_filesystem_readonly{%(nodeExporterSelector)s,%(fsSelector)s,%(fsMountpointSelector)s} == 0
)
||| % $._config,
- 'for': '1h',
+ 'for': '15m',
Member:

See above. Curious why we had this different from the file space alerts.

@v-zhuravlev (Contributor Author) commented Apr 5, 2023:

AFAIK, the idea of 'for' is to be sure that the problem is persistent and won't resolve itself. At the same time, if it is a real, actionable problem, 1h could be too long to wait. Plus, if the disk is rather small and filling up fast, it may reach 100% in less than 1h.

(A resolved review thread on docs/node-mixin/alerts/alerts.libsonnet.)

(Two more resolved review threads on docs/node-mixin/alerts/alerts.libsonnet, outdated.)
@v-zhuravlev (Contributor Author):

OK, I made all the new alert thresholds configurable; please let me know what you think.
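
With the thresholds exposed through _config, consumers can override them like any other mixin parameter. A minimal sketch, with field names that are assumptions for illustration (the authoritative names live in the mixin's config.libsonnet):

  local mixin = import 'node-mixin/mixin.libsonnet';  // adjust to your vendored path

  mixin {
    _config+:: {
      // Assumed parameter names, for illustration only:
      cpuHighUsageThreshold: 95,            // NodeCPUHighUsage fires above 95% CPU usage
      systemSaturationPerCoreThreshold: 3,  // NodeSystemSaturation fires above load 3 per core
    },
  }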

@beorn7 (Member) left a comment:

Generally, this looks good to me. As said, I don't feel strongly about the 'for' timing. From my point of view, I cannot say whether one is better than the other (but I do like that this PR makes them consistent – that one higher 'for' duration for FDs was weird).

I also cannot really comment on the alert based on load. @discordianfish and @SuperQ, I leave it to you to make the call.

Everything else is LGTM from my side.

@discordianfish (Member):

Besides this one comment, LGTM. I'm also fine with the 'for' changes; 1h/30m feel unusually long. Let's go with the proposed 'for' times for now; we can make them configurable later.

@discordianfish (Member):

Requires rebase

@v-zhuravlev (Contributor Author):

Rebased

@SuperQ (Member) commented May 24, 2023:

Sorry, this needs another rebase in order to pick up some unrelated bug fixes.

@v-zhuravlev (Contributor Author):

np, rebased just now

v-zhuravlev and others added 19 commits June 29, 2023 23:26
This helps to identify the alerting filesystem.

30m is too long; there is a risk of running out of disk space/inodes completely if something is filling up the disk very fast (like a log file).
@discordianfish (Member):

@v-zhuravlev Let me know when this is ready to get reviewed

@v-zhuravlev (Contributor Author):

v-zhuravlev commented Jul 1, 2023 via email

@discordianfish (Member) left a comment:

Great, LGTM!


@SuperQ merged commit ed57c15 into prometheus:master on Jul 4, 2023
2 checks passed
v-zhuravlev added a commit to grafana/node_exporter that referenced this pull request Jul 15, 2023