
blockbuilder: Basic alerts #9723

Merged 11 commits into main on Oct 30, 2024
Conversation

narqo
Contributor

@narqo narqo commented Oct 23, 2024

What this PR does

This PR adds new alerts that warn about scenarios where the block-builder behaves abnormally. See the updated runbook for details.

Note that I've set the alerts as warnings for now, since the block-builder is still in its early experimental phase.

@narqo narqo requested review from tacole02 and a team as code owners October 23, 2024 12:29
Collaborator

@pracucci pracucci left a comment

Pre-approving, but I would really like to see a per-pod alert for the stuck processing.

docs/sources/mimir/manage/mimir-runbooks/_index.md (outdated; resolved)
alert: $.alertName('BlockBuilderNoCycleProcessing'),
'for': '30m',
expr: |||
max by(%(alert_aggregation_labels)s) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[1h]))) == 0
Collaborator

Two comments here:

  1. I would suggest alerting on a per-pod basis, so that we notice if a single block-builder is stuck: max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[1h]))) == 0
  2. Doing [1h] with for: 30m is equivalent to doing [90m] with for: 1m, so you can easily back-test it. Try running the query max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[90m]))) == 0 in both dev and prod. You will see some single replicas stuck from time to time. I would like you to investigate the root cause before adding this alert: if they're false positives, we should find a way to fix them; otherwise, we should investigate the issue.
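
For illustration, a rough sketch of how the suggested per-pod variant could slot into the mixin next to the rule quoted above. The alert name is hypothetical and the for duration is simply carried over from the existing rule; only the expression comes from the suggestion itself:

alert: $.alertName('BlockBuilderStuckProcessing'),  // hypothetical name, not the merged rule
'for': '30m',                                       // kept from the existing rule above
expr: |||
  max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[1h]))) == 0
|||,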

@narqo narqo force-pushed the vldmr/bb-alerts branch 2 times, most recently from 68cba07 to 93f7e63 on October 23, 2024 16:55
Contributor

@tacole02 tacole02 left a comment

Looks good! I left a few comments and suggestions.


This alert fires when the block-builder stops reporting any processed cycles for an unexpectedly long time.

How it **works**:
Contributor

Suggested change
How it **works**:
How it works:

Contributor

We use bold for UI elements, not emphasis.

Contributor Author

@narqo narqo Oct 28, 2024

I'm not sure we export the runbooks to the docs site. (Update: we do.)

Here I'm sticking to how all other alerts are structured in this document. Let's look at their formatting in bulk separately.

Contributor

Sounds good!

docs/sources/mimir/manage/mimir-runbooks/_index.md (outdated comments, resolved)
- Block-builder periodically consumes a portion of the backlog from the Kafka partition and processes the consumed data into TSDB blocks. The block-builder calls these periods "cycles".
- If no cycles were processed during an extended period of time, that can indicate that a block-builder instance is stuck and cannot complete cycle processing.

How to **investigate**:
Contributor

Suggested change
How to **investigate**:
How to investigate:


How to **investigate**:

- Check block-builder logs to see what its pods are busy with. Troubleshoot based on that.
Contributor

Suggested change
- Check block-builder logs to see what its pods are busy with. Troubleshoot based on that.
- Check the block-builder logs to see what its pods are busy with. Troubleshoot based on that.

Contributor

Can we specify "what its pods are busy with"? This sounds a bit jargony and anthropomorphic to me. What do the pods actually do?

Contributor

Similarly, can we provide more detail on "troubleshoot based on that"? It feels a little open-ended.

I think clarifying the first point will help us to clarify the second.

docs/sources/mimir/manage/mimir-runbooks/_index.md (outdated comments, resolved)
- It compacts and uploads the produced TSDB blocks to the object storage.
- If the block-builder encounters issues during compaction or uploading of the blocks, it reports the failure metric, which triggers the alert.

How to **investigate**:
Contributor

Suggested change
How to **investigate**:
How to investigate:

docs/sources/mimir/manage/mimir-runbooks/_index.md (outdated; resolved)
Member

@codesome codesome left a comment

We should probably also add a warning for when the lag is too high at the start of a consumption cycle. We can choose a manual threshold for now and tweak it later, or find a better way to alert on this. Perhaps 3-4M records is a good number to start with?

Recently, when we were treating OOO histograms as a server error, we were lagging quite a bit behind, but consumption was getting initiated and then ending as soon as we hit the error. So the two warnings in this PR probably would not have alerted us to it.
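
For illustration, a rough sketch of what such a lag warning could look like, using the cortex_blockbuilder_consumer_lag_records metric that the thread settles on below. The alert name here is a placeholder, and the threshold and for duration are just the numbers floated in this comment, not necessarily the final values:

alert: $.alertName('BlockBuilderLagging'),  // placeholder name for this sketch
'for': '1h',
expr: |||
  max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (cortex_blockbuilder_consumer_lag_records) > 4e6
|||,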

narqo and others added 7 commits October 28, 2024 12:53
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
@narqo narqo requested a review from codesome October 28, 2024 13:27
Contributor Author

narqo commented Oct 28, 2024

@codesome could you take another look? I've added the MimirBlockBuilderLaging warning.

alert: $.alertName('BlockBuilderLaging'),
'for': '1h',
expr: |||
max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cortex_blockbuilder_consumer_lag_records[60m])) > 4e6
Contributor

This reports lag in terms of records, doesn't it? This seems less useful than seconds, because 4M records could be 10 seconds or 10 hours depending on the cell. What do you think?

Member

We don't have a metric to measure lag in terms of time. You're right that 4M could mean any time range, but for now this should be a usable warning, and we can migrate to a time-based measurement with new metrics later (maybe the scheduler can help with this).

Contributor

Cool. Yeah, I think it can fairly easily.

operations/mimir-mixin/alerts/ingest-storage.libsonnet (outdated; resolved)
Member

@codesome codesome left a comment

Regarding BlockBuilderLagging:

If the partition was lagging above 4M during a cycle, but caught up and went below 4M by the next cycle, we will still send a warning for an hour because of the 60m lookback in the query: the alert becomes active as soon as the first cycle starts (and stays pending), then fires for the entire duration of the second cycle.

| <-1st cycle->    |<---2nd cycle-->
|------- 5M -------|------- 3M --------|           # lag metric
|                  |                   |
0m                 1h                  2h          # timeline
|---alert pending--|-------firing------| resolved  # alert state

If we assume that the cycle is 1 hour long, we can probably just look back 10m (i.e. [10m] in the query) and set the for period to be greater than lookback + 1h, so >70m, perhaps 75m? This is to avoid warning if the lag went above 4M during one cycle and was below 4M again in the next.

| <-1st cycle->    |<---2nd cycle-->
|------- 5M -------|------- 3M --------|  # lag metric
|                  |   |   |           |
0m                 1h  70m 75m         2h # timeline
|-----alert pending----|--inactive--->    # alert state

WDYT?
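
For illustration, here is roughly how that adjustment would look applied to the expression quoted earlier in the thread, assuming a 1h cycle; only the lookback window and for duration change from the original snippet:

alert: $.alertName('BlockBuilderLaging'),
'for': '75m',   // > 10m lookback + assumed 1h cycle
expr: |||
  max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
|||,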


Edit:

Queries in action

60m lookback: [screenshot of the query results]

10m lookback: [screenshot of the query results]

narqo added 2 commits October 29, 2024 10:56
@narqo narqo requested a review from codesome October 29, 2024 10:42
@narqo narqo merged commit ad2ecd3 into main Oct 30, 2024
31 checks passed
@narqo narqo deleted the vldmr/bb-alerts branch October 30, 2024 11:26
5 participants