
blockbuilder: Basic alerts #9723

Merged 11 commits into main on Oct 30, 2024
Conversation

narqo
Contributor

@narqo narqo commented Oct 23, 2024

What this PR does

This PR adds new alerts that warn about scenarios where the block-builder behaves abnormally. See the updated runbook for details.

Note that I've set the alerts as warnings for now, since the block-builder is still in its early experimental phase.

@narqo narqo requested review from tacole02 and a team as code owners October 23, 2024 12:29
Collaborator

@pracucci pracucci left a comment

Pre-approving, but I would really like to see a per-pod alert for the stuck processing.

docs/sources/mimir/manage/mimir-runbooks/_index.md (outdated; resolved)
alert: $.alertName('BlockBuilderNoCycleProcessing'),
'for': '30m',
expr: |||
max by(%(alert_aggregation_labels)s) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[1h]))) == 0
Collaborator

Two comments here:

  1. I would suggest alerting on a per-pod basis, so that we notice if a single block-builder is stuck: max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[1h]))) == 0
  2. Doing [1h] with for: 30m is equivalent to doing [90m] with for: 1m, so you can easily back-test it. Try running the query max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[90m]))) == 0 in both dev and prod. You will see some single replicas stuck from time to time. I would like you to investigate the root cause before adding this alert: if they're false positives, we should find a way to fix them; otherwise, we should investigate the issue.
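
For illustration, a rough sketch of how the suggested per-pod variant could slot into the mixin next to the rule quoted above. The alert name is hypothetical and the for duration is simply carried over from the existing rule; only the expression comes from the suggestion itself:

alert: $.alertName('BlockBuilderStuckProcessing'),  // hypothetical name, not the merged rule
'for': '30m',                                       // kept from the existing rule above
expr: |||
  max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[1h]))) == 0
|||,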

@narqo narqo force-pushed the vldmr/bb-alerts branch 2 times, most recently from 68cba07 to 93f7e63 on October 23, 2024 16:55
Contributor

@tacole02 tacole02 left a comment

Looks good! I left a few comments and suggestions.


This alert fires when the block-builder stops reporting any processed cycles for an unexpectedly long time.

How it **works**:
Contributor

Suggested change
How it **works**:
How it works:

Contributor

We use bold for UI elements, not emphasis.

Contributor Author

@narqo narqo Oct 28, 2024

I'm not sure we export the runbooks to the docs site. (Update: we do.)

Here I'm sticking to how all other alerts are structured in this document. Let's look at their formatting in bulk separately.

Contributor

Sounds good!

docs/sources/mimir/manage/mimir-runbooks/_index.md (outdated comments, resolved)
- Block-builder periodically consumes a portion of the backlog from the Kafka partition and processes the consumed data into TSDB blocks. The block-builder calls these periods "cycles".
- If no cycles were processed during an extended period of time, that can indicate that a block-builder instance is stuck and cannot complete cycle processing.

How to **investigate**:
Contributor

Suggested change
How to **investigate**:
How to investigate:


How to **investigate**:

- Check block-builder logs to see what its pods are busy with. Troubleshoot based on that.
Contributor

Suggested change
- Check block-builder logs to see what its pods are busy with. Troubleshoot based on that.
- Check the block-builder logs to see what its pods are busy with. Troubleshoot based on that.

Contributor

Can we specify "what its pods are busy with"? This sounds a bit jargony and anthropomorphic to me. What do the pods actually do?

Contributor

Similarly, can we provide more detail on "troubleshoot based on that"? It feels a little open-ended.

I think clarifying the first point will help us to clarify the second.

docs/sources/mimir/manage/mimir-runbooks/_index.md (outdated comments, resolved)
- It compacts and uploads the produced TSDB blocks to the object storage.
- If the block-builder encounters issues during compaction or uploading of the blocks, it reports the failure metric, which triggers the alert.

How to **investigate**:
Contributor

Suggested change
How to **investigate**:
How to investigate:

docs/sources/mimir/manage/mimir-runbooks/_index.md (outdated; resolved)
Member

@codesome codesome left a comment

We should probably also add a warning for when the lag is too high at the start of a consumption cycle. We can choose a manual threshold for now and tweak it later, or find a better way to alert on this. Perhaps 3-4M records is a good number to start with?

Recently, when we were treating OOO histograms as a server error, we were lagging quite a bit behind, but consumption was getting initiated and then ending as soon as we hit the error. So the two warnings in this PR probably would not have alerted us to it.
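
For illustration, a rough sketch of what such a lag warning could look like, using the cortex_blockbuilder_consumer_lag_records metric that the thread settles on below. The alert name here is a placeholder, and the threshold and for duration are just the numbers floated in this comment, not necessarily the final values:

alert: $.alertName('BlockBuilderLagging'),  // placeholder name for this sketch
'for': '1h',
expr: |||
  max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (cortex_blockbuilder_consumer_lag_records) > 4e6
|||,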

narqo and others added 7 commits October 28, 2024 12:53
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
@narqo narqo requested a review from codesome October 28, 2024 13:27
Contributor Author

narqo commented Oct 28, 2024

@codesome could you take another look? I've added the MimirBlockBuilderLaging warning.

alert: $.alertName('BlockBuilderLaging'),
'for': '1h',
expr: |||
max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cortex_blockbuilder_consumer_lag_records[60m])) > 4e6
Contributor

This reports lag in terms of records, doesn't it? This seems less useful than seconds, because 4M records could be 10 seconds or 10 hours depending on the cell. What do you think?

Member

We don't have a metric to measure lag in terms of time. You're right that 4M could mean any time range, but for now this should be a usable warning, and we can migrate to a time-based measurement with new metrics later (maybe the scheduler can help with this).

Contributor

Cool. Yeah, I think it can fairly easily.

operations/mimir-mixin/alerts/ingest-storage.libsonnet (outdated; resolved)
Member

@codesome codesome left a comment

Regarding BlockBuilderLagging:

If the partition was lagging above 4M during a cycle, but caught up and went below 4M by the next cycle, we will still send a warning for an hour because of the 60m lookback in the query: the alert becomes active as soon as the first cycle starts (and stays pending), then fires for the entire duration of the second cycle.

| <-1st cycle->    |<---2nd cycle-->
|------- 5M -------|------- 3M --------|           # lag metric
|                  |                   |
0m                 1h                  2h          # timeline
|---alert pending--|-------firing------| resolved  # alert state

If we assume that the cycle is 1 hour long, we can probably just look back 10m (i.e. [10m] in the query) and set the for period to be greater than lookback + 1h, so >70m, perhaps 75m? This is to avoid warning if the lag went above 4M during one cycle and was below 4M again in the next.

| <-1st cycle->    |<---2nd cycle-->
|------- 5M -------|------- 3M --------|  # lag metric
|                  |   |   |           |
0m                 1h  70m 75m         2h # timeline
|-----alert pending----|--inactive--->    # alert state

WDYT?
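
For illustration, here is roughly how that adjustment would look applied to the expression quoted earlier in the thread, assuming a 1h cycle; only the lookback window and for duration change from the original snippet:

alert: $.alertName('BlockBuilderLaging'),
'for': '75m',   // > 10m lookback + assumed 1h cycle
expr: |||
  max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
|||,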


Edit:

Queries in action

60m lookback: [screenshot of the query results]

10m lookback: [screenshot of the query results]

narqo added 2 commits October 29, 2024 10:56
@narqo narqo requested a review from codesome October 29, 2024 10:42
@narqo narqo merged commit ad2ecd3 into main Oct 30, 2024
31 checks passed
@narqo narqo deleted the vldmr/bb-alerts branch October 30, 2024 11:26
5 participants