blockbuilder: Basic alerts #9723
Conversation
Pre-approving, but I would really like to see a per-pod alert for the stuck processing.
alert: $.alertName('BlockBuilderNoCycleProcessing'),
'for': '30m',
expr: |||
  max by(%(alert_aggregation_labels)s) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[1h]))) == 0
Two comments here:
- I would suggest alerting on a per-pod basis, so that we notice if a single block-builder is stuck:
max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[1h]))) == 0
- Doing [1h] with for: 30m, or doing [90m] with for: 1m, is the same thing, so you can easily back-test it. Try to run the query max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[90m]))) == 0 both in dev and prod. You will see some single replicas stuck from time to time. I would like you to investigate the root cause before adding this alert: if they're false positives, we should find a way to fix them; otherwise we should investigate the underlying issue. (A sketch of the per-pod rule follows below.)
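To make the first suggestion concrete, here is a minimal sketch of the per-pod variant, reusing the metric and rule structure from the snippet quoted above. The alert name, severity label, and the 30m for duration are illustrative assumptions, not code from this PR.

alert: $.alertName('BlockBuilderNoCycleProcessingPerPod'),  // hypothetical name
'for': '30m',
expr: |||
  # Group per pod so a single stuck block-builder replica is noticed.
  max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[1h]))) == 0
|||,
labels: { severity: 'warning' },  // assumed: the PR keeps its alerts as warnings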
Force-pushed from 68cba07 to 93f7e63.
Looks good! I left a few comments and suggestions.
This alert fires when the block-builder stops reporting any processed cycles for an unexpectedly long time.

How it **works**:
Suggested change:
- How it **works**:
+ How it works:
We use bold for UI elements, not emphasis.
I'm not sure whether we export the runbooks to the docs site. (Update: we do.)
Here I'm sticking to how all other alerts are structured in this document. Let's look at their formatting in bulk separately.
Sounds good!
- The block-builder periodically consumes a portion of the backlog from its Kafka partition and processes the consumed data into TSDB blocks. The block-builder calls these periods "cycles".
- If no cycles were processed during an extended period of time, that can indicate that a block-builder instance is stuck and cannot complete cycle processing.
How to **investigate**:
Suggested change:
- How to **investigate**:
+ How to investigate:
- Check block-builder logs to see what its pods are busy with. Troubleshoot based on that.
Suggested change:
- - Check block-builder logs to see what its pods are busy with. Troubleshoot based on that.
+ - Check the block-builder logs to see what its pods are busy with. Troubleshoot based on that.
Can we specify "what its pods are busy with"? This sounds a bit jargony and anthropomorphic to me. What do the pods actually do?
Similarly, can we provide more detail into "troubleshoot based on that"? It feels a little open-ended.
I think clarifying the first point will help us to clarify the second.
- It compacts and uploads the produced TSDB blocks to the object storage.
- If the block-builder encounters issues during compaction or uploading of the blocks, it reports the failure metric, which triggers the alert.
How to **investigate**:
Suggested change:
- How to **investigate**:
+ How to investigate:
We should probably also add a warning for when the lag is too high at the start of a consumption. We can choose some manual number for now and tweak it later, or find a better way to alert on this. Perhaps 3-4M is a good number to start?
Recently, when we were treating OOO histograms as a server error, we were lagging quite a bit behind, but the consumption was getting initiated and ending as soon as we hit the error. So the two warnings in this PR probably would not have alerted us to it.
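To pick a starting threshold, one rough approach is to eyeball the current per-pod lag in dev and prod with a query along these lines. This is only a sketch: it reuses the cortex_blockbuilder_consumer_lag_records metric that the lagging alert added later in this PR is built on, and the 3e6 cutoff is simply the lower end of the 3-4M range mentioned above.

# Per-pod consumer lag, filtered to replicas above roughly 3M records.
max by(cluster, namespace, pod) (cortex_blockbuilder_consumer_lag_records) > 3e6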
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
Force-pushed from 738e5f5 to 4cd0cd1.
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
@codesome could you take another look? I've added the
Force-pushed from 18d8064 to d08d36e.
alert: $.alertName('BlockBuilderLagging'),
'for': '1h',
expr: |||
  max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cortex_blockbuilder_consumer_lag_records[60m])) > 4e6
This reports lag in terms of records, doesn't it? This seems less useful than seconds, because 4M records could be 10 seconds or 10 hours depending on the cell. What do you think?
We don't have a metric to measure lag in terms of time. You're right that 4M could mean any time range, but for now this should be a usable warning, and we can migrate over to a time-based measurement with new metrics (maybe the scheduler can help with this).
Cool. Yeah, I think it can fairly easily.
Regarding BlockBuilderLagging:
If the partition was lagging above 4M during a cycle, but caught up and went below 4M by the next cycle, we will still send a warning for an hour because of the 60m lookback in the query. That's because the alert becomes active (pending) as soon as the first cycle starts, and fires for the entire duration of the second cycle.
|<------- 1st cycle ----->|<------ 2nd cycle ------>|
|--------- 5M ------------|----------- 3M ----------|              # lag metric
|                         |                         |
0m                        1h                        2h             # timeline
|----- alert pending -----|--------- firing --------| resolved     # alert state
If we assume that the cycle is 1h long, we can probably just look back 10m (i.e. [10m] in the query), and have the for period be greater than lookback + 1h, so >70m, perhaps 75m? This is to avoid warning if the lag went above 4M during one cycle and was back below 4M in the next cycle.
|<------- 1st cycle ----->|<------ 2nd cycle ------>|
|--------- 5M ------------|----------- 3M ----------|              # lag metric
|                         |   |   |                 |
0m                        1h  70m 75m               2h             # timeline
|-------- alert pending ----------|--- inactive --->               # alert state
WDYT?
Edit:
Queries in action
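For reference, here is a sketch of the BlockBuilderLagging rule with the tuning proposed above: a 10m lookback and a for period longer than lookback plus one 1h cycle. The 75m value is the one floated in this comment; it is not necessarily what the PR ends up shipping.

alert: $.alertName('BlockBuilderLagging'),
'for': '75m',  // > 10m lookback + 1h cycle, per the proposal above
expr: |||
  # Shorter lookback so the alert resolves once the next cycle catches up.
  max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
|||,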
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Force-pushed from 1e2b789 to 20ca095.
What this PR does
This PR adds new alerts that warn about possible scenarios where the block-builder behaves abnormally. See the updated runbook for the details.
Note that I put the alerts as warnings for now, since the block-builder is still in its early experimental phase.