Commit ad2ecd3

narqo, pracucci, and tacole02 authored
blockbuilder: Basic alerts (#9723)
* mimir-mixin: basic alerting for block-builder

Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>

* runbook

Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>

* rebuild assets

Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>

* Update docs/sources/mimir/manage/mimir-runbooks/_index.md

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* per-instance alerting

Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>

* rebuild assets

Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>

* Apply suggestions from code review

Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>

* add MimirBlockBuilderLaging

Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>

* fixup! rebuild assets

* improve MimirBlockBuilderLagging

Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>

* fixup! rebuild assets

---------

Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
1 parent eda1a4b commit ad2ecd3

5 files changed: +169 -0 lines changed

docs/sources/mimir/manage/mimir-runbooks/_index.md

Lines changed: 41 additions & 0 deletions
@@ -1611,6 +1611,47 @@ How to **fix**:

 1. Once ingesters are stable, revert the temporary config applied in the previous step.

+### MimirBlockBuilderNoCycleProcessing
+
+This alert fires when the block-builder stops reporting any processed cycles for an unexpectedly long time.
+
+How it **works**:
+
+- The block-builder periodically consumes a portion of the backlog from a Kafka partition and processes the consumed data into TSDB blocks. The block-builder calls these periods "cycles".
+- If the block-builder doesn't process any cycles for an extended period of time, this could indicate that a block-builder instance is stuck and cannot complete cycle processing.
+
+How to **investigate**:
+
+- Check the block-builder logs to see what its pods have been busy with. The block-builder logs `start consuming` and `done consuming` messages, which mark per-partition consume cycles. These log records include details about the cycle, such as the Kafka topic's offsets. Use them to troubleshoot.
+
+### MimirBlockBuilderLagging
+
+This alert fires when the block-builder instances report a large number of unprocessed records in the Kafka partitions.
+
+How it **works**:
+
+- When the block-builder starts a new consume cycle, it checks how many records the Kafka partition has in the backlog. This number is tracked in the `cortex_blockbuilder_consumer_lag_records` metric.
+- The block-builder must consume and process these records into TSDB blocks.
+- At the end of the processing, the block-builder commits the offset of the last fully processed record into Kafka.
+- If the block-builder reports high lag values, this could indicate that a block-builder instance cannot fully process and commit Kafka records.
+
+How to **investigate**:
+
+- Check if the per-partition lag, reported by the `cortex_blockbuilder_consumer_lag_records` metric, has been growing over the past hours.
+- Explore the block-builder logs for any errors reported while it processed the partition.
+
+### MimirBlockBuilderCompactAndUploadFailed
+
+How it **works**:
+
+- The block-builder periodically consumes data from a Kafka topic and processes the consumed data into TSDB blocks.
+- It compacts and uploads the produced TSDB blocks to object storage.
+- If the block-builder encounters issues while compacting or uploading the blocks, it reports the failure metric, which then triggers the alert.
+
+How to **investigate**:
+
+- Check the block-builder logs for errors.
+
 ## Errors catalog

 Mimir has some codified error IDs that you might see in HTTP responses or logs.
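
The runbook's log-based investigation for `MimirBlockBuilderNoCycleProcessing` can be sketched as a small script that pairs `start consuming` and `done consuming` messages and lists partitions whose cycle never finished. This is only an illustration: the two message strings come from the runbook, but the logfmt layout and the `part` field name are assumptions made for this sketch, not the block-builder's confirmed log format.

```python
import re

# Matches logfmt key=value pairs, where the value may be double-quoted.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_logfmt(line):
    """Parse one logfmt line into a dict, stripping surrounding quotes."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def unfinished_partitions(lines):
    """Return partitions that logged `start consuming` with no later `done consuming`.

    The `part` field name is hypothetical; adjust it to the real log records.
    """
    open_cycles = set()
    for line in lines:
        fields = parse_logfmt(line)
        part = fields.get("part")
        if part is None:
            continue
        if fields.get("msg") == "start consuming":
            open_cycles.add(part)
        elif fields.get("msg") == "done consuming":
            open_cycles.discard(part)
    return sorted(open_cycles)


# Made-up sample lines, for illustration only.
sample = [
    'level=info msg="start consuming" part=0',
    'level=info msg="start consuming" part=1',
    'level=info msg="done consuming" part=0',
]
print(unfinished_partitions(sample))
```

A partition that stays in the returned list across several expected cycle lengths is a reasonable place to start reading the surrounding log records.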

operations/helm/tests/metamonitoring-values-generated/mimir-distributed/templates/metamonitoring/mixin-alerts.yaml

Lines changed: 27 additions & 0 deletions
@@ -1163,6 +1163,33 @@ spec:
           for: 5m
           labels:
             severity: critical
+        - alert: MimirBlockBuilderNoCycleProcessing
+          annotations:
+            message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} has not processed cycles in the past hour.
+            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildernocycleprocessing
+          expr: |
+            max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[60m]))) == 0
+          for: 5m
+          labels:
+            severity: warning
+        - alert: MimirBlockBuilderLagging
+          annotations:
+            message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} reports partition lag of {{ printf "%.2f" $value }}%.
+            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuilderlagging
+          expr: |
+            max by(cluster, namespace, pod) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
+          for: 75m
+          labels:
+            severity: warning
+        - alert: MimirBlockBuilderCompactAndUploadFailed
+          annotations:
+            message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to compact and upload blocks.
+            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildercompactanduploadfailed
+          expr: |
+            sum by (cluster, namespace, pod) (rate(cortex_blockbuilder_tsdb_compact_and_upload_failed_total[1m])) > 0
+          for: 5m
+          labels:
+            severity: warning
     - name: mimir_continuous_test
       rules:
         - alert: MimirContinuousTestNotRunningOnWrites
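
The `MimirBlockBuilderNoCycleProcessing` expression above compares the number of cycle-duration observations recorded in the last 60 minutes with zero. The sketch below approximates that check on a plain list of `(timestamp, cumulative_count)` samples. It is a simplification: it ignores the extrapolation that PromQL's `increase` applies, and the function name and sample layout are invented for illustration.

```python
def cycles_in_window(samples, now, window_s=3600):
    """Approximate histogram_count(increase(...[60m])) for one series.

    `samples` is a list of (unix_seconds, cumulative_observation_count) pairs.
    Returns None when fewer than two samples fall in the window, mirroring the
    fact that PromQL returns no result in that case.
    """
    in_window = [count for ts, count in samples if now - window_s < ts <= now]
    if len(in_window) < 2:
        return None
    increase = 0
    for prev, cur in zip(in_window, in_window[1:]):
        # A drop in the cumulative count is treated as a counter reset.
        increase += cur - prev if cur >= prev else cur
    return increase


# A series that is still scraped but whose count has stopped growing.
stuck = [(ts, 12) for ts in range(600, 3601, 600)]
print(cycles_in_window(stuck, now=3600) == 0)
```

One consequence worth keeping in mind: when the series is absent altogether, the comparison with zero has nothing to match, so this expression targets instances that are still scraped but no longer complete cycles.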

operations/mimir-mixin-compiled-baremetal/alerts.yaml

Lines changed: 27 additions & 0 deletions
@@ -1137,6 +1137,33 @@ groups:
           for: 5m
           labels:
             severity: critical
+        - alert: MimirBlockBuilderNoCycleProcessing
+          annotations:
+            message: Mimir {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace }} has not processed cycles in the past hour.
+            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildernocycleprocessing
+          expr: |
+            max by(cluster, namespace, instance) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[60m]))) == 0
+          for: 5m
+          labels:
+            severity: warning
+        - alert: MimirBlockBuilderLagging
+          annotations:
+            message: Mimir {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace }} reports partition lag of {{ printf "%.2f" $value }}%.
+            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuilderlagging
+          expr: |
+            max by(cluster, namespace, instance) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
+          for: 75m
+          labels:
+            severity: warning
+        - alert: MimirBlockBuilderCompactAndUploadFailed
+          annotations:
+            message: Mimir {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to compact and upload blocks.
+            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildercompactanduploadfailed
+          expr: |
+            sum by (cluster, namespace, instance) (rate(cortex_blockbuilder_tsdb_compact_and_upload_failed_total[1m])) > 0
+          for: 5m
+          labels:
+            severity: warning
     - name: mimir_continuous_test
       rules:
         - alert: MimirContinuousTestNotRunningOnWrites

operations/mimir-mixin-compiled/alerts.yaml

Lines changed: 27 additions & 0 deletions
@@ -1151,6 +1151,33 @@ groups:
           for: 5m
           labels:
             severity: critical
+        - alert: MimirBlockBuilderNoCycleProcessing
+          annotations:
+            message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} has not processed cycles in the past hour.
+            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildernocycleprocessing
+          expr: |
+            max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[60m]))) == 0
+          for: 5m
+          labels:
+            severity: warning
+        - alert: MimirBlockBuilderLagging
+          annotations:
+            message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} reports partition lag of {{ printf "%.2f" $value }}%.
+            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuilderlagging
+          expr: |
+            max by(cluster, namespace, pod) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
+          for: 75m
+          labels:
+            severity: warning
+        - alert: MimirBlockBuilderCompactAndUploadFailed
+          annotations:
+            message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to compact and upload blocks.
+            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildercompactanduploadfailed
+          expr: |
+            sum by (cluster, namespace, pod) (rate(cortex_blockbuilder_tsdb_compact_and_upload_failed_total[1m])) > 0
+          for: 5m
+          labels:
+            severity: warning
     - name: mimir_continuous_test
       rules:
         - alert: MimirContinuousTestNotRunningOnWrites

operations/mimir-mixin/alerts/ingest-storage.libsonnet

Lines changed: 47 additions & 0 deletions
@@ -212,6 +212,53 @@
             message: '%(product)s {{ $labels.%(per_instance_label)s }} in %(alert_aggregation_variables)s Kafka client produce buffer utilization is {{ printf "%%.2f" $value }}%%.' % $._config,
           },
         },
+
+        // Alert if block-builder didn't process cycles in the past hour.
+        {
+          alert: $.alertName('BlockBuilderNoCycleProcessing'),
+          'for': '5m',
+          expr: |||
+            max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[60m]))) == 0
+          ||| % $._config,
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: '%(product)s {{ $labels.%(per_instance_label)s }} in %(alert_aggregation_variables)s has not processed cycles in the past hour.' % $._config,
+          },
+        },
+
+        // Alert if block-builder per-partition lag is higher than the threshold.
+        // The value of the threshold is arbitrarily large for now. We will reconsider this alert after we get the block-builder-scheduler.
+        // Note on "for: 75m": we assume one cycle is 1hr; with 10m lookback we expect the warning to trigger only if the metric is above the threshold for more than one cycle.
+        {
+          alert: $.alertName('BlockBuilderLagging'),
+          'for': '75m',
+          expr: |||
+            max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
+          ||| % $._config,
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: '%(product)s {{ $labels.%(per_instance_label)s }} in %(alert_aggregation_variables)s reports partition lag of {{ printf "%%.2f" $value }}%%.' % $._config,
+          },
+        },
+
+        // Alert if block-builder is failing to compact and upload any blocks.
+        {
+          alert: $.alertName('BlockBuilderCompactAndUploadFailed'),
+          'for': '5m',
+          expr: |||
+            sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_blockbuilder_tsdb_compact_and_upload_failed_total[1m])) > 0
+          ||| % $._config,
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: '%(product)s {{ $labels.%(per_instance_label)s }} in %(alert_aggregation_variables)s fails to compact and upload blocks.' % $._config,
+          },
+        },
       ],
     },
   ],
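
The note on `for: 75m` in the `BlockBuilderLagging` comment can be checked with a small simulation. The sketch below assumes that the lag gauge is set once at the start of each one-hour cycle and holds that value until the next cycle, and that scrapes and rule evaluations both happen every 60 seconds. These assumptions and all function names are illustrative, not taken from the block-builder or Prometheus code.

```python
def gauge_samples(lag_per_cycle, cycle_s=3600, scrape_s=60):
    """Build (timestamp, value) samples for a gauge that is set once per cycle."""
    samples = []
    for i, lag in enumerate(lag_per_cycle):
        for ts in range(i * cycle_s, (i + 1) * cycle_s, scrape_s):
            samples.append((ts, lag))
    return samples


def max_over_time(samples, t, lookback_s):
    """Largest sample value in the window (t - lookback_s, t], or None if empty."""
    window = [v for ts, v in samples if t - lookback_s < ts <= t]
    return max(window) if window else None


def firing_times(samples, threshold=4e6, lookback_s=600, for_s=75 * 60, eval_s=60):
    """Return the evaluation timestamps at which the alert would be firing."""
    pending_since = None
    firing = []
    for t in range(0, samples[-1][0] + 1, eval_s):
        value = max_over_time(samples, t, lookback_s)
        if value is not None and value > threshold:
            if pending_since is None:
                pending_since = t
            # The alert fires once the condition has held for the whole `for` period.
            if t - pending_since >= for_s:
                firing.append(t)
        else:
            pending_since = None
    return firing


one_bad_cycle = gauge_samples([5e6, 1e6, 1e6])
two_bad_cycles = gauge_samples([5e6, 5e6, 1e6])
print(len(firing_times(one_bad_cycle)), len(firing_times(two_bad_cycles)))
```

Under these assumptions, a single cycle above the threshold keeps the condition true for roughly 60 minutes plus the 10-minute lookback, which is shorter than 75 minutes, so the alert stays pending. Only a second consecutive cycle above the threshold carries it past the `for` period.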
