forked from cloudfoundry/diego-release
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindicators.yml.erb
131 lines (108 loc) · 6.61 KB
/
indicators.yml.erb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
---
apiVersion: indicatorprotocol.io/v1
kind: IndicatorDocument
metadata:
labels:
deployment: <%= spec.deployment %>
component: rep_windows
spec:
product:
name: diego
version: latest
indicators:
- name: capacity_remaining_memory
promql: min_over_time(CapacityRemainingMemory{source_id="rep"}[5m]) / 1024
documentation:
title: Diego Cell - Remaining Memory Available - Overall Remaining Memory Available
description: |
Remaining amount of memory in MiB available for this Diego cell to allocate to containers.
Use: Can indicate low memory capacity overall in the platform. Low memory can prevent app scaling and new deployments. The overall sum of capacity can indicate that you need to scale the platform. Observing capacity consumption trends over time helps with capacity planning.
Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
recommended_alert_thresholds: < 35%
recommended_response: |
1. Assign more resources to the cells
2. Assign more cells.
- name: capacity_remaining_disk
promql: min_over_time(CapacityRemainingDisk{source_id="rep"}[5m]) / 1024
documentation:
title: Diego Cell - Remaining Disk Available - Overall Remaining Disk Available
description: |
Remaining amount of disk in MiB available for this Diego cell to allocate to containers.
Use: Low disk capacity can prevent app scaling and new deployments. Because Diego staging Tasks can fail without at least 6 GB free, the recommended red threshold is based on the minimum disk capacity across the deployment falling below 6 GB in the previous 5 minutes.
It can also be meaningful to assess how many chunks of free disk space are above a given threshold, similar to rep.CapacityRemainingMemory.
Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
recommended_alert_thresholds: < 35%
recommended_response: |
1. Assign more resources to the cells.
2. Assign more cells.
- name: cell_unhealthy
promql: max_over_time(CellUnhealthy{source_id="rep"}[5m])
documentation:
title: Diego Cell - Cell Unhealthy
description: |
The Diego cell is unhealthy and unrecoverable if healthcheck container reaches its timeout. For Diego cells, 1 means unhealthy.
Use: If one cell is impacted, it does not participate in auctions and will require a re-create of the cell, but end-user impact is usually low.
Origin: Firehose
Type: Gauge (Float, 1)
Frequency: Once, during event.
recommended_alert_thresholds: |
Yellow warning: = 1
recommended_response: |
1. Investigate Diego cell servers for faults and errors.
2. recreate the cell by running bosh recreate. See the BOSH documentation for bosh recreate command syntax.
- name: garden_health_check_failed
promql: max_over_time(GardenHealthCheckFailed{source_id="rep"}[5m])
documentation:
title: Diego Cell - Garden Healthcheck Failed
description: |
The Diego cell periodically checks its health against the garden backend. For Diego cells, 0 means healthy, and 1 means unhealthy.
Use: Set an alert for further investigation if multiple unhealthy Diego cells are detected in the given time window. If one cell is impacted, it does not participate in auctions, but end-user impact is usually low. If multiple cells are impacted, this can indicate a larger problem with Diego, and should be considered a more critical investigation need.
Suggested alert threshold based on multiple unhealthy cells in the given time window.
Although end-user impact is usually low if only one cell is impacted, this should still be investigated. Particularly in a lower capacity environment, this situation could result in negative end-user impact if left unresolved.
Origin: Firehose
Type: Gauge (Float, 0-1)
Frequency: 30 s
recommended_alert_thresholds: |
Yellow warning: = 1
Red critical: > 1
recommended_response: |
1. Investigate Diego cell servers for faults and errors.
2. If a particular cell or cells appear problematic:
a. Determine a time interval during which the metrics from the cell changed from healthy to unhealthy.
b. Pull the logs that the cell generated over that interval. The Cell ID is the same as the BOSH instance ID.
c. Pull the BBS logs over that same time interval.
3. As a last resort, it sometimes helps to recreate the cell by running bosh recreate. See the BOSH documentation for bosh recreate command syntax.
- name: app_instance_exceeded_log_rate_limit_count
promql: rate(AppInstanceExceededLogRateLimitCount{source_id="rep"}[5m]) * 60
documentation:
title: Diego Cell - App Instances Exceeded Log Rate Limit
description: |
The number of application instances that exceeded log rate limit. This metric is emitted once per 5 minutes only for each app instance.
Use: Indicates the number of application instances that are being log rate limitted. It can be used to analyse and adjust the log rate limit property in Diego or as an indicator that operators should reach application developers to reduce the volume of logs being emitted by their applications.
Origin: Firehose
Type: Counter (Integer)
Frequency: During event, when app instance exceeds log rate limit within last 5 minutes
recommended_alert_thresholds: Dynamic
recommended_response: |
1. If this number is too high analyze whether the Diego log rate limit property needs to be adjusted to allow higher volume of logs being emitted by applications.
2. Using cf plugin "top" find top applications that generate high volume of logs and recommend their developers to reduce the volume of logs generated by their applications.
- name: rep_bulk_sync_duration
promql: max_over_time(RepBulkSyncDuration{source_id="rep"}[15m]) / 1000000000
documentation:
title: Diego Cell - Time to Sync
description: |
Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual garden containers.
Use: Sync times that are too high can indicate issues with the BBS.
Origin: Firehose
Type: Gauge (Float in ns)
Frequency: 30 s
recommended_alert_thresholds: |
Yellow warning: ≥ 5 s
Red critical: ≥ 10 s
recommended_response: |
1. Investigate BBS logs for faults and errors.
2. If a particular cell or cells appear problematic, investigate logs for the cells.