Add or update runbooks for alarms of anomalous behavior (#3420)
krysal authored Dec 6, 2023
1 parent 9b44496 commit cd869bf
Showing 14 changed files with 94 additions and 16 deletions.
@@ -1,12 +1,12 @@
# Run Book: API Production Request Count above threshold
# Run Book: API Production Request Count anomalously high

```{admonition} Metadata
Status: **Unstable**
Maintainer: @krysaldb
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Production+Request+Count+above+threshold?>
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Production+Request+Count+anomalously+high?>
```
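
The alarm links in these run books share one URL shape: the CloudWatch console base for `us-east-1`, followed by `#alarmsV2:alarm/` and the alarm name with spaces written as `+`. A minimal sketch of building such a link from an alarm name is shown here; the helper name `alarm_console_url` is hypothetical and not part of the Openverse repository, and the pattern is inferred only from the links in this commit.

```python
from urllib.parse import quote_plus


def alarm_console_url(alarm_name: str, region: str = "us-east-1") -> str:
    """Build a CloudWatch console link for an alarm (hypothetical helper).

    Assumption: the URL pattern is inferred from the links in these run
    books, not taken from an AWS specification.
    """
    base = f"https://{region}.console.aws.amazon.com/cloudwatch/home"
    # quote_plus encodes spaces as "+", matching the links in the run books.
    return f"{base}?region={region}#alarmsV2:alarm/{quote_plus(alarm_name)}"


url = alarm_console_url("API Production Request Count anomalously high")
print(url)
```

Some links in this commit end with a trailing `?` and others do not; the sketch omits it.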

## Severity Guide
@@ -19,10 +19,10 @@ future resource scaling depending on the kind of traffic.
If the services are strained then the severity is critical, search for the root
cause to prevent more serious outages. If there are no recent obvious
integrations (like the Gutenberg plugin) then follow the run book to [identify
traffic anomalies in Cloudflare][runbook_traffic], to determine whether the
recent traffic is organic or if it comes from a botnet. Find the origin of
requests and evaluate whether it needs to be blocked or if Openverse services
need to adapt to the new demand.
traffic anomalies][runbook_traffic], to determine whether the recent traffic is
organic or if it comes from a botnet. Find the origin of requests and evaluate
whether it needs to be blocked or if Openverse services need to adapt to the new
demand.

[runbook_traffic]:
https://docs.openverse.org/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.html
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**
Maintainer: @stacimc
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Average+Response+Time+above+threshold>
```
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**
Maintainer: @stacimc
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Average+Response+Time+anomalously+high>
```
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**
Maintainer: @stacimc
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+P99+Response+Time+above+threshold>
```
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**
Maintainer: @stacimc
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+P99+Response+Time+anomalously+high>
```
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**
Maintainer: @krysaldb
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Request+Count+anomalously+high>
```
10 changes: 6 additions & 4 deletions documentation/meta/monitoring/runbooks/index.md
@@ -12,24 +12,26 @@ that can be a good resource when writing a new one.
```{toctree}
:titlesonly:
api_request_count_above_threshold
api_http_2xx_under_threshold
api_http_5xx_above_threshold
api_avg_response_time_above_threshold
api_avg_response_time_anomaly
api_p99_response_time_above_threshold
api_p99_response_time_anomaly
api_request_count_anomaly
api_thumbnails_http_2xx_under_threshold
api_thumbnails_http_5xx_above_threshold
api_thumbnails_request_count_anomaly
api_thumbnails_avg_response_time_above_threshold
api_thumbnails_avg_response_time_anomaly
api_thumbnails_p99_response_time_above_threshold
api_thumbnails_p99_response_time_anomaly
nuxt_request_count
nuxt_2xx_under_threshold
nuxt_5xx_above_threshold
nuxt_http_2xx_under_threshold
nuxt_http_5xx_above_threshold
nuxt_avg_response_time_above_threshold
nuxt_avg_response_time_anomaly
nuxt_p99_response_time_above_threshold
nuxt_p99_response_time_anomaly
nuxt_request_count_anomaly
unhealthy_ecs_hosts
```
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**
Maintainer: @obulat
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+Average+Response+Time+above+threshold?>
```
@@ -0,0 +1,31 @@
# Run Book: Nuxt Production Average Response Time anomalously high

```{admonition} Metadata
Status: **Unstable**
Maintainer: @obulat
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+Average+Response+Time+anomalously+high?>
```

## Severity Guide

Confirm that there is not a total outage of the service. If not, the severity is
likely low. Check for the request count and general network activity. If
abnormally high, refer to the [traffic analysis run book][traffic_runbook] to
identify and block any malicious traffic. If not, then check for a recent
deployment that may have introduced a problem, and [rollback][rollback_docs] to
the previous version if necessary.

[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
[rollback_docs]: /general/deployment.md#rollbacks
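
The decision order in the severity guide above can be sketched as a small function. This is an illustration only, not part of the Openverse tooling: the names `Action` and `triage` are hypothetical, the three boolean inputs stand for checks a person makes by hand, and the outage branch and the final fallback are assumptions, since the guide does not say what to do in those cases.

```python
from enum import Enum


class Action(Enum):
    ESCALATE_OUTAGE = "treat as an outage"          # assumption: not spelled out in the guide
    ANALYZE_TRAFFIC = "follow the traffic analysis run book"
    CONSIDER_ROLLBACK = "roll back the recent deployment if necessary"
    KEEP_MONITORING = "severity likely low, keep monitoring"  # assumption


def triage(total_outage: bool, traffic_abnormally_high: bool,
           recent_deployment: bool) -> Action:
    """Mirror the order of checks in the severity guide (hypothetical helper)."""
    if total_outage:
        return Action.ESCALATE_OUTAGE
    if traffic_abnormally_high:
        return Action.ANALYZE_TRAFFIC
    if recent_deployment:
        return Action.CONSIDER_ROLLBACK
    return Action.KEEP_MONITORING


print(triage(total_outage=False, traffic_abnormally_high=True,
             recent_deployment=True).value)
```

The checks are ordered so that abnormal traffic is investigated before a deployment is blamed, matching the order of the paragraph.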

## Historical false positives

Nothing registered to date.

## Related incident reports

Nothing registered to date.
@@ -1,4 +1,4 @@
# Run Book: Nuxt 2XX request count under threshold
# Run Book: Nuxt 2XX responses count under threshold

```{admonition} Metadata
Status: **Unstable**
@@ -1,4 +1,4 @@
# Run Book: Nuxt 5XX request count above threshold
# Run Book: Nuxt 5XX responses count above threshold

```{admonition} Metadata
Status: **Unstable**
Expand Down
Original file line number Diff line number Diff line change
@@ -1,8 +1,10 @@
# Run Book: Nuxt Production Average Response Time above threshold
# Run Book: Nuxt Production P99 Response Time above threshold

```{admonition} Metadata
Status: **Unstable**
Status: **Disabled** until Nuxt request logging is added.
Maintainer: @obulat
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+P99+Response+Time+above+threshold?>
```
@@ -0,0 +1,31 @@
# Run Book: Nuxt Production P99 Response Time anomalously high

```{admonition} Metadata
Status: **Disabled** until Nuxt request logging is added.
Maintainer: @obulat
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+P99+Response+Time+anomalously+high>
```

## Severity Guide

Confirm that there is not a total outage of the service. If not, the severity is
likely low. Check for the request count and general network activity. If
abnormally high, refer to the [traffic analysis run book][traffic_runbook] to
identify and block any malicious traffic. If not, then check for a recent
deployment that may have introduced a problem, and [rollback][rollback_docs] to
the previous version if necessary.

[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
[rollback_docs]: /general/deployment.md#rollbacks

## Historical false positives

Nothing registered to date.

## Related incident reports

Nothing registered to date.
@@ -1,12 +1,12 @@
# Run Book: Nuxt request count above threshold
# Run Book: Nuxt Request Count anomalously high

```{admonition} Metadata
Status: **Unstable**
Maintainer: @dhruvkb
Alarm link:
- [production-nuxt](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+request+count+above+threshold)
- [production-nuxt](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+Request+Count+anomalously+high)
```

## Severity guide