Add or update runbooks for alarms of anomalous behavior #3420

Merged 7 commits on Dec 6, 2023
@@ -1,12 +1,12 @@
-# Run Book: API Production Request Count above threshold
+# Run Book: API Production Request Count anomalously high

```{admonition} Metadata
Status: **Unstable**

Maintainer: @krysaldb

Alarm link:
-- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Production+Request+Count+above+threshold?>
+- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Production+Request+Count+anomalously+high?>
```
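
The alarm links in these run books follow one pattern: the alarm name, with spaces encoded as `+`, appended to the CloudWatch console base URL. A minimal Python sketch of that pattern, assuming the `us-east-1` region used in the links here. `alarm_console_url` is a hypothetical helper, not part of the Openverse repository, and alarm names containing characters other than letters, digits, and spaces may need different encoding.

```python
from urllib.parse import quote_plus

# Base URL copied from the alarm links in this diff (region is hard-coded).
CONSOLE_BASE = (
    "https://us-east-1.console.aws.amazon.com/cloudwatch/home"
    "?region=us-east-1#alarmsV2:alarm/"
)


def alarm_console_url(alarm_name: str) -> str:
    """Return a console link for an alarm, encoding spaces as '+'."""
    return CONSOLE_BASE + quote_plus(alarm_name)


print(alarm_console_url("API Production Request Count anomalously high"))
```

Generating the link from the alarm name may help keep the run book title and its alarm link in step when an alarm is renamed.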

## Severity Guide
@@ -19,10 +19,10 @@ future resource scaling depending on the kind of traffic.
If the services are strained then the severity is critical; search for the root
cause to prevent more serious outages. If there are no recent obvious
integrations (like the Gutenberg plugin) then follow the run book to [identify
-traffic anomalies in Cloudflare][runbook_traffic], to determine whether the
-recent traffic is organic or if it comes from a botnet. Find the origin of
-requests and evaluate whether it needs to be blocked or if Openverse services
-need to adapt to the new demand.
+traffic anomalies][runbook_traffic], to determine whether the recent traffic is
+organic or if it comes from a botnet. Find the origin of requests and evaluate
+whether it needs to be blocked or if Openverse services need to adapt to the new
+demand.

[runbook_traffic]:
https://docs.openverse.org/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.html
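
Finding the origin of requests can start with a simple tally over whatever request records are available. The sketch below is illustrative only: the record shape and the `client_ip` field are assumptions, not the format of Openverse or Cloudflare logs, and `top_origins` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def top_origins(
    records: Iterable[dict], key: str = "client_ip", n: int = 3
) -> list[tuple[str, int]]:
    """Count requests per origin and return the n most frequent origins."""
    counts = Counter(r[key] for r in records if key in r)
    return counts.most_common(n)


# Hypothetical sample records; real logs will have a different shape.
sample = [
    {"client_ip": "203.0.113.7", "path": "/v1/images/"},
    {"client_ip": "203.0.113.7", "path": "/v1/images/"},
    {"client_ip": "203.0.113.7", "path": "/v1/audio/"},
    {"client_ip": "198.51.100.2", "path": "/v1/images/"},
    {"client_ip": "198.51.100.2", "path": "/v1/images/"},
    {"client_ip": "192.0.2.44", "path": "/v1/images/"},
]

for origin, count in top_origins(sample, n=2):
    print(origin, count)
```

A single origin dominating the tally suggests a candidate for blocking, while a broad spread of origins is more consistent with organic demand; either reading still needs confirmation through the linked traffic run book.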
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**

Maintainer: @stacimc

Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Average+Response+Time+above+threshold>
```
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**

Maintainer: @stacimc

Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Average+Response+Time+anomalously+high>
```
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**

Maintainer: @stacimc

Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+P99+Response+Time+above+threshold>
```
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**

Maintainer: @stacimc

Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+P99+Response+Time+anomalously+high>
```
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**

Maintainer: @krysaldb

Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Request+Count+anomalously+high>
```
10 changes: 6 additions & 4 deletions documentation/meta/monitoring/runbooks/index.md
@@ -12,24 +12,26 @@ that can be a good resource when writing a new one.
```{toctree}
:titlesonly:

api_request_count_above_threshold
api_http_2xx_under_threshold
api_http_5xx_above_threshold
api_avg_response_time_above_threshold
api_avg_response_time_anomaly
api_p99_response_time_above_threshold
api_p99_response_time_anomaly
api_request_count_anomaly
api_thumbnails_http_2xx_under_threshold
api_thumbnails_http_5xx_above_threshold
api_thumbnails_request_count_anomaly
api_thumbnails_avg_response_time_above_threshold
api_thumbnails_avg_response_time_anomaly
api_thumbnails_p99_response_time_above_threshold
api_thumbnails_p99_response_time_anomaly
nuxt_request_count
nuxt_2xx_under_threshold
nuxt_5xx_above_threshold
nuxt_http_2xx_under_threshold
nuxt_http_5xx_above_threshold
nuxt_avg_response_time_above_threshold
nuxt_avg_response_time_anomaly
nuxt_p99_response_time_above_threshold
nuxt_p99_response_time_anomaly
nuxt_request_count_anomaly
unhealthy_ecs_hosts
```
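
Because this change renames several run books, the index entries and the file names need to stay in sync. A rough Python sketch of such a check follows; it assumes each toctree entry is a document name without the `.md` extension, and `toctree_entries` and `missing_documents` are hypothetical helpers rather than anything provided by Sphinx or the repository.

```python
def toctree_entries(body: str) -> list[str]:
    """Extract document names from the body of a toctree directive."""
    entries = []
    for line in body.splitlines():
        line = line.strip()
        # Skip blank lines, directive options (":titlesonly:"), and fence lines.
        if not line or line.startswith(":") or line.startswith("`"):
            continue
        entries.append(line)
    return entries


def missing_documents(body: str, existing: set[str]) -> list[str]:
    """Return entries that have no matching document name in `existing`."""
    return [entry for entry in toctree_entries(body) if entry not in existing]


# Hypothetical example: one entry still uses a pre-rename name.
body = """
:titlesonly:

api_request_count_anomaly
nuxt_2xx_under_threshold
"""
existing = {"api_request_count_anomaly", "nuxt_http_2xx_under_threshold"}
print(missing_documents(body, existing))
```

In practice the `existing` set could be built from the `.md` file names in the run books directory.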
@@ -2,7 +2,9 @@

```{admonition} Metadata
Status: **Unstable**

Maintainer: @obulat

Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+Average+Response+Time+above+threshold?>
```
@@ -0,0 +1,31 @@
# Run Book: Nuxt Production Average Response Time anomalously high

```{admonition} Metadata
Status: **Unstable**

Maintainer: @obulat

Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+Average+Response+Time+anomalously+high?>
```

## Severity Guide

Confirm that there is not a total outage of the service. If not, the severity is
likely low. Check the request count and general network activity. If
abnormally high, refer to the [traffic analysis run book][traffic_runbook] to
identify and block any malicious traffic. If not, then check for a recent
deployment that may have introduced a problem, and [roll back][rollback_docs] to
the previous version if necessary.

[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
[rollback_docs]: /general/deployment.md#rollbacks

## Historical false positives

Nothing registered to date.

## Related incident reports

Nothing registered to date.
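
The severity guide in this run book reads as a short decision sequence, which can be sketched in Python as below. This is an illustration under stated assumptions, not an automated procedure: the `Observation` fields and the `triage` function are hypothetical, and the outcomes for a total outage and for the case where no cause is found are not spelled out in the run book.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What the responder has observed so far (all fields are hypothetical)."""

    total_outage: bool
    request_count_abnormal: bool
    recent_deployment: bool


def triage(obs: Observation) -> str:
    """Map observations to the next step named in the severity guide."""
    if obs.total_outage:
        # Assumption: the run book only says severity is likely low
        # when there is no total outage.
        return "total outage: do not treat as low severity"
    if obs.request_count_abnormal:
        return "follow the traffic analysis run book"
    if obs.recent_deployment:
        return "review the deployment and roll back if necessary"
    # Assumption: the run book names no further step for this case.
    return "no cause identified: keep investigating"


print(triage(Observation(False, True, False)))
```

The checks are ordered as in the prose: outage first, then traffic, then deployments, so abnormal traffic is investigated before a rollback is considered.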
@@ -1,4 +1,4 @@
-# Run Book: Nuxt 2XX request count under threshold
+# Run Book: Nuxt 2XX responses count under threshold

```{admonition} Metadata
Status: **Unstable**
@@ -1,4 +1,4 @@
-# Run Book: Nuxt 5XX request count above threshold
+# Run Book: Nuxt 5XX responses count above threshold

```{admonition} Metadata
Status: **Unstable**
@@ -1,8 +1,10 @@
-# Run Book: Nuxt Production Average Response Time above threshold
+# Run Book: Nuxt Production P99 Response Time above threshold

```{admonition} Metadata
-Status: **Unstable**
+Status: **Disabled** until Nuxt request logging is added.

Maintainer: @obulat

Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+P99+Response+Time+above+threshold?>
```
@@ -0,0 +1,31 @@
# Run Book: Nuxt Production P99 Response Time anomalously high

```{admonition} Metadata
Status: **Disabled** until Nuxt request logging is added.

Maintainer: @obulat

Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+P99+Response+Time+anomalously+high>
```

## Severity Guide

Confirm that there is not a total outage of the service. If not, the severity is
likely low. Check the request count and general network activity. If
abnormally high, refer to the [traffic analysis run book][traffic_runbook] to
identify and block any malicious traffic. If not, then check for a recent
deployment that may have introduced a problem, and [roll back][rollback_docs] to
the previous version if necessary.

[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
[rollback_docs]: /general/deployment.md#rollbacks

## Historical false positives

Nothing registered to date.

## Related incident reports

Nothing registered to date.
@@ -1,12 +1,12 @@
-# Run Book: Nuxt request count above threshold
+# Run Book: Nuxt Request Count anomalously high

```{admonition} Metadata
Status: **Unstable**

Maintainer: @dhruvkb

Alarm link:
-- [production-nuxt](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+request+count+above+threshold)
+- [production-nuxt](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+Request+Count+anomalously+high)
```

## Severity guide