Add runbooks for API Thumbnails 2XX/5XX responses and Request Count alarms (#3076)
krysal authored Oct 2, 2023
1 parent b94e60f commit 441ce0f
Showing 4 changed files with 104 additions and 0 deletions.
@@ -0,0 +1,36 @@
# Run Book: API Thumbnails Production HTTP 2XX responses count under threshold

```{admonition} Metadata
Status: **Unstable**
Maintainer: @krysaldb
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+HTTP+2XX+responses+count+under+threshold?>
```

## Severity Guide

After confirming there is not a total outage, check whether the overall request
count has decreased as well (go to the [CloudWatch dashboard][cloudwatch] or,
alternatively, check in Cloudflare). If overall requests are lower, the severity
is low, and you should continue searching for the cause of the general decrease.
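
For a quick look without opening the console, the same request count can be
pulled with `boto3`. This is a minimal sketch, not the exact query behind the
alarm; the namespace and the load balancer dimension are placeholders to
replace with the values used in the dashboard:

```python
from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials with CloudWatch read access

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",  # placeholder: use the namespace from the dashboard
    MetricName="RequestCount",
    # Placeholder dimension; copy the real LoadBalancer value from the console.
    Dimensions=[{"Name": "LoadBalancer", "Value": "<production-api-load-balancer>"}],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Sum"],
)

# Print the request count per bucket in chronological order.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```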

If the drop is only in 2XX responses, the severity is likely high, so also
check the dashboard for other anomalies. Verify whether any of the thumbnail
providers are experiencing an outage or are rate-limiting Openverse. Go to the
[API logs][api_logs] to check for errors or other data that yield clues.

[cloudwatch]:
https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards/dashboard/ECS-Production-Dashboard
[api_logs]:
https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fecs$252Fproduction$252Fapi
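
To scan the same logs from a terminal, here is a sketch using `boto3` against
the log group linked above; the `ERROR` filter pattern is an assumption about
the log format and should be adjusted to what the API actually emits:

```python
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs", region_name="us-east-1")

now = datetime.now(timezone.utc)
response = logs.filter_log_events(
    logGroupName="/ecs/production/api",
    filterPattern="ERROR",  # assumed pattern; adjust to the actual log format
    # filter_log_events expects epoch milliseconds.
    startTime=int((now - timedelta(hours=1)).timestamp() * 1000),
    endTime=int(now.timestamp() * 1000),
    limit=50,
)

for event in response["events"]:
    print(event["message"])
```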

## Historical false positives

Nothing registered to date.

## Related incident reports

Nothing registered to date.
@@ -0,0 +1,31 @@
# Run Book: API Thumbnails Production HTTP 5XX responses count above threshold

```{admonition} Metadata
Status: **Unstable**
Maintainer: @krysaldb
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+HTTP+5XX+responses+count+above+threshold?>
```

## Severity Guide

After confirming there is not a total outage, check whether the increase in 5XX
HTTP errors coincides with a period where resources are expected to be
constrained, such as a recent deployment, a data refresh, or DB maintenance. If
the spike is related to one of these events and the alarm stabilizes in a short
time, the severity is low.
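
One way to check whether a deployment or unstable tasks line up with the spike
is to inspect the ECS service's recent deployments and events. A sketch with
`boto3`; the cluster and service names are placeholders, not the real
production identifiers:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Cluster and service names are placeholders; use the real production values.
service = ecs.describe_services(
    cluster="<production-cluster>",
    services=["<api-service>"],
)["services"][0]

# Each deployment entry shows when a new task definition started rolling out.
for deployment in service["deployments"]:
    print(deployment["status"], deployment["taskDefinition"], deployment["createdAt"])

# Recent service events often explain unstable tasks (e.g., failed health checks).
for event in service["events"][:10]:
    print(event["createdAt"], event["message"])
```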

If the issue is not related to known recurrent events and persists, the
severity is critical. Check whether the dependent services (DB, Redis,
Elasticsearch) are available to the API, or whether the problem is intrinsic to
the API itself.
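
A rough sketch of such availability checks, runnable from a host with network
access to the services; every hostname and credential below is a placeholder,
and the real endpoints live in the infrastructure configuration:

```python
import psycopg2  # pip install psycopg2-binary
import redis
import requests

# All connection details below are placeholders, not real production endpoints.

# DB: a trivial query proves the database accepts connections.
conn = psycopg2.connect(
    host="<db-host>", dbname="<dbname>", user="<user>", password="<password>"
)
with conn.cursor() as cur:
    cur.execute("SELECT 1;")
    print("DB reachable:", cur.fetchone() == (1,))
conn.close()

# Redis: PING returns True when the server is up.
print("Redis reachable:", redis.Redis(host="<redis-host>", port=6379).ping())

# Elasticsearch: cluster health should be green or yellow.
health = requests.get("http://<es-host>:9200/_cluster/health", timeout=5).json()
print("Elasticsearch status:", health["status"])
```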

## Historical false positives

Nothing registered to date.

## Related incident reports

- 2023-09-27 from 16:15 to 18:00 UTC: 5XX responses spiked to between 983 and
  1434 due to unstable tasks.
@@ -0,0 +1,34 @@
# Run Book: API Thumbnails Production Request Count anomalously high

```{admonition} Metadata
Status: **Unstable**
Maintainer: @krysaldb
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Request+Count+anomalously+high>
```

## Severity Guide

When a sudden increase in request count is noticed, verify that the services
are supporting the load by looking at metrics such as response time or ES CPU
usage. If the API is handling the traffic well, the severity is low, and the
event may only call for future resource scaling, depending on the kind of
traffic.
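
For the ES CPU part of that check, Elasticsearch's `_cat/nodes` API reports
per-node CPU directly. A sketch, with the cluster URL as a placeholder:

```python
import requests

# Placeholder URL; use the real endpoint of the production Elasticsearch cluster.
resp = requests.get(
    "http://<es-host>:9200/_cat/nodes",
    params={"v": "true", "h": "name,cpu,load_1m,heap.percent"},
    timeout=5,
)
# One row per node; sustained CPU near 100 means the cluster is strained.
print(resp.text)
```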

If the services are strained, the severity is critical; search for the root
cause to prevent a more serious outage. If there is no obvious recent
integration (like the Gutenberg plugin), follow the run book to [identify
traffic anomalies in Cloudflare][traffic_runbook] to determine whether the
recent traffic is organic or comes from a botnet. Find the origin of the
requests and evaluate whether it needs to be blocked or whether Openverse
services need to adapt to the new demand.

[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
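
As a complement to the Cloudflare run book, a CloudWatch Logs Insights query
over the API log group can show whether the traffic concentrates on a few
clients. A sketch with `boto3`; the `user_agent` field name is an assumption
about the log schema and should be adjusted to the actual field:

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs", region_name="us-east-1")

now = datetime.now(timezone.utc)
query_id = logs.start_query(
    logGroupName="/ecs/production/api",
    # start_query expects epoch seconds.
    startTime=int((now - timedelta(hours=1)).timestamp()),
    endTime=int(now.timestamp()),
    # Field name is an assumption; adjust to the actual log schema.
    queryString="stats count(*) as hits by user_agent | sort hits desc | limit 10",
)["queryId"]

# Logs Insights queries run asynchronously; poll until the query completes.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] == "Complete":
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```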

## Historical false positives

Nothing registered to date.

## Related incident reports

Nothing registered to date.
3 changes: 3 additions & 0 deletions documentation/meta/monitoring/runbooks/index.md
@@ -19,6 +19,9 @@ api_avg_response_time_above_threshold
api_avg_response_time_anomaly
api_p99_response_time_above_threshold
api_p99_response_time_anomaly
api_thumbnails_http_2xx_under_threshold
api_thumbnails_http_5xx_above_threshold
api_thumbnails_request_count_anomaly
api_thumbnails_avg_response_time_above_threshold
api_thumbnails_avg_response_time_anomaly
api_thumbnails_p99_response_time_above_threshold