Tolerate yellow cluster status for longer before sending an alert #63

jacobwinch · 2023-09-04T13:31:11Z

Follow-up to #8.

We have received a notification for this alarm a few times recently. It seems pretty common for the cluster status to be yellow for ~20 minutes before resolving automatically (i.e. it becomes green again without any developer taking action). This now happens frequently enough that developers (including me!) often ignore the email.

We seem to have tacitly accepted the risk of the cluster status being yellow for longer than 15 minutes¹, so this PR makes that risk acceptance explicit. This should reduce noise and increase the likelihood that we'll respond to more meaningful alarm notifications in the future.

¹ There is no user impact associated with yellow status; it just means that we've lost redundancy (i.e. we might lose data if we lose another data node).
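
To illustrate the idea of tolerating yellow for longer, here is a minimal sketch in Python. It is not the code changed in this PR: the function name `should_alert`, the one-sample-per-minute assumption, and the 30-minute tolerance are all hypothetical and chosen only for illustration (the description above does not state the new threshold).

```python
from typing import Sequence


def should_alert(statuses: Sequence[str], tolerance_periods: int) -> bool:
    """Return True if the most recent `tolerance_periods` samples are all yellow.

    Hypothetical helper: `statuses` is assumed to hold one cluster status
    sample per evaluation period (e.g. one per minute), oldest first.
    """
    if tolerance_periods <= 0:
        raise ValueError("tolerance_periods must be positive")
    if len(statuses) < tolerance_periods:
        # Not enough history yet to say the status has been yellow long enough.
        return False
    return all(status == "yellow" for status in statuses[-tolerance_periods:])


# Assumed scenario: the cluster has been yellow for 20 consecutive minutes.
yellow_for_20_minutes = ["green"] * 5 + ["yellow"] * 20

# A 15-minute tolerance alerts on this episode; a 30-minute tolerance does not.
alerts_at_15 = should_alert(yellow_for_20_minutes, tolerance_periods=15)
alerts_at_30 = should_alert(yellow_for_20_minutes, tolerance_periods=30)
```

Under these assumptions, a ~20 minute yellow episode that resolves on its own trips the shorter tolerance but stays silent under the longer one, which is the noise reduction described above.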

jorgeazevedo

Thanks 🙏

Tolerate yellow cluster status for longer before sending an alert

326e0fe

jacobwinch requested review from a team September 4, 2023 13:36

jorgeazevedo approved these changes Sep 5, 2023

jacobwinch merged commit 95638f3 into main Sep 6, 2023
1 check passed

jacobwinch deleted the jw-yellow-tolerance branch September 6, 2023 07:43
