Tolerate yellow cluster status for longer before sending an alert #63
Follow-up to #8.
We have received a notification for this alarm a few times recently. It seems pretty common for the cluster status to be yellow for ~20 minutes before automatically resolving (i.e. it becomes green again without any developers taking an action). This now happens frequently enough that developers (including me!) often ignore the email.
We seem to have tacitly accepted the risk of the cluster status being yellow for longer than 15 minutes [1], so this PR makes that risk acceptance explicit. This helps to reduce noise and increases the likelihood that we'll respond to more meaningful alarm notifications in the future.
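For reviewers who want the intended behaviour spelled out, here is a minimal Python sketch of "tolerate yellow for longer before alerting". It is not the alarm definition this PR changes: the function name `should_alert`, the sample format, and the 30-minute tolerance in the usage example are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def should_alert(samples, tolerance):
    """Return True if the cluster has been continuously yellow for longer than `tolerance`.

    `samples` is a chronological list of (timestamp, status) pairs (hypothetical format).
    Any non-yellow sample resets the yellow streak.
    """
    yellow_since = None
    for timestamp, status in samples:
        if status == "yellow":
            if yellow_since is None:
                yellow_since = timestamp  # start of the current yellow streak
        else:
            yellow_since = None  # streak broken, e.g. the cluster went green again
    if yellow_since is None:
        return False
    latest_timestamp = samples[-1][0]
    return latest_timestamp - yellow_since > tolerance


# Usage: a 20-minute yellow episode sampled every 5 minutes.
start = datetime(2024, 1, 1, 12, 0)
episode = [(start + timedelta(minutes=5 * i), "yellow") for i in range(5)]

alert_at_15 = should_alert(episode, timedelta(minutes=15))  # shorter tolerance
alert_at_30 = should_alert(episode, timedelta(minutes=30))  # illustrative longer tolerance
```

With the shorter tolerance, the 20-minute episode described above raises an alert; with a longer one, it resolves before anyone is notified.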
Footnotes
[1] There is no user impact associated with yellow status; it just means that we've lost redundancy (i.e. we might lose data if we lose another data node).