Reconfigure storage metrics/alarms to use available disk space #103
What does this change?
We've recently had some problems with nodes running out of disk space (see https://github.com/guardian/deploy-tools-platform/pull/754 for more details). Despite this, we haven't been receiving alarms related to disk space. I believe this is because we are using a different property to Elasticsearch/Cerebro to evaluate disk usage.
https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
Currently we use:
This PR updates the code to use:
I think that last sentence (emphasis mine) suggests that `available_in_bytes` is the more useful metric to monitor.
How to test
This service only has a `PROD` environment, so I will double-check that all of the metrics look correct after merging.
How can we measure success?
We should get a more timely alarm if nodes are running out of disk space.
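
As a rough illustration of the idea (not the code in this PR), the sketch below derives a minimum and a sum of `available_in_bytes` from a response shaped like the one returned by `GET /_nodes/stats/fs`. The function name, the result keys and the sample numbers are made up for the example.

```python
from typing import Any


def disk_space_summary(nodes_stats: dict[str, Any]) -> dict[str, int]:
    """Return the minimum and total available bytes across all nodes.

    Reads fs.total.available_in_bytes for each node. Raises ValueError
    if the response contains no nodes.
    """
    available = [
        node["fs"]["total"]["available_in_bytes"]
        for node in nodes_stats["nodes"].values()
    ]
    return {"min_available": min(available), "sum_available": sum(available)}


# Made-up response fragment; node IDs and byte counts are illustrative only.
example = {
    "nodes": {
        "node-a": {"fs": {"total": {"free_in_bytes": 50_000, "available_in_bytes": 40_000}}},
        "node-b": {"fs": {"total": {"free_in_bytes": 30_000, "available_in_bytes": 10_000}}},
    }
}

summary = disk_space_summary(example)
```

An alarm on the minimum value reacts to the single fullest node, whereas the sum can hide one node running low while the others have plenty of space.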
Have we considered potential risks?
We will stop pushing new data points for the `MinFreeDiskSpace` and `SumFreeDiskSpace` metrics. I'm pretty sure that the alarm (also reconfigured in this PR) is the only thing looking at these metrics, but there is a small risk that I've forgotten something!