Reconfigure storage metrics/alarms to use available disk space #103

jacobwinch · 2024-04-25T10:16:19Z

What does this change?

We've recently had some problems with nodes running out of disk space (see https://github.com/guardian/deploy-tools-platform/pull/754 for more details). Despite this, we haven't been receiving alarms related to disk space. I believe this is because we are using a different property to Elasticsearch/Cerebro to evaluate disk usage.

https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html

Currently we use:

free_in_bytes
(integer) Total number of unallocated bytes in all file stores.

This PR updates the code to use:

available_in_bytes
(integer) Total number of bytes available to this Java virtual machine on all file stores. Depending on OS or process level restrictions (e.g. XFS quotas), this might appear less than free_in_bytes. This is the actual amount of free disk space the Elasticsearch node can utilise.

I think that last sentence (emphasis mine) suggests that available_in_bytes is the more useful metric to monitor.
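To make the difference concrete, here is a minimal Python sketch (not the code changed in this PR) that reads the fs.total section of a nodes stats response and computes the minimum and sum across nodes for either field. The fs.total.free_in_bytes and fs.total.available_in_bytes paths follow the Elasticsearch docs linked above; the function name, node IDs and byte values are hypothetical and purely illustrative.

```python
from typing import Any


def disk_space_summary(
    nodes_stats: dict[str, Any], field: str = "available_in_bytes"
) -> dict[str, int]:
    """Return the minimum and the sum of `field` across all nodes.

    `nodes_stats` is assumed to have the shape of a nodes stats
    response: {"nodes": {<node_id>: {"fs": {"total": {...}}}}}.
    """
    values = [
        node["fs"]["total"][field] for node in nodes_stats["nodes"].values()
    ]
    if not values:
        raise ValueError("nodes stats response contains no nodes")
    return {"min": min(values), "sum": sum(values)}


# Hypothetical sample data: node IDs and byte counts are made up.
SAMPLE = {
    "nodes": {
        "node-a": {
            "fs": {
                "total": {
                    "total_in_bytes": 1000,
                    "free_in_bytes": 300,
                    "available_in_bytes": 250,
                }
            }
        },
        "node-b": {
            "fs": {
                "total": {
                    "total_in_bytes": 1000,
                    "free_in_bytes": 120,
                    "available_in_bytes": 60,
                }
            }
        },
    }
}

available = disk_space_summary(SAMPLE)
free = disk_space_summary(SAMPLE, field="free_in_bytes")
print("available:", available)
print("free:", free)
```

In this made-up sample the minimum available space is half the minimum free space, so a threshold evaluated against free_in_bytes could stay quiet while a node is already close to full.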

How to test

This service only has a PROD environment, so I will double check that all of the metrics look correct after merging.

How can we measure success?

We should get a more timely alarm if nodes are running out of disk space.

Have we considered potential risks?

We will stop pushing new data points for the MinFreeDiskSpace and SumFreeDiskSpace metrics. I'm pretty sure that the alarm (also reconfigured in this PR) is the only thing looking at these metrics, but there is a small risk that I've forgotten something!

akash1810 · 2024-04-25T11:00:45Z

I'm pretty sure that the alarm (also reconfigured in this PR) is the only thing looking at these metrics, but there is a small risk that I've forgotten something!

Looks like this service is deployed only into the DevX account, so impact is minimal 🎉. Wonder if other teams would find this service helpful?

jacobwinch · 2024-04-25T11:12:56Z

Wonder if other teams would find this service helpful?

Yes, it feels generic enough to be useful for other Elasticsearch users - I'm not sure how other teams are keeping track of these metrics at the moment.

jacobwinch · 2024-04-25T11:23:51Z

This service only has a PROD environment, so I will double check that all of the metrics look correct after merging.

The new alarm configuration looks OK:

And the metrics are coming through:

This side-by-side comparison helps to demonstrate that things are a bit worse than previously thought!

Reconfigure storage metrics/alarms to use available disk space

ae02056

jacobwinch requested a review from michaelwmcnamara April 25, 2024 10:52

akash1810 approved these changes Apr 25, 2024

jacobwinch merged commit ad50f6d into main Apr 25, 2024
1 check passed

jacobwinch deleted the jw-available-storage branch April 25, 2024 11:11

jacobwinch mentioned this pull request May 7, 2024

Use 20% threshold for low disk space alarm #107

Merged
