Reconfigure storage metrics/alarms to use available disk space #103

Merged
merged 1 commit into from
Apr 25, 2024


jacobwinch
Contributor

What does this change?

We've recently had some problems with nodes running out of disk space (see https://github.com/guardian/deploy-tools-platform/pull/754 for more details). Despite this, we haven't been receiving alarms related to disk space. I believe this is because we are using a different property to Elasticsearch/Cerebro to evaluate disk usage.

https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html

Currently we use:

free_in_bytes
(integer) Total number of unallocated bytes in all file stores.

This PR updates the code to use:

available_in_bytes
(integer) Total number of bytes available to this Java virtual machine on all file stores. Depending on OS or process level restrictions (e.g. XFS quotas), this might appear less than free_in_bytes. This is the actual amount of free disk space the Elasticsearch node can utilise.

I think that last sentence (emphasis mine) suggests that available_in_bytes is the more useful metric to monitor.
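As a rough illustration of the property switch described above, here is a minimal Python sketch that reads `fs.total.available_in_bytes` for each node from a parsed nodes stats response and derives a minimum and a sum. The `nodes` / `fs` / `total` nesting follows the Elasticsearch nodes stats API linked above. The function name and the sample values are hypothetical, and this is not the code changed in this PR.

```python
from typing import Any


def summarise_available_bytes(nodes_stats: dict[str, Any]) -> tuple[int, int]:
    """Return (minimum, sum) of fs.total.available_in_bytes across all nodes."""
    available = [
        node["fs"]["total"]["available_in_bytes"]
        for node in nodes_stats["nodes"].values()
    ]
    if not available:
        raise ValueError("nodes stats response contained no nodes")
    return min(available), sum(available)


# Hypothetical response fragment, shaped like the output of GET /_nodes/stats/fs.
sample = {
    "nodes": {
        "node-a": {"fs": {"total": {"free_in_bytes": 50_000, "available_in_bytes": 40_000}}},
        "node-b": {"fs": {"total": {"free_in_bytes": 30_000, "available_in_bytes": 10_000}}},
    }
}

minimum, total = summarise_available_bytes(sample)
print(f"min available: {minimum} bytes, sum available: {total} bytes")
```

The minimum is probably the more useful value to alarm on, since a single full node can cause problems even when the cluster-wide sum looks healthy.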

How to test

This service only has a PROD environment, so I will double check that all of the metrics look correct after merging.

How can we measure success?

We should get a more timely alarm if nodes are running out of disk space.

Have we considered potential risks?

We will stop pushing new data points for the MinFreeDiskSpace and SumFreeDiskSpace metrics. I'm pretty sure that the alarm (also reconfigured in this PR) is the only thing looking at these metrics, but there is a small risk that I've forgotten something!

@akash1810
Member

> I'm pretty sure that the alarm (also reconfigured in this PR) is the only thing looking at these metrics, but there is a small risk that I've forgotten something!

Looks like this service is deployed only into the DevX account, so impact is minimal 🎉. Wonder if other teams would find this service helpful?

@jacobwinch jacobwinch merged commit ad50f6d into main Apr 25, 2024
1 check passed
@jacobwinch jacobwinch deleted the jw-available-storage branch April 25, 2024 11:11
@jacobwinch
Contributor Author

> Wonder if other teams would find this service helpful?

Yes, it feels generic enough to be useful for other Elasticsearch users - I'm not sure how other teams are keeping track of these metrics at the moment.

@jacobwinch
Contributor Author

> This service only has a PROD environment, so I will double check that all of the metrics look correct after merging.

The new alarm configuration looks OK:

image

And the metrics are coming through:

This side-by-side comparison helps to demonstrate that things are a bit worse than previously thought!

image
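For anyone wanting to reproduce a comparison like this without the dashboard, the gap between the two properties can be computed per node from a nodes stats response. This is a minimal sketch only: the function name and the sample numbers are made up for illustration and are not taken from the real cluster.

```python
from typing import Any


def free_minus_available(nodes_stats: dict[str, Any]) -> dict[str, int]:
    """Map each node id to free_in_bytes - available_in_bytes."""
    gaps: dict[str, int] = {}
    for node_id, node in nodes_stats["nodes"].items():
        totals = node["fs"]["total"]
        gaps[node_id] = totals["free_in_bytes"] - totals["available_in_bytes"]
    return gaps


# Hypothetical response fragment, shaped like the output of GET /_nodes/stats/fs.
sample = {
    "nodes": {
        "node-a": {"fs": {"total": {"free_in_bytes": 50_000, "available_in_bytes": 40_000}}},
        "node-b": {"fs": {"total": {"free_in_bytes": 30_000, "available_in_bytes": 10_000}}},
    }
}

for node_id, gap in sorted(free_minus_available(sample).items()):
    print(f"{node_id}: free_in_bytes overstates usable space by {gap} bytes")
```

A positive gap means that alarms based on free_in_bytes would fire later than alarms based on available_in_bytes for the same threshold.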
