[opensearch] Disk usage alarm can trigger for managed OpenSearch domain due to missing data points #602

Open
condrayr opened this issue Nov 19, 2024 · 0 comments
Labels
bug Something isn't working

Version

8.3.3

Steps and/or minimal code example to reproduce

  1. Have a managed OpenSearch domain that sporadically fails to report the FreeStorageSpace metric. According to AWS Support, this can occur during an update to the internal agent responsible for publishing the domain's CloudWatch metrics.
  2. Notice that the alarm triggers because of the missing data point, even though the domain's actual disk usage is below the threshold. The alarm's metric math, 100 * (used/(used+free)), becomes 100 * (used/(used+0)) = 100 * (used/used) = 100 when the free value is missing.
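The arithmetic in step 2 can be reproduced with a small sketch. This is not the library's code: `disk_usage_percent` and `THRESHOLD_PERCENT` are hypothetical names, the storage values are made up, and the substitution of 0 for a missing data point is the behavior described in this issue rather than something verified here.

```python
from typing import Optional

# Hypothetical alarm threshold, chosen only for illustration.
THRESHOLD_PERCENT = 80.0


def disk_usage_percent(used: float, free: Optional[float]) -> float:
    """Mirror the alarm's metric math: 100 * (used / (used + free))."""
    # Assumption taken from the issue: a missing FreeStorageSpace data point
    # is effectively evaluated as 0 in the expression.
    free_value = 0.0 if free is None else free
    return 100 * (used / (used + free_value))


# Made-up values: 40 units used, 60 units free.
normal = disk_usage_percent(used=40.0, free=60.0)
# Same usage, but the free-space data point is missing for this period.
missing = disk_usage_percent(used=40.0, free=None)

print(f"normal:  {normal:.1f}% breaching={normal > THRESHOLD_PERCENT}")
print(f"missing: {missing:.1f}% breaching={missing > THRESHOLD_PERCENT}")
```

Under these assumptions, any non-zero used value yields 100 when the free value is missing, so the expression breaches regardless of the configured threshold.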

Expected behavior

The alarm should not trigger, as the domain is not actually exceeding the disk usage threshold.

Actual behavior

The alarm triggers.

Other details

AWS Support confirmed that the FreeStorageSpace metric can sporadically miss data points when the internal agent responsible for publishing the domain's CloudWatch metrics is updated. Their response:

I examined the domain and could observe that the missing data point was because of an update to the Internal agent responsible for publishing the CloudWatch metrics for the domain and because of which the FreeStorageMetrics was not reported for 18:51, 18:52 UTC. Further, nothing can be done from your end to avoid this missing data point, however in case in future if you observe any scenario where the metric miss the data point for a extended duration, please do let us know and we can look into this further.

As a workaround, the number of data points to alarm on can be increased: in the two occurrences we've seen, only one or two data points were missing at a time.
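To illustrate why that workaround helps, here is a simplified model of an "M out of N data points" alarm. It is only a sketch: `alarm_fires`, the sample series, and the threshold of 80 are hypothetical, and real CloudWatch alarm evaluation has more nuance than this sliding-window check.

```python
from typing import Sequence


def alarm_fires(
    values: Sequence[float],
    threshold: float,
    datapoints_to_alarm: int,
    evaluation_periods: int,
) -> bool:
    """Return True if any window of `evaluation_periods` consecutive values
    contains at least `datapoints_to_alarm` values above `threshold`."""
    for end in range(evaluation_periods, len(values) + 1):
        window = values[end - evaluation_periods:end]
        breaching = sum(1 for v in window if v > threshold)
        if breaching >= datapoints_to_alarm:
            return True
    return False


# Made-up usage series: two periods evaluate to 100 because the
# free-space data point was missing, as described in this issue.
usage_percent = [40.0, 40.0, 100.0, 100.0, 40.0, 40.0]

# Alarming on a single data point fires on the spurious spike.
print(alarm_fires(usage_percent, 80.0, datapoints_to_alarm=1, evaluation_periods=1))
# Requiring three breaching data points rides out a two-point gap.
print(alarm_fires(usage_percent, 80.0, datapoints_to_alarm=3, evaluation_periods=3))
```

The trade-off is slower detection of a genuine breach, since a real disk usage problem now has to persist for more periods before the alarm fires.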

It may also be worth considering adding alarms for continued missing metrics.
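One way to think about such an alarm is as a check on the longest run of missing data points. The sketch below is a standalone illustration, not an existing feature of the library: `longest_missing_run`, `MAX_MISSING_PERIODS`, and the sample series are hypothetical, with `None` standing in for a period in which FreeStorageSpace was not reported.

```python
from typing import Optional, Sequence

# Hypothetical tolerance: alert only if the metric is absent this many
# periods in a row.
MAX_MISSING_PERIODS = 5


def longest_missing_run(values: Sequence[Optional[float]]) -> int:
    """Return the length of the longest run of consecutive missing values."""
    longest = current = 0
    for value in values:
        current = current + 1 if value is None else 0
        longest = max(longest, current)
    return longest


# A short gap like the one in this issue (two missing data points).
short_gap = [60.0, None, None, 60.0, 60.0]
# A hypothetical extended outage of the metric.
long_gap = [60.0] + [None] * 6 + [60.0]

for name, series in (("short_gap", short_gap), ("long_gap", long_gap)):
    run = longest_missing_run(series)
    print(name, run, "ALERT" if run >= MAX_MISSING_PERIODS else "ok")
```

Keeping this separate from the disk usage alarm lets a brief agent update pass silently while an extended reporting gap still gets surfaced.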

condrayr added the bug label Nov 19, 2024
echeung-amzn changed the title from "Disk usage alarm can trigger for managed OpenSearch domain due to missing data points" to "[opensearch] Disk usage alarm can trigger for managed OpenSearch domain due to missing data points" Dec 6, 2024