UnitsAvailable alerts are firing constantly #564

facundofc · 2023-03-27T16:48:04Z

On a recent deployment we're seeing these alerts firing all the time (literally, stuck to "firing"):

ArgoUnitIsUnavailable
DexAuthUnitIsUnavailable
JupyterControllerUnitIsUnavailable
MetacontrollerUnitIsUnavailable
MinioUnitIsUnavailable
TrainingOperatorUnitIsUnavailable

Looking at the up metric (which these alert rules query), we see that these are alternating between 1 and 0 every 45 seconds (this is a sample from the argo controller, query being: up{juju_application="argo-controller",juju_..."="..."}[10m]):

1 @1679925736.77
0 @1679925781.26
1 @1679925796.77
0 @1679925841.26
1 @1679925856.77
0 @1679925901.26
1 @1679925916.77
0 @1679925961.26
1 @1679925976.77
0 @1679926021.26
1 @1679926036.77
0 @1679926081.26
1 @1679926096.77
0 @1679926141.26
1 @1679926156.77
0 @1679926201.26
1 @1679926216.77
0 @1679926261.26
1 @1679926276.77
0 @1679926321.26
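
The spacing of the samples above can be checked with a short script. This is only a sketch over the twenty samples pasted in this issue; the helper name `gaps_by_value` is made up for illustration. It groups the samples by value and measures the gap between consecutive samples that share a value.

```python
# (value, unix timestamp) pairs copied from the argo-controller `up` samples above.
SAMPLES = [
    (1, 1679925736.77), (0, 1679925781.26),
    (1, 1679925796.77), (0, 1679925841.26),
    (1, 1679925856.77), (0, 1679925901.26),
    (1, 1679925916.77), (0, 1679925961.26),
    (1, 1679925976.77), (0, 1679926021.26),
    (1, 1679926036.77), (0, 1679926081.26),
    (1, 1679926096.77), (0, 1679926141.26),
    (1, 1679926156.77), (0, 1679926201.26),
    (1, 1679926216.77), (0, 1679926261.26),
    (1, 1679926276.77), (0, 1679926321.26),
]


def gaps_by_value(samples):
    """Return {value: sorted distinct gaps in seconds between same-valued samples}."""
    gaps = {}
    last_seen = {}
    for value, ts in samples:
        if value in last_seen:
            # Round to centiseconds to hide floating-point noise.
            gaps.setdefault(value, set()).add(round(ts - last_seen[value], 2))
        last_seen[value] = ts
    return {value: sorted(found) for value, found in gaps.items()}


print(gaps_by_value(SAMPLES))
```

For these samples, the 1 values and the 0 values each arrive on their own 60-second cadence, offset from each other by roughly 44.5 s. That pattern would be consistent with two separate scrapes writing to the same series, though nothing in this issue confirms that for argo-controller.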

Separately from this flapping behavior, the duration for these alerts (at least for argo) is set to 0m, which seems too sensitive for production environments.

As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Closes-Bug: canonical/bundle-kubeflow#564

As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564

As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Closes-Bug: canonical/bundle-kubeflow#564 Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>

As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564 Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>

i-chvets · 2023-05-12T12:50:03Z

Fix is merged.

facundofc · 2023-07-31T19:28:28Z

@i-chvets, the changes pushed by @dparv (changing the for: from 0 to 5m) are not a fix for this issue. I believe this should be reopened as the metrics are flapping or directly stuck to 0 (as in the dex-auth case). That needs to be addressed or pointed out here where it was addressed.

Thanks!
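
The distinction between flapping and stuck-at-0 targets can be illustrated with a simplified model of an alert rule that has a `for:` hold. This is a sketch, not Prometheus' implementation: the function name `alert_states` is hypothetical, each list entry stands for one rule evaluation, and a 60-second evaluation interval is assumed.

```python
def alert_states(expr_true, eval_interval_s, for_s):
    """Model an alert with a `for` hold.

    expr_true: one boolean per rule evaluation (True means `up == 0` matched).
    Returns one of 'inactive', 'pending', 'firing' per evaluation.
    """
    states = []
    active_since = None  # index of the evaluation where the condition became true
    for i, matched in enumerate(expr_true):
        if not matched:
            active_since = None  # any non-matching evaluation resets the hold
            states.append("inactive")
            continue
        if active_since is None:
            active_since = i
        held_s = (i - active_since) * eval_interval_s
        states.append("firing" if held_s >= for_s else "pending")
    return states


flapping = [True, False] * 5   # condition alternates between evaluations
stuck = [True] * 8             # target is down at every evaluation

print(alert_states(flapping, 60, 0))
print(alert_states(flapping, 60, 300))
print(alert_states(stuck, 60, 300))
```

Under these assumptions, a 5m hold keeps the flapping case from ever reaching firing, but a target that is down at every evaluation still fires after five minutes. The hold changes when the alert fires, not whether the target is actually scrapeable.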

orfeas-k · 2023-08-09T09:08:11Z

Thank you @facundofc for letting us know about the issue not having been fixed. In order to better understand the issue, our team will need some more information.

Are there specific steps to follow in order to reproduce the issue?
Do you deploy these charms alone or through the bundle?
Is there a specific test environment or CI where this has run? That would be of great help too.
What would be the expected behaviour? I understand that we don't want the flapping between 1 and 0, but what would we expect the values to be? Also, should alerts move from the firing stage to a next one?

* fix: set telemetry config value, patch service, update tests

  This commit ensures the configuration value for the telemetry setting is correctly passed to the workload configuration value. With this we ensure the workload is correctly exposing metrics in the desired endpoint so they can be scraped by prometheus.

  With this change we also must ensure that the K8s Service correctly exposes the ports of the endpoints that this workload has (for metrics and the actual dex service).

  Finally, a bit of refactoring work had to be made in the unit test to match the recent changes, and the kubernetes_service_patch library this charm uses has been bumped v0 -> v1.

  Part of canonical/bundle-kubeflow#563

* tests: add an assertion for checking unit is available

  The test_prometheus_grafana_integration test case was doing queries to prometheus and checking the request returned successfully and that the application name and model was listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564.

  Part of canonical/bundle-kubeflow#564
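
An availability assertion of the kind described in the tests commit could look roughly like the following. This is a sketch under assumptions: the helper name `unit_is_available` and the example payload are invented here, the payload only mimics the JSON shape of a Prometheus instant query for `up`, and the real integration test in the charms is not shown in this thread.

```python
def unit_is_available(query_response: dict, application: str) -> bool:
    """Return True if every `up` sample for the application reports 1."""
    if query_response.get("status") != "success":
        return False
    samples = [
        result
        for result in query_response["data"]["result"]
        if result["metric"].get("juju_application") == application
    ]
    # An empty result means the target is not being scraped at all.
    return bool(samples) and all(float(s["value"][1]) == 1.0 for s in samples)


# Hand-written payload in the shape of an instant-query response for `up`.
EXAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "juju_application": "argo-controller"},
                "value": [1679925736.77, "1"],
            }
        ],
    },
}

assert unit_is_available(EXAMPLE_RESPONSE, "argo-controller")
```

Checking the sample value, rather than only that the query succeeded and the application is listed, is what would catch a target that is registered but reporting `up == 0`.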

* fix: set prometheus authentication variable

  This variable allows public access without authentication for prometheus metrics.

  Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

  The test_prometheus_grafana_integration test case was doing queries to prometheus and checking the request returned successfully and that the application name and model was listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564.

  Part of canonical/bundle-kubeflow#564

* fix: expose metrics port using kubernetes_service_patch lib

  This commit ensures the metrics port is exposed in the Kubernetes Service for the jupyter-controller using the kubernetes_service_patch lib. This makes the metrics endpoint reachable from an external prometheus scraper. This commit also changes the unit tests slightly to adapt to the added service patcher.

  Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

  The test_prometheus_grafana_integration test case was doing queries to prometheus and checking the request returned successfully and that the application name and model was listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564.

  Part of canonical/bundle-kubeflow#564

* fix: expose metrics port using kubernetes_service_patch lib

  This commit ensures the metrics port is exposed in the Kubernetes Service for the training-operator using the kubernetes_service_patch lib. This makes the metrics endpoint reachable from an external prometheus scraper. This commit also changes the unit tests slightly to adapt to the added service patcher.

  Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

  The test_prometheus_grafana_integration test case was doing queries to prometheus and checking the request returned successfully and that the application name and model was listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564.

  Part of canonical/bundle-kubeflow#564

* fix: correctly configure one scrape job to avoid firing alerts

  The metrics endpoint configuration had two scrape jobs, one for the regular metrics endpoint, and a second one based on a dynamic list of targets. The latter was causing the prometheus scraper to try and scrape metrics from *:80/metrics, which is not a valid endpoint. This was causing the UnitsUnavailable alert to fire constantly because that job was reporting back that the endpoint was not available.

  This new job was introduced by #94 with no apparent justification. Because the seldon charm has changed since that PR, and the endpoint it is configuring is not valid, this commit will remove the extra job.

  This commit also refactors the MetricsEndpointProvider instantiation and removes the metrics-port config option as this value should not change.

  Finally, this commit changes the alert rule interval from 0m to 5m, as this interval is more appropriate for production environments.

  Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

  The test_prometheus_grafana_integration test case was doing queries to prometheus and checking the request returned successfully and that the application name and model was listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564.

  Part of canonical/bundle-kubeflow#564
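
The effect of the extra scrape job can be sketched with plain dictionaries that use Prometheus' scrape-config keys (`metrics_path`, `static_configs`, `targets`). Everything concrete here is an assumption for illustration: the helper `scrape_endpoints` is hypothetical, port 8000 is a placeholder, and the second job is written as a static target even though the commit describes it as built from a dynamic list. The charm's actual job definitions are not shown in this thread.

```python
def scrape_endpoints(jobs):
    """List the target/path strings a scraper would request for the given jobs."""
    endpoints = []
    for job in jobs:
        path = job.get("metrics_path", "/metrics")  # Prometheus' default path
        for static in job.get("static_configs", []):
            endpoints.extend(f"{target}{path}" for target in static["targets"])
    return endpoints


# Hypothetical "before": the regular job plus an extra job resolving to port 80.
JOBS_BEFORE = [
    {"static_configs": [{"targets": ["*:8000"]}]},  # placeholder metrics port
    {"static_configs": [{"targets": ["*:80"]}]},    # nothing serves /metrics here
]

# "After": only the regular job remains.
JOBS_AFTER = JOBS_BEFORE[:1]

print(scrape_endpoints(JOBS_BEFORE))
print(scrape_endpoints(JOBS_AFTER))
```

Each scrape target gets its own `up` sample, so a job pointing at an endpoint that does not exist reports `up == 0` for as long as the job is configured, whatever the state of the real metrics endpoint.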

…101)

* fix: create a Service for the workload and fix the metrics collector

  This charm was not deploying any Service for the workload container, which is fine for its regular functions, but causes an issue when the Prometheus scraper tries to reach the metrics endpoint. This commit adds a Service that is attached to the WORKLOAD (the container inside the Pod that gets created by the StatefulSet we are applying manually) so that the metrics from it can be reached correctly.

  Because of that, the MetricsEndpointProvider's target has to be refactored to point to the correct service. In a previous version of this charm, the target was pointing to the charm's container, which does not have any metrics endpoint, causing the issues reported in canonical/bundle-kubeflow#564.

  Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

  The test_prometheus_grafana_integration test case was doing queries to prometheus and checking the request returned successfully and that the application name and model was listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564.

  Part of canonical/bundle-kubeflow#564

DnPlas · 2024-02-16T12:08:38Z

All PRs have been merged, so we can close this issue. Feel free to re-open if this is still an issue.

dparv mentioned this issue Mar 28, 2023

Changed interval for argo controller prometheus canonical/argo-operators#95

Merged

dparv mentioned this issue Mar 28, 2023

Changed unit_unavailable interval for prometheus canonical/training-operator#100

Merged

dparv mentioned this issue Mar 28, 2023

Changed unit_unavailable interval for prometheus canonical/dex-auth-operator#121

Merged

dparv mentioned this issue Mar 28, 2023

Changed unit_unavailable interval for prometheus canonical/notebook-operators#229

Merged

dparv mentioned this issue Mar 28, 2023

Changed unit_unavailable interval for prometheus canonical/metacontroller-operator#58

Merged

dparv mentioned this issue Mar 28, 2023

Changed unit_unavailable interval for prometheus canonical/minio-operator#123

Merged

i-chvets closed this as completed May 12, 2023

orfeas-k reopened this Aug 9, 2023

orfeas-k added the bug and question labels and removed the bug label Aug 9, 2023

DnPlas mentioned this issue Feb 13, 2024

fix: set telemetry config value, patch service, update tests (#185) canonical/dex-auth-operator#186

Merged

DnPlas mentioned this issue Feb 13, 2024

fix: set prometheus authentication variable (#157) canonical/minio-operator#158

Merged

DnPlas mentioned this issue Feb 13, 2024

fix: expose metrics port using kubernetes_service_patch lib (#332) canonical/notebook-operators#333

Merged

DnPlas mentioned this issue Feb 13, 2024

fix: expose metrics port using kubernetes_service_patch lib (#151) canonical/training-operator#152

Merged

DnPlas mentioned this issue Feb 13, 2024

fix: correct metrics path for MetricsEndpointProvider (#236) canonical/seldon-core-operator#240

Merged

DnPlas mentioned this issue Feb 13, 2024

fix: create a Service for the workload and fix the metrics collector … canonical/metacontroller-operator#102

Merged

DnPlas closed this as completed Feb 16, 2024

github-project-automation bot added this to MLOps Solution Issues Aug 29, 2024

github-project-automation bot moved this to Done in MLOps Solution Issues Aug 29, 2024
