UnitsAvailable alerts are firing constantly #564
Labels
bug
Something isn't working
Comments
dparv
added a commit
to dparv/argo-operators
that referenced
this issue
Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Closes-Bug: canonical/bundle-kubeflow#564
dparv
added a commit
to dparv/training-operator
that referenced
this issue
Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564
dparv
added a commit
to dparv/dex-auth-operator
that referenced
this issue
Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564
dparv
added a commit
to dparv/notebook-operators
that referenced
this issue
Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564
dparv
added a commit
to dparv/metacontroller-operator
that referenced
this issue
Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564
dparv
added a commit
to dparv/minio-operator
that referenced
this issue
Mar 28, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564
beliaev-maksim
pushed a commit
to canonical/argo-operators
that referenced
this issue
Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Closes-Bug: canonical/bundle-kubeflow#564 Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
beliaev-maksim
pushed a commit
to canonical/metacontroller-operator
that referenced
this issue
Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564 Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
beliaev-maksim
pushed a commit
to canonical/minio-operator
that referenced
this issue
Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564 Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
beliaev-maksim
pushed a commit
to canonical/dex-auth-operator
that referenced
this issue
Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564 Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
beliaev-maksim
pushed a commit
to canonical/training-operator
that referenced
this issue
Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564 Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
beliaev-maksim
pushed a commit
to canonical/notebook-operators
that referenced
this issue
Mar 29, 2023
As stated in issue canonical/bundle-kubeflow#564, the duration for alerts for argo is set to 0m, which is too low for prod environments. We need to change it to at least 5m to prevent the flapping behavior. Partial-Bug: canonical/bundle-kubeflow#564 Co-authored-by: Diko Parvanov <diko.parvanov@canonical.com>
Fix is merged.
Thank you @facundofc for letting us know that the issue has not been fixed. To better understand it, our team will need some more information.
orfeas-k
added
bug
Something isn't working
question
Further information is requested from the issue opener
and removed
bug
Something isn't working
labels
Aug 9, 2023
DnPlas
added a commit
to canonical/dex-auth-operator
that referenced
this issue
Feb 13, 2024
* fix: set telemetry config value, patch service, update tests
This commit ensures the configuration value for the telemetry setting is correctly passed to the workload configuration value. With this we ensure the workload is correctly exposing metrics in the desired endpoint so they can be scraped by prometheus. With this change we also must ensure that the K8s Service correctly exposes the ports of the endpoints that this workload has (for metrics and the actual dex service). Finally, a bit of refactoring work had to be done in the unit tests to match the recent changes, and the kubernetes_service_patch library this charm uses has been bumped v0 -> v1. Part of canonical/bundle-kubeflow#563
* tests: add an assertion for checking unit is available
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
DnPlas
added a commit
to canonical/minio-operator
that referenced
this issue
Feb 13, 2024
* fix: set prometheus authentication variable
This variable allows public access without authentication for prometheus metrics. Part of canonical/bundle-kubeflow#564
* tests: add an assertion for checking unit is available
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
DnPlas
added a commit
to canonical/notebook-operators
that referenced
this issue
Feb 13, 2024
* fix: expose metrics port using kubernetes_service_patch lib
This commit ensures the metrics port is exposed in the Kubernetes Service for the jupyter-controller using the kubernetes_service_patch lib. This makes the metrics endpoint reachable from an external prometheus scraper. This commit also changes the unit tests slightly to adapt to the added service patcher. Part of canonical/bundle-kubeflow#564
* tests: add an assertion for checking unit is available
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
DnPlas
added a commit
to canonical/training-operator
that referenced
this issue
Feb 13, 2024
* fix: expose metrics port using kubernetes_service_patch lib
This commit ensures the metrics port is exposed in the Kubernetes Service for the training-operator using the kubernetes_service_patch lib. This makes the metrics endpoint reachable from an external prometheus scraper. This commit also changes the unit tests slightly to adapt to the added service patcher. Part of canonical/bundle-kubeflow#564
* tests: add an assertion for checking unit is available
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
DnPlas
added a commit
to canonical/seldon-core-operator
that referenced
this issue
Feb 13, 2024
The metrics endpoint configuration had two scrape jobs, one for the regular metrics endpoint, and a second one based on a dynamic list of targets. The latter was causing the prometheus scraper to try and scrape metrics from *:80/metrics, which is not a valid endpoint. This was causing the UnitsUnavailable alert to fire constantly because that job was reporting back that the endpoint was not available. This new job was introduced by #94 with no apparent justification. Because the seldon charm has changed since that PR, and the endpoint it is configuring is not valid, this commit will remove the extra job. This commit also refactors the MetricsEndpointProvider instantiation and removes the metrics-port config option as this value should not change. Finally, this commit changes the alert rule interval from 0m to 5m, as this interval is more appropriate for production environments. Part of canonical/bundle-kubeflow#564
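To illustrate the kind of misconfiguration described in this commit message, here is a minimal Python sketch that flags scrape targets pointing at a port the workload does not serve metrics on. The job layout mirrors Prometheus `static_configs`; the helper name `targets_outside_ports` and the port value `8000` are hypothetical and are not taken from the seldon charm.

```python
from typing import Dict, List, Set


def targets_outside_ports(jobs: List[Dict], metrics_ports: Set[int]) -> List[str]:
    """Return every static target whose port is not a known metrics port.

    Assumes each target is written as "<host>:<port>".
    """
    bad = []
    for job in jobs:
        for static_config in job.get("static_configs", []):
            for target in static_config.get("targets", []):
                port = int(target.rsplit(":", 1)[1])
                if port not in metrics_ports:
                    bad.append(target)
    return bad


METRICS_PORT = 8000  # hypothetical value, for illustration only

# Two jobs, as described above: the regular endpoint plus an extra one on port 80.
jobs_before = [
    {"static_configs": [{"targets": [f"*:{METRICS_PORT}"]}]},
    {"static_configs": [{"targets": ["*:80"]}]},
]
# After removing the extra job, only the regular endpoint remains.
jobs_after = jobs_before[:1]

print(targets_outside_ports(jobs_before, {METRICS_PORT}))
print(targets_outside_ports(jobs_after, {METRICS_PORT}))
```

A check like this would only catch a wrong port; it says nothing about whether the endpoint actually answers, which is what the `up` metric reports.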
DnPlas
added a commit
to canonical/seldon-core-operator
that referenced
this issue
Feb 13, 2024
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
skip: fix test
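As a rough idea of what such an assertion could look like (this is a sketch, not the code from the referenced commits), the function below checks a Prometheus instant-query response for `up` and fails unless every series for the application reports 1. The function name and the sample responses are hypothetical; the `juju_application` label comes from the query quoted in the issue.

```python
from typing import Any, Dict


def assert_unit_available(response: Dict[str, Any], app_name: str) -> None:
    """Fail unless every `up` series for app_name in the response equals 1."""
    assert response["status"] == "success"
    results = [
        r for r in response["data"]["result"]
        if r["metric"].get("juju_application") == app_name
    ]
    assert results, f"no `up` series found for {app_name}"
    for r in results:
        # Instant-vector samples are [timestamp, "<value as string>"].
        _, value = r["value"]
        assert float(value) == 1.0, f"{app_name} unit reported up={value}"


def sample_response(value: str) -> Dict[str, Any]:
    """Build a hypothetical instant-query response with a single `up` series."""
    return {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {
                    "metric": {"__name__": "up", "juju_application": "argo-controller"},
                    "value": [1707820000.0, value],
                }
            ],
        },
    }


assert_unit_available(sample_response("1"), "argo-controller")  # passes silently
```

Checking only that the query succeeds and that the application is listed would still pass with `up` at 0, which is why the extra assertion matters.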
DnPlas
added a commit
to canonical/seldon-core-operator
that referenced
this issue
Feb 13, 2024
* fix: correctly configure one scrape job to avoid firing alerts
The metrics endpoint configuration had two scrape jobs, one for the regular metrics endpoint, and a second one based on a dynamic list of targets. The latter was causing the prometheus scraper to try and scrape metrics from *:80/metrics, which is not a valid endpoint. This was causing the UnitsUnavailable alert to fire constantly because that job was reporting back that the endpoint was not available. This new job was introduced by #94 with no apparent justification. Because the seldon charm has changed since that PR, and the endpoint it is configuring is not valid, this commit will remove the extra job. This commit also refactors the MetricsEndpointProvider instantiation and removes the metrics-port config option as this value should not change. Finally, this commit changes the alert rule interval from 0m to 5m, as this interval is more appropriate for production environments. Part of canonical/bundle-kubeflow#564
* tests: add an assertion for checking unit is available
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
DnPlas
added a commit
to canonical/dex-auth-operator
that referenced
this issue
Feb 13, 2024
…186)
* fix: set telemetry config value, patch service, update tests
This commit ensures the configuration value for the telemetry setting is correctly passed to the workload configuration value. With this we ensure the workload is correctly exposing metrics in the desired endpoint so they can be scraped by prometheus. With this change we also must ensure that the K8s Service correctly exposes the ports of the endpoints that this workload has (for metrics and the actual dex service). Finally, a bit of refactoring work had to be done in the unit tests to match the recent changes, and the kubernetes_service_patch library this charm uses has been bumped v0 -> v1. Part of canonical/bundle-kubeflow#563
* tests: add an assertion for checking unit is available
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
DnPlas
added a commit
to canonical/metacontroller-operator
that referenced
this issue
Feb 13, 2024
…101)
* fix: create a Service for the workload and fix the metrics collector
This charm was not deploying any Service for the workload container, which is fine for its regular functions, but causes an issue when the Prometheus scraper tries to reach the metrics endpoint. This commit adds a Service that is attached to the WORKLOAD (the container inside the Pod that gets created by the StatefulSet we are applying manually) so that the metrics from it can be reached correctly. Because of that, the MetricsEndpointProvider's target has to be refactored to point to the correct service. In a previous version of this charm, the target was pointing to the charm's container, which does not have any metrics endpoint, causing the issues reported in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
* tests: add an assertion for checking unit is available
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
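The retargeting described in this commit message can be pictured with a small Python sketch: build the scrape target from the workload's Service rather than from the charm container. The helper names, the Service name, the namespace and the port below are all hypothetical; only the `<service>.<namespace>.svc` form of in-cluster Service DNS names is standard Kubernetes behaviour.

```python
from typing import Dict


def workload_metrics_target(service_name: str, namespace: str, port: int) -> str:
    """Return a scrape target that addresses a Service by its in-cluster DNS name."""
    return f"{service_name}.{namespace}.svc:{port}"


def scrape_job(target: str, metrics_path: str = "/metrics") -> Dict:
    """Build a single static scrape job for the given target."""
    return {"metrics_path": metrics_path, "static_configs": [{"targets": [target]}]}


# Hypothetical names and port, for illustration only.
job = scrape_job(workload_metrics_target("metacontroller-workload", "kubeflow", 9999))
print(job)
```

The point of the sketch is only that the target names the Service in front of the workload container, so the scraper reaches a process that actually serves metrics.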
DnPlas
added a commit
to canonical/metacontroller-operator
that referenced
this issue
Feb 14, 2024
…101) (#102)
* fix: create a Service for the workload and fix the metrics collector
This charm was not deploying any Service for the workload container, which is fine for its regular functions, but causes an issue when the Prometheus scraper tries to reach the metrics endpoint. This commit adds a Service that is attached to the WORKLOAD (the container inside the Pod that gets created by the StatefulSet we are applying manually) so that the metrics from it can be reached correctly. Because of that, the MetricsEndpointProvider's target has to be refactored to point to the correct service. In a previous version of this charm, the target was pointing to the charm's container, which does not have any metrics endpoint, causing the issues reported in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
* tests: add an assertion for checking unit is available
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
DnPlas
added a commit
to canonical/seldon-core-operator
that referenced
this issue
Feb 14, 2024
* fix: correctly configure one scrape job to avoid firing alerts
The metrics endpoint configuration had two scrape jobs, one for the regular metrics endpoint, and a second one based on a dynamic list of targets. The latter was causing the prometheus scraper to try and scrape metrics from *:80/metrics, which is not a valid endpoint. This was causing the UnitsUnavailable alert to fire constantly because that job was reporting back that the endpoint was not available. This new job was introduced by #94 with no apparent justification. Because the seldon charm has changed since that PR, and the endpoint it is configuring is not valid, this commit will remove the extra job. This commit also refactors the MetricsEndpointProvider instantiation and removes the metrics-port config option as this value should not change. Finally, this commit changes the alert rule interval from 0m to 5m, as this interval is more appropriate for production environments. Part of canonical/bundle-kubeflow#564
* tests: add an assertion for checking unit is available
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
DnPlas
added a commit
to canonical/notebook-operators
that referenced
this issue
Feb 14, 2024
* fix: expose metrics port using kubernetes_service_patch lib
This commit ensures the metrics port is exposed in the Kubernetes Service for the jupyter-controller using the kubernetes_service_patch lib. This makes the metrics endpoint reachable from an external prometheus scraper. This commit also changes the unit tests slightly to adapt to the added service patcher. Part of canonical/bundle-kubeflow#564
* tests: add an assertion for checking unit is available
The test_prometheus_grafana_integration test case was querying prometheus and checking that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
All PRs have been merged, so we can close this issue. Feel free to re-open it if this is still an issue.
On a recent deployment we're seeing these alerts firing all the time (literally, stuck in "firing").
Looking at the `up` metric (which these alert rules query), we see that these are alternating between 1 and 0 every 45 seconds (this is a sample from the argo controller, the query being `up{juju_application="argo-controller",juju_..."="..."}[10m]`).
Incidentally to this flapping behavior, the duration for these alerts (at least for argo) is set to `0m`, which seems a bit too sensitive for production envs.
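To make the effect of the alert duration concrete, here is a small Python simulation of how a rule's `for` duration interacts with a flapping `up` series. It is a simplified model, not Prometheus code: it assumes the rule expression is equivalent to `up < 1` (the issue only says the rules query `up`), that the alert is pending while the condition holds, fires once it has held for the full duration, and resets as soon as one evaluation sees `up` back at 1. The function name and the evaluation interval are illustrative.

```python
from typing import List


def alert_states(up_samples: List[int], eval_interval_s: int, for_s: int) -> List[str]:
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, up in enumerate(up_samples):
        now = i * eval_interval_s
        if up < 1:  # assumed alert condition
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_s else "pending")
        else:
            active_since = None  # condition cleared, the timer starts over
            states.append("inactive")
    return states


# `up` alternating between 1 and 0 on successive evaluations, as in the issue.
flapping = [1, 0] * 10

print(alert_states(flapping, eval_interval_s=45, for_s=0).count("firing"))    # 10
print(alert_states(flapping, eval_interval_s=45, for_s=300).count("firing"))  # 0
```

Under these assumptions a `0m` duration fires on every dip, while `5m` never fires for this flapping pattern. A longer duration only hides the flapping; it does not explain why `up` keeps dropping to 0.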