fix: correct metrics path for MetricsEndpointProvider (#236) #240

Merged
merged 1 commit into from
Feb 14, 2024

@DnPlas DnPlas commented Feb 13, 2024

  • fix: correctly configure one scrape job to avoid firing alerts

The metrics endpoint configuration had two scrape jobs: one for the regular metrics endpoint and a second one based on a dynamic list of targets. The latter caused the Prometheus scraper to try to scrape metrics from *:80/metrics, which is not a valid endpoint. As a result, the UnitsUnavailable alert fired constantly, because that job kept reporting the endpoint as unavailable. The second job was introduced by #94 with no apparent justification. Because the seldon charm has changed since that PR and the endpoint the job configures is not valid, this commit removes the extra job.

This commit also refactors the MetricsEndpointProvider instantiation and removes the metrics-port config option as this value should not change.

Finally, this commit changes the alert rule interval from 0m to 5m, as this interval is more appropriate for production environments.

Part of canonical/bundle-kubeflow#564
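As a rough illustration of the single-job configuration described above, the sketch below builds one static scrape job as plain Python data. The port and path values are placeholders chosen for the example (the PR only states that the port is fixed and that *:80/metrics was invalid), and `build_scrape_jobs` is a hypothetical helper, not a function from the charm.

```python
# Placeholder values for illustration only; the PR does not state the
# actual port, only that it is no longer a config option.
METRICS_PORT = 8080
METRICS_PATH = "/metrics"


def build_scrape_jobs(port: int = METRICS_PORT, path: str = METRICS_PATH) -> list:
    """Return a job list with exactly one static scrape job (hypothetical helper)."""
    return [
        {
            "metrics_path": path,
            # "*" stands for every unit address; only the port is fixed.
            "static_configs": [{"targets": [f"*:{port}"]}],
        }
    ]


jobs = build_scrape_jobs()

# In a charm, a list like this would typically be passed to the scrape
# library, e.g. MetricsEndpointProvider(charm, jobs=jobs). That call is
# left as a comment because it needs a running charm context.
```

With a single static job there is no second target list for Prometheus to probe, so no job reports an unreachable endpoint.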

  • tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case queried Prometheus and checked that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
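The availability check described above could look roughly like the following. This is a minimal sketch, not the test code from this PR: it assumes the test has already fetched the JSON body of a Prometheus instant query for the `up` metric, and that series carry a `juju_application` label. `unit_is_available` and the sample payload are hypothetical.

```python
def unit_is_available(response: dict, application: str) -> bool:
    """Check that every `up` series for the application reports 1.

    `response` is assumed to be the decoded JSON body of a Prometheus
    instant query (/api/v1/query) for the `up` metric.
    """
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    matching = [
        r for r in results
        if r.get("metric", {}).get("juju_application") == application
    ]
    # Instant-query samples are [timestamp, "value"]; the value is a string.
    return bool(matching) and all(r["value"][1] == "1" for r in matching)


# Hypothetical payload shaped like an instant-query response.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "juju_application": "seldon-controller-manager"},
                "value": [1707900000.0, "1"],
            }
        ],
    },
}
```

Requiring at least one matching series guards against a query that succeeds but returns nothing, which would otherwise pass silently.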

@DnPlas DnPlas added the backport Backport a change from main into branch label Feb 13, 2024
@DnPlas DnPlas requested a review from a team as a code owner February 13, 2024 14:46

DnPlas commented Feb 14, 2024

CI is failing because of canonical/bundle-kubeflow#813; #241 should fix it.

@DnPlas DnPlas merged commit 8631493 into track/1.17 Feb 14, 2024
17 checks passed
@DnPlas DnPlas deleted the KF-1647-backport-fix-564 branch February 14, 2024 15:36
Labels: backport (Backport a change from main into branch), Libraries: Out of sync