Prefer kube-scheduler's resource metrics to kube-state-metrics' #815

Open · wants to merge 3 commits into master from mon-2823

Conversation

@rexagod (Collaborator) commented Jan 10, 2023

Use kube-scheduler's metrics instead of kube-state-metrics, as they are more precise.

Refer to the links below for more details.


Also, refactor kube_pod_status_phase, since statuses other than "Pending" or "Running" are excluded or deprecated.

rules/apps.libsonnet — 3 resolved (outdated) review comments
@paulfantom (Member) left a comment

As much as I like this change, I can see a problem with it in managed solutions (like EKS) where access to kube-scheduler is forbidden. In those cases, alerts and dashboards based on kube-scheduler data won't be useful at all. Given that, can we use OR statements instead of deprecating the kube-state-metrics data? I think something like kube_pod_resource_request OR kube_pod_container_resource_request should do the trick.
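
For reference, a minimal sketch of the OR fallback being proposed, using the %(...)s selector placeholders that the mixin's rules already use (whether the two metrics can be OR'd directly is debated further down in this thread, since their label sets differ):

  kube_pod_resource_request{resource="memory", %(kubeSchedulerSelector)s}
  or
  kube_pod_container_resource_requests{resource="memory", %(kubeStateMetricsSelector)s}

On managed clusters where kube-scheduler metrics are unavailable, the left-hand side returns no series and the expression falls back to the kube-state-metrics data.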

@rexagod (Collaborator, Author) commented Feb 14, 2023

Note to self: Revisit this issue once prometheus/prometheus#9624 is implemented.

@rexagod (Collaborator, Author) commented Feb 14, 2023

Just wondering if there's a better (or any) way to format embedded PromQL expressions in *sonnet files? The existing tooling seems to ignore the PromQL expressions within such files.

@rexagod rexagod force-pushed the mon-2823 branch 2 times, most recently from 8102d0c to eb73d96 on March 11, 2024 at 23:13
@rexagod rexagod changed the title from "Use kube-scheduler's metrics instead of kube-state-metrics" to "Use kube-scheduler's metrics instead of kube-state-metrics'" on Mar 12, 2024
@rexagod rexagod force-pushed the mon-2823 branch 2 times, most recently from ce7c3f8 to 817b784 on March 12, 2024 at 14:06
github-actions bot commented

This PR has been automatically marked as stale because it has not had any activity in the past 30 days.

The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity.

Thank you for your contributions!

@github-actions github-actions bot added the stale label Sep 15, 2024
@github-actions github-actions bot closed this Sep 23, 2024
@rexagod rexagod reopened this Sep 23, 2024
@rexagod (Collaborator, Author) commented Sep 23, 2024

Rebasing.

@rexagod rexagod force-pushed the mon-2823 branch 2 times, most recently from 8f38753 to 8102d0c on September 23, 2024 at 19:01
@rexagod rexagod force-pushed the mon-2823 branch 2 times, most recently from 181ee93 to 6997e03 on September 23, 2024 at 22:44
Refactor kube_pod_status_phase, since statuses other than "Pending" or
"Running" are excluded or deprecated.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
@rexagod rexagod changed the title from "Use kube-scheduler's metrics instead of kube-state-metrics'" to "Prefer kube-scheduler's resource metrics to kube-state-metrics'" on Sep 23, 2024
@github-actions github-actions bot removed the stale label Sep 24, 2024
Collaborator left a comment

There seem to be quite a few breaking changes here, from removing recording rules and creating new ones (as a result of changing the rule names). That would affect any queries/dashboards written outside of this repo (i.e. downstream projects).

I would suggest either of the below to maintain compatibility:

  • keeping the existing rules and creating new rules with the desired changes, or:
  • keeping the existing rule names and updating the rule expressions with the changes

What do you think?

@rexagod (Collaborator, Author) commented Sep 25, 2024

Added the recording rules back in, makes sense. PTAL. :)

Collaborator left a comment

Yeah, this feels better to me, thanks! Makes it easier to discuss the new metrics, too.

Contributor left a comment

I'd vote for creating new recording rules based on kube-scheduler metrics. For instance:

  • namespace_memory:kube_pod_container_resource_requests:sum would continue to work on the kube-state-metrics kube_pod_container_resource_requests metric.
  • there would be a new namespace_memory:kube_pod_resource_request:sum recording rule for the kube-scheduler kube_pod_resource_request metric.

Wherever we use namespace_memory:kube_pod_container_resource_requests:sum today, it would be changed to

namespace_memory:kube_pod_resource_request:sum or namespace_memory:kube_pod_container_resource_requests:sum

That would work whichever of the kube-state-metrics and kube-scheduler metrics are present, while remaining efficient. Ideally there could be config options in the mixin to opt out of the kube-state-metrics or kube-scheduler recording rules.
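
A rough sketch of what the new kube-scheduler-based recording rule could look like, following the record/expr style used elsewhere in this file (the aggregation labels and the closing % $._config are assumptions for illustration, not taken from the PR):

  {
    record: 'namespace_memory:kube_pod_resource_request:sum',
    expr: |||
      sum by (namespace, %(clusterLabel)s) (
        kube_pod_resource_request{resource="memory", %(kubeSchedulerSelector)s}
      )
    ||| % $._config,
  },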

kube_pod_container_resource_requests{resource="memory",%(kubeStateMetricsSelector)s}
) * on(namespace, pod, %(clusterLabel)s) group_left() max by (namespace, pod, %(clusterLabel)s) (
kube_pod_status_phase{phase=~"Pending|Running"} == 1
kube_pod_resource_request{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_requests{resource="memory",%(kubeStateMetricsSelector)s}
Contributor left a comment

I'm not quite sure that it will do the right thing, since kube_pod_resource_request and kube_pod_container_resource_requests don't have exactly the same labels, if I understand correctly.

@rexagod (Collaborator, Author) commented Sep 25, 2024

I could be wrong here, but wouldn't the differing labelsets (kube-scheduler metrics potentially having the additional scheduler and priority labels) after the or be reduced to the set of specified labels by the max operation, so that an ignoring (scheduler, priority) would have no effect on the final result?
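
For illustration (a sketch based on this reasoning, not code from the PR): an aggregation keeps only the labels it names, so any extra scheduler or priority labels on the kube-scheduler series are dropped by the outer max, e.g.:

  max by (namespace, pod, %(clusterLabel)s) (
    kube_pod_resource_request{resource="memory", %(kubeSchedulerSelector)s}
  )

The resulting series carry only namespace, pod, and the cluster label, which is why an additional ignoring (scheduler, priority) would be redundant there.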

{
record: 'cluster:namespace:pod_memory:active:kube_pod_resource_request_or_kube_pod_container_resource_requests',
expr: |||
(kube_pod_resource_request{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_requests{resource="memory",%(kubeStateMetricsSelector)s})
Collaborator left a comment

As kube_pod_container_resource_requests is container-level and kube_pod_resource_request is not, is it worth aggregating kube_pod_container_resource_requests up to the pod level so that these rules always produce consistent labelsets?

So, for example, maybe wrapping the whole expression in a sum by (%(clusterLabel)s, %(namespaceLabel)s, pod, node) (...)?

I wonder if that would make the data a bit easier to work with, regardless of whether you have scheduler or KSM data.
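
A sketch of the pod-level aggregation being suggested (aggregating each side before the or, rather than one sum around the whole union, is an assumption added here to avoid double-counting when both metric families are scraped; the node label is assumed to exist on both metrics):

  sum by (%(clusterLabel)s, %(namespaceLabel)s, pod, node) (
    kube_pod_resource_request{resource="memory", %(kubeSchedulerSelector)s}
  )
  or
  sum by (%(clusterLabel)s, %(namespaceLabel)s, pod, node) (
    kube_pod_container_resource_requests{resource="memory", %(kubeStateMetricsSelector)s}
  )

Either way, consumers of the recorded series would see the same labelset regardless of which source supplied the data.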

Contributor left a comment

I don't think that this rule should be changed: as you said, kube-scheduler metrics have a pod-level granularity while this recorded metric is per-container.

sum by (namespace, %(clusterLabel)s) (
sum by (namespace, pod, %(clusterLabel)s) (
max by (namespace, pod, container, %(clusterLabel)s) (
kube_pod_container_resource_limits{resource="memory",%(kubeStateMetricsSelector)s}
Collaborator left a comment

Think you maybe intended to or in the kube_pod_resource_limit metric here?

{
record: 'cluster:namespace:pod_cpu:active:kube_pod_resource_limit_or_kube_pod_container_resource_limits',
expr: |||
(kube_pod_resource_limit{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s})
Collaborator left a comment

Suggested change
(kube_pod_resource_limit{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s})
(kube_pod_resource_limit{resource="cpu",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s})

sum by (namespace, %(clusterLabel)s) (
sum by (namespace, pod, %(clusterLabel)s) (
max by (namespace, pod, container, %(clusterLabel)s) (
kube_pod_resource_limit{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s}
Collaborator left a comment

Suggested change
kube_pod_resource_limit{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s}
kube_pod_resource_limit{resource="cpu",%(kubeSchedulerSelector)s} or kube_pod_container_resource_limits{resource="cpu",%(kubeStateMetricsSelector)s}

{
record: 'cluster:namespace:pod_memory:active:kube_pod_resource_request_or_kube_pod_container_resource_requests',
expr: |||
(kube_pod_resource_request{resource="memory",%(kubeSchedulerSelector)s} or kube_pod_container_resource_requests{resource="memory",%(kubeStateMetricsSelector)s})
Contributor left a comment

I don't think that this rule should be changed: as you said, kube-scheduler metrics have a pod-level granularity while this recorded metric is per-container.

@@ -54,7 +54,7 @@ Jsonnet offers the ability to parameterise configuration, allowing for basic cus
alert: "KubePodNotReady",
expr: |||
sum by (namespace, pod) (
kube_pod_status_phase{%(kubeStateMetricsSelector)s, phase!~"Running|Succeeded"}
kube_pod_status_phase{%(kubeStateMetricsSelector)s, phase!~"Running"}
Contributor left a comment

not sure about this change, is it needed?

@@ -33,7 +33,7 @@
expr: |||
sum by (namespace, pod, %(clusterLabel)s) (
max by(namespace, pod, %(clusterLabel)s) (
kube_pod_status_phase{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s, phase=~"Pending|Unknown|Failed"}
kube_pod_status_phase{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s, phase="Pending"}
Contributor left a comment

same here
