[stacked 5/5] metrics: add topology-aware policy metrics collection. #406

Open: wants to merge 11 commits into main


klihub
Collaborator

@klihub klihub commented Nov 11, 2024

Notes: This PR is stacked on top of #405.

Implement metrics collection for the topology-aware policy. Currently, for each pool/zone we collect:
- name, cpuset and memset
- shared pool capacity, allocation, available amount
- memory capacity, allocation, available amount
- number of containers
- number of containers in the shared pool

@klihub klihub changed the title [4/4] metrics: add topology-aware policy metrics collection. [5/5] metrics: add topology-aware policy metrics collection. Nov 11, 2024

@pfl pfl left a comment


LGTM

pkg/resmgr/lib/memory/zones.go (review thread resolved)
@klihub klihub force-pushed the metrics/topology-aware branch 3 times, most recently from 3bb93cf to 4ea6ec6 Compare November 13, 2024 09:44
@klihub klihub marked this pull request as ready for review November 13, 2024 13:30
@klihub klihub changed the title [5/5] metrics: add topology-aware policy metrics collection. [stack: 5/5] metrics: add topology-aware policy metrics collection. Nov 13, 2024
@klihub klihub changed the title [stack: 5/5] metrics: add topology-aware policy metrics collection. [stacked 5/5] metrics: add topology-aware policy metrics collection. Nov 13, 2024
Rework our metrics collector registry to take care of most of
the necessary bits of metrics registration, collection and
gathering.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Update cgroupstats collector for the reworked metrics registry.
Split out automatic registration to a register subpackage.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Remove the old resmgr-triggered polling of policy metrics
and the old resmgr-level polling policy metrics collector.
Implement policy metrics collection in the policy package
itself.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Remove the old opencensus-based prometheus exporter. Rework
prometheus exporting using our updated metrics registry and
a promhttp /metrics handler.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Add configuration bits for controlling which metrics are
collected. Enable collection of policy metrics by default.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Remove obsolete and unused option entries. Give a warning about
using the now-obsolete '-metrics-interval' argument. It's used
unconditionally by our existing Helm charts, so we'll phase it
out a bit more gently.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
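
One way to keep accepting an obsolete option while warning about it, as the commit above describes for '-metrics-interval', is to leave the flag defined and check after parsing whether it was set. The following is only a sketch using the standard `flag` package; `parseArgs`, `deprecatedFlags` and the hint text are hypothetical names for illustration, not the code in this PR.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// deprecatedFlags maps obsolete flag names to a hint shown when one is used.
// Only the flag name comes from the commit message; the hint is made up.
var deprecatedFlags = map[string]string{
	"metrics-interval": "this option is obsolete",
}

// parseArgs parses args, tolerating obsolete flags, and returns one
// warning for each obsolete flag that was actually given.
func parseArgs(args []string) ([]string, error) {
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	// Keep the obsolete flag defined so existing command lines still parse.
	_ = fs.Duration("metrics-interval", 0, "obsolete, kept for compatibility")

	if err := fs.Parse(args); err != nil {
		return nil, err
	}

	var warnings []string
	// Visit only walks flags that were set on the command line.
	fs.Visit(func(f *flag.Flag) {
		if hint, ok := deprecatedFlags[f.Name]; ok {
			warnings = append(warnings, fmt.Sprintf("warning: -%s: %s", f.Name, hint))
		}
	})
	return warnings, nil
}

func main() {
	warnings, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	for _, w := range warnings {
		fmt.Fprintln(os.Stderr, w)
	}
}
```

Keeping the flag defined means deployments that pass it unconditionally, such as the existing Helm charts mentioned above, keep starting up while the option is phased out.
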
Add a metrics/collectors subpackage. When imported it pulls
in and registers the fairly standard buildinfo, process and
golang runtime collectors. Turn on the build info collector
by default.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Simplify the policy-backend metrics collection interface,
reducing it to a single GetMetrics() call and a returned
Metrics interface which simply implements the collector-
like Describe() and Collect() interfaces. Update policy
implementations accordingly.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
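
The shape of the simplified interface described in the commit above can be sketched roughly as follows. To stay dependency-free, `Desc` and `Sample` are stand-ins for the `prometheus.Desc` and `prometheus.Metric` types a real collector would use; `Backend`, `dummyPolicy` and `gather` are likewise hypothetical names. Only `GetMetrics()`, `Describe()` and `Collect()` are taken from the commit message.

```go
package main

import "fmt"

// Desc and Sample are stand-ins for prometheus.Desc and prometheus.Metric,
// used only to keep this sketch free of external dependencies.
type Desc struct{ Name, Help string }

type Sample struct {
	Desc  *Desc
	Value float64
}

// Metrics is the collector-like interface returned by a policy.
type Metrics interface {
	Describe(chan<- *Desc)
	Collect(chan<- Sample)
}

// Backend is a hypothetical policy backend reduced to the single call.
type Backend interface {
	GetMetrics() Metrics
}

// dummyPolicy is an illustrative policy that only tracks a container count.
type dummyPolicy struct{ containers int }

type dummyMetrics struct {
	desc   *Desc
	policy *dummyPolicy
}

func (p *dummyPolicy) GetMetrics() Metrics {
	return &dummyMetrics{
		desc:   &Desc{Name: "policy_containers", Help: "Number of containers."},
		policy: p,
	}
}

func (m *dummyMetrics) Describe(ch chan<- *Desc) { ch <- m.desc }

func (m *dummyMetrics) Collect(ch chan<- Sample) {
	ch <- Sample{Desc: m.desc, Value: float64(m.policy.containers)}
}

// gather runs one collection round and returns everything that was emitted.
func gather(b Backend) []Sample {
	ch := make(chan Sample)
	go func() {
		b.GetMetrics().Collect(ch)
		close(ch)
	}()
	var out []Sample
	for s := range ch {
		out = append(out, s)
	}
	return out
}

func main() {
	for _, s := range gather(&dummyPolicy{containers: 3}) {
		fmt.Printf("%s = %g\n", s.Desc.Name, s.Value)
	}
}
```
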
Implement collection of policy 'system' prometheus metrics.

For each memory node we collect
  - memory capacity
  - memory usage
  - number of containers sharing the node

For each CPU core we collect
  - allocation from that core
  - number of containers sharing the core

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
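
The "number of containers sharing" counts listed in the commit above amount to a tally over each container's CPU and memory placement. A minimal sketch, assuming a hypothetical `Container` type that records which cores and memory nodes a container is placed on (the actual policy data structures are not shown on this page):

```go
package main

import "fmt"

// Container is a hypothetical, pared-down view of a container's placement.
type Container struct {
	Name     string
	CPUs     []int // IDs of CPU cores the container may run on
	MemNodes []int // IDs of memory nodes the container may allocate from
}

// sharingCounts returns, per CPU core and per memory node, how many
// containers are placed on it.
func sharingCounts(containers []Container) (perCPU, perNode map[int]int) {
	perCPU = make(map[int]int)
	perNode = make(map[int]int)
	for _, c := range containers {
		for _, cpu := range c.CPUs {
			perCPU[cpu]++
		}
		for _, node := range c.MemNodes {
			perNode[node]++
		}
	}
	return perCPU, perNode
}

func main() {
	perCPU, perNode := sharingCounts([]Container{
		{Name: "a", CPUs: []int{0, 1}, MemNodes: []int{0}},
		{Name: "b", CPUs: []int{1, 2}, MemNodes: []int{0, 1}},
	})
	fmt.Println("containers per core:", perCPU)
	fmt.Println("containers per memory node:", perNode)
}
```
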
Add ZoneAvailable to return the amount of available/allocatable
memory in a zone, capped by the amount of free memory in any of
the ancestors of a zone.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
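
The capping rule described in the commit above can be illustrated with a simplified model: a zone's available memory is its own free memory, reduced to the smallest free amount found in any of its ancestors. The sketch below assumes a plain parent-pointer tree and assumes that each zone's usage already includes allocations made in its descendants. The `Zone` type and its fields are hypothetical and do not reflect the actual code in pkg/resmgr/lib/memory.

```go
package main

import "fmt"

// Zone is a hypothetical memory zone in a tree; Parent is nil for the root.
type Zone struct {
	Name     string
	Capacity int64
	Usage    int64 // assumed to include allocations made in descendant zones
	Parent   *Zone
}

// free returns the unallocated memory of the zone itself, never negative.
func (z *Zone) free() int64 {
	if f := z.Capacity - z.Usage; f > 0 {
		return f
	}
	return 0
}

// ZoneAvailable returns the zone's free memory, capped by the free memory
// of every ancestor on the path to the root.
func ZoneAvailable(z *Zone) int64 {
	avail := z.free()
	for p := z.Parent; p != nil; p = p.Parent {
		if f := p.free(); f < avail {
			avail = f
		}
	}
	return avail
}

func main() {
	root := &Zone{Name: "root", Capacity: 100, Usage: 90}
	node0 := &Zone{Name: "node0", Capacity: 50, Usage: 10, Parent: root}
	// node0 has 40 free on its own, but root only has 10 left.
	fmt.Println(ZoneAvailable(node0))
}
```
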
Implement collection of per-zone prometheus metrics.
Currently we collect the following for each pool/zone:
  - name, cpuset and memset
  - shared pool capacity, allocation, available amount
  - memory capacity, allocation, available amount
  - number of containers
  - number of containers in the shared pool

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
2 participants