0037: Host metrics

  • Stage: 0 (strawman)
  • Date: 2023-03-01

Fields

The following high-level metrics should be reported per host to indicate its health:

  • CPU used (in %) and load
  • Memory used (in %, used, total)
  • Disk usage (in %) and I/O (as a summary)
  • Network (traffic in / out)

This translates to the following metrics. The goal is to keep the set of fields as small as possible.

  • host.cpu.system.norm.pct
  • host.cpu.user.norm.pct
  • host.fsstats.total_size.used (in bytes)
  • host.fsstats.total_size.total (in bytes)
  • host.fsstats.total_size.used.pct
  • host.load.norm.1
  • host.load.norm.5
  • host.load.norm.15
  • host.memory.actual.used.bytes
  • host.memory.actual.used.pct
  • host.memory.total
  • host.network.egress.bytes
  • host.network.ingress.bytes
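
As an illustration of how the fields above might be populated, here is a minimal Python sketch. It rests on assumptions the RFC does not state: that "norm" means the value is divided by the number of CPU cores, and that "pct" values are ratios between 0 and 1. The function name and the keys of the `raw` input are hypothetical.

```python
def build_host_metrics(raw: dict) -> dict:
    """Build a flat document holding the proposed host.* fields.

    The keys of `raw` are hypothetical names for values a collector
    has already read from the operating system.
    """
    cores = raw["cores"]  # assumption: "norm" means divided by core count
    return {
        "host.cpu.system.norm.pct": raw["cpu_system_pct"] / cores,
        "host.cpu.user.norm.pct": raw["cpu_user_pct"] / cores,
        "host.fsstats.total_size.used": raw["fs_used_bytes"],
        "host.fsstats.total_size.total": raw["fs_total_bytes"],
        "host.fsstats.total_size.used.pct": raw["fs_used_bytes"] / raw["fs_total_bytes"],
        "host.load.norm.1": raw["load_1"] / cores,
        "host.load.norm.5": raw["load_5"] / cores,
        "host.load.norm.15": raw["load_15"] / cores,
        "host.memory.actual.used.bytes": raw["mem_used_bytes"],
        "host.memory.actual.used.pct": raw["mem_used_bytes"] / raw["mem_total_bytes"],
        "host.memory.total": raw["mem_total_bytes"],
        "host.network.egress.bytes": raw["net_out_bytes"],
        "host.network.ingress.bytes": raw["net_in_bytes"],
    }


# Made-up sample values for a 4-core host.
sample = {
    "cores": 4,
    "cpu_system_pct": 1.0,   # summed over all cores
    "cpu_user_pct": 2.0,     # summed over all cores
    "load_1": 2.0,
    "load_5": 1.0,
    "load_15": 0.5,
    "mem_used_bytes": 4 * 1024**3,
    "mem_total_bytes": 16 * 1024**3,
    "fs_used_bytes": 50 * 1024**3,
    "fs_total_bytes": 200 * 1024**3,
    "net_out_bytes": 123_456,
    "net_in_bytes": 654_321,
}
doc = build_host_metrics(sample)
```

The sketch uses flat dotted keys rather than nested dictionaries because the list contains both host.fsstats.total_size.used and host.fsstats.total_size.used.pct, which cannot both be represented in a nested structure.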

cgroup metrics were left out of the proposal by design and might be added later on. More details around cgroups can be found in the cgroup RFC.

Usage

These metrics can be used to give a quick overview of how a specific host is doing. Some examples:

  • An agent is running on a host and reports metrics about some services running on it. The host metrics are shipped in addition to show how the host is doing.
  • A user is looking at service metrics delivered by APM. These metrics are used to show how the host the service is running on is doing.

In the context of usage, it is also important to note what is NOT part of the fields by design:

  • Process metrics: per-process details are not included. To get them, detailed process collection must be enabled.
  • Cgroup metrics: cgroup metrics might follow at a later stage

Source data

This data comes from monitoring a host such as a Linux machine, a laptop, or a k8s node. It can be delivered through different shippers, such as Elastic Agent system metrics inputs, APM agents, the Prometheus node exporter, and other host metric collectors.

Scope of impact

Currently, Elastic Agent and Metricbeat ship host/system metrics under the system.* prefix. This proposal would change the prefix to host.*. One reason is that some network metrics already exist under host.* in ECS, so conflicts can be prevented. Another advantage is that some of these fields could use newer field types, such as gauge and counter, delivered by TSDB in Elasticsearch. Because the fields are new, this is possible without a breaking change.
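
To make the prefix change concrete, the following Python sketch renames a few fields from system.* to host.*. The source names on the left are assumptions modelled on the Metricbeat system module and should be checked against the actual shipper output. The table is deliberately partial, and the names `SYSTEM_TO_HOST` and `migrate_fields` are hypothetical.

```python
# Hypothetical, partial rename table. The system.* names are assumed,
# not taken from this RFC.
SYSTEM_TO_HOST = {
    "system.cpu.system.norm.pct": "host.cpu.system.norm.pct",
    "system.cpu.user.norm.pct": "host.cpu.user.norm.pct",
    "system.load.norm.1": "host.load.norm.1",
    "system.load.norm.5": "host.load.norm.5",
    "system.load.norm.15": "host.load.norm.15",
    "system.memory.actual.used.bytes": "host.memory.actual.used.bytes",
    "system.memory.actual.used.pct": "host.memory.actual.used.pct",
    "system.memory.total": "host.memory.total",
}


def migrate_fields(event: dict) -> dict:
    """Return a copy of a flat event with known system.* keys renamed.

    Keys that are not in the table are passed through unchanged.
    """
    return {SYSTEM_TO_HOST.get(key, key): value for key, value in event.items()}


old_event = {"system.load.norm.1": 0.5, "agent.name": "example"}
new_event = migrate_fields(old_event)
```

A lookup table is used here instead of a blanket prefix replacement so that only fields covered by the proposal are renamed and everything else stays where it is.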

Concerns

  • It still needs to be figured out how existing shippers will migrate to the new fields.
  • Not all metrics might be available on all operating systems. How will we deal with this limitation?
  • host.cpu.usage already exists. How do the new fields relate to it?

People

The following people were consulted on the contents of this RFC.

  • @ruflin | author
  • @andrewkroh | reviewer
  • @felixbarny | reviewer
  • @gizas | reviewer
  • @lalit-satapathy | reviewer
  • @neptunian | reviewer
  • @tommyers-elastic | reviewer

References

RFC Pull Requests