0037: Host metrics

Stage: 0 (strawman)
Date: 2023-03-01

Fields

The following high-level metrics should be available per host to indicate its health:

CPU used (in %) and load
Memory used (in %, used, total)
Disk usage (in %) and io -> summary
Network (traffic in / out)

This translates to the following metrics. The goal is to keep the number of fields as small as possible.

host.cpu.system.norm.pct
host.cpu.user.norm.pct
host.fsstats.total_size.used (in bytes)
host.fsstats.total_size.total (in bytes)
host.fsstats.total_size.used.pct
host.load.norm.1
host.load.norm.5
host.load.norm.15
host.memory.actual.used.bytes
host.memory.actual.used.pct
host.memory.total
host.network.egress.bytes
host.network.ingress.bytes
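
As a rough illustration of how a collector might populate these fields, the sketch below flattens one raw sample into the proposed names. The `RawSample` type and `to_host_fields` function are hypothetical and not part of this RFC. It also assumes that "norm" means "divided by the number of CPU cores" and that percentages are expressed as ratios between 0 and 1, which this RFC does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class RawSample:
    """Hypothetical raw readings from a collector; not defined by this RFC."""
    cores: int
    cpu_system: float      # system CPU time ratio summed over all cores
    cpu_user: float        # user CPU time ratio summed over all cores
    load: tuple            # (1m, 5m, 15m) load averages
    mem_used_bytes: int
    mem_total_bytes: int
    fs_used_bytes: int
    fs_total_bytes: int
    net_in_bytes: int
    net_out_bytes: int


def to_host_fields(s: RawSample) -> dict:
    """Flatten a raw sample into the proposed host.* field names."""
    load_1, load_5, load_15 = s.load
    return {
        # Assumption: "norm" values are divided by the number of cores.
        "host.cpu.system.norm.pct": s.cpu_system / s.cores,
        "host.cpu.user.norm.pct": s.cpu_user / s.cores,
        "host.fsstats.total_size.used": s.fs_used_bytes,
        "host.fsstats.total_size.total": s.fs_total_bytes,
        "host.fsstats.total_size.used.pct": s.fs_used_bytes / s.fs_total_bytes,
        "host.load.norm.1": load_1 / s.cores,
        "host.load.norm.5": load_5 / s.cores,
        "host.load.norm.15": load_15 / s.cores,
        "host.memory.actual.used.bytes": s.mem_used_bytes,
        "host.memory.actual.used.pct": s.mem_used_bytes / s.mem_total_bytes,
        "host.memory.total": s.mem_total_bytes,
        # Network values are passed through unchanged in this sketch.
        "host.network.egress.bytes": s.net_out_bytes,
        "host.network.ingress.bytes": s.net_in_bytes,
    }
```

The result is a flat dictionary with dotted keys. How these names are mapped into an index is outside the scope of this sketch.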

Cgroup metrics were left out of the proposal by design and might be added later. More details about cgroups can be found in the cgroup RFC.

Usage

These metrics can be used to give a quick overview of how a specific host is doing. Some examples:

An agent is running on a host and reports metrics about some services running on it. The host metrics are shipped alongside them to show how the host is doing.
A user is looking at service metrics delivered by APM. These metrics are used to show how the host the service is running on is doing.

In the context of usage, it is also important to note what is NOT part of the fields by design:

Process metrics: Detailed metrics about individual processes are not included. To get them, detailed process collection must be enabled.
Cgroup metrics: Cgroup metrics might follow at a later stage.

Source data

This data comes from monitoring a host such as a Linux machine, a laptop, or a k8s node. It can be delivered through different shippers, such as Elastic Agent system metrics inputs, APM agents, Prometheus Node Exporter, and other host metric collectors.

Scope of impact

Currently, Elastic Agent and Metricbeat ship host/system metrics under the system.* prefix. This proposal would change the prefix to host.*. One reason is that some network metrics already exist under the host.* prefix in ECS, so conflicts can be prevented. Another advantage is that some of these fields might use newer field types, such as gauge and counter, delivered by TSDB in Elasticsearch, which is possible without a breaking change.
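
To make the prefix change concrete, the following sketch renames keys in a flat metrics document. The `SYSTEM_TO_HOST` table and the `migrate` function are hypothetical. This RFC does not define a field-by-field mapping, so the entries below assume, purely for illustration, that a field keeps its path and only the prefix changes.

```python
# Hypothetical mapping from current system.* names to proposed host.* names.
# Assumption: the path after the prefix stays the same for these fields.
SYSTEM_TO_HOST = {
    "system.cpu.system.norm.pct": "host.cpu.system.norm.pct",
    "system.cpu.user.norm.pct": "host.cpu.user.norm.pct",
    "system.load.norm.1": "host.load.norm.1",
    "system.load.norm.5": "host.load.norm.5",
    "system.load.norm.15": "host.load.norm.15",
    "system.memory.actual.used.bytes": "host.memory.actual.used.bytes",
    "system.memory.actual.used.pct": "host.memory.actual.used.pct",
    "system.memory.total": "host.memory.total",
}


def migrate(doc: dict) -> dict:
    """Return a copy of a flat document with known system.* keys renamed.

    Keys that are not in the mapping are kept unchanged.
    """
    return {SYSTEM_TO_HOST.get(key, key): value for key, value in doc.items()}
```

An explicit table is used instead of a blanket prefix swap so that fields outside the proposal are left untouched.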

Concerns

It still needs to be figured out how to migrate the existing shippers to the new fields.
Not all metrics might be available on all operating systems. How will we deal with this limitation?
host.cpu.usage already exists. How do the new fields relate to it?

People

The following are the people that consulted on the contents of this RFC.

@ruflin | author
@andrewkroh | reviewer
@felixbarny | reviewer
@gizas | reviewer
@lalit-satapathy | reviewer
@neptunian | reviewer
@tommyers-elastic | reviewer

References

Schema for metrics in ECS
Otel host metrics
ECS cgroup rfc
Prometheus Node Exporter
APM System metrics fields
APM Agent system metrics fields
APM addition of Cgroup metrics
Host metrics used in Inventory view of Kibana (related queries)

RFC Pull Requests

Stage 0: #2129
