[Internal Documentation] Data flow of CPU metrics presented in Fleet UI (#4005)

* WIP
* More documentation
* Fixing filename
* Answering question about values shown in /proc/{pid}/stats
* More info
* Another level of comparison
* Remove Unanswered Questions section
* Rename heading
* Reorg
* Document last stage + observations + suggestions
* Clarify suggestion
* Add suggestion for breaking down by Beat input type+output combination
* Add suggestion for not using 5-minute average
* Reworking suggestions
* Add suggestion for counting monitoring components' contributions to CPU usage
* Add links to issues
1 parent: 158fd47; commit: 17f0480
Showing 1 changed file with 382 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,382 @@ | ||
# How Agent CPU metrics in Fleet are calculated | ||
|
||
## Data Flow | ||
|
||
```mermaid | ||
flowchart TD | ||
K[Fleet UI] -- reads from --> E[Elasticsearch] | ||
M[Metricbeat `http/json` metricset] -- writes to --> E | ||
M -- reads from --> A[Agent `/stats` endpoint] | ||
M -- reads from --> B[*Beat `/stats` endpoint] | ||
A -- reads from --> L1[`elastic-agent-system-metrics` report.SetupMetrics] | ||
L1 -- reads from --> H[Host system] | ||
B -- reads from --> L2[`elastic-agent-system-metrics` report.SetupMetrics] | ||
L2 -- reads from --> H[Host system] | ||
``` | ||
|
||
### Fleet UI reading from Elasticsearch | ||
|
||
The Fleet UI code makes the following query to the `metrics-elastic_agent.*` indices in Elasticsearch. Only CPU-related | ||
aggregations are shown; memory-related aggregations are omitted. | ||
|
||
```json | ||
{ | ||
"size": 0, | ||
"query": { | ||
"bool": { | ||
"must": [ | ||
{ | ||
"terms": { | ||
"_tier": [ "data_hot" ] | ||
} | ||
}, | ||
{ | ||
"range": { | ||
"@timestamp": { | ||
"gte": "now-5m" | ||
} | ||
} | ||
}, | ||
{ | ||
"terms": { | ||
"elastic_agent.id": [ agentIds ] | ||
} | ||
}, | ||
{ | ||
"bool": { | ||
"filter": [ | ||
{ | ||
"bool": { | ||
"should": [ | ||
{ | ||
"term": { | ||
"data_stream.dataset": "elastic_agent.elastic_agent" | ||
} | ||
} | ||
] | ||
} | ||
} | ||
] | ||
} | ||
} | ||
] | ||
} | ||
}, | ||
"aggs": { | ||
"agents": { | ||
"terms": { | ||
"field": "elastic_agent.id", | ||
"size": 1000 | ||
}, | ||
"aggs": { | ||
"sum_cpu": { | ||
"sum_bucket": { | ||
"buckets_path": "processes>avg_cpu" | ||
} | ||
}, | ||
"processes": { | ||
"terms": { | ||
"field": "elastic_agent.process", | ||
"size": 1000, | ||
"order": { | ||
"_count": "desc" | ||
} | ||
}, | ||
"aggs": { | ||
"avg_cpu": { | ||
"avg_bucket": { | ||
"buckets_path": "cpu_time_series>cpu" | ||
} | ||
}, | ||
"cpu_time_series": { | ||
"date_histogram": { | ||
"field": "@timestamp", | ||
"calendar_interval": "minute" | ||
}, | ||
"aggs": { | ||
"max_cpu": { | ||
"max": { | ||
"field": "system.process.cpu.total.value" | ||
} | ||
}, | ||
"cpu_derivative": { | ||
"derivative": { | ||
"buckets_path": "max_cpu", | ||
"gap_policy": "skip", | ||
"unit": "10s" | ||
} | ||
}, | ||
"cpu": { | ||
"bucket_script": { | ||
"buckets_path": { | ||
"cpu_total": "cpu_derivative[normalized_value]" | ||
}, | ||
"script": { | ||
"source": "if (params.cpu_total > 0) { return params.cpu_total / params._interval }", | ||
"lang": "painless", | ||
"params": { | ||
"_interval": 10000 | ||
} | ||
}, | ||
"gap_policy": "skip" | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
|
||
|
||
### Agent and Beats collect CPU metrics using `elastic-agent-system-metrics` | ||
|
||
At startup, both Agent and Beats call the [`SetupMetrics` function](https://github.com/elastic/elastic-agent-system-metrics/blob/085e4529f3c4f91dd377cadbbe7a2bf321989438/report/setup.go#L49) | ||
from the `github.com/elastic/elastic-agent-system-metrics/report` package. This function registers a callback with the | ||
monitoring registry. Whenever this callback is invoked, it calculates and reports CPU and other metrics for the process in | ||
question. The calculation of CPU (and other) metrics depends on the OS the Agent or Beat process is running on. | ||
|
||
#### Collecting CPU usage metrics on Linux | ||
|
||
On Linux, CPU usage metrics are collected by [reading the `/proc/$PID/stat` file](https://github.com/elastic/elastic-agent-system-metrics/blob/085e4529f3c4f91dd377cadbbe7a2bf321989438/metric/system/process/process_linux_common.go#L351). | ||
This file contains whitespace-delimited values (fields) for various process metrics and other information. The field at | ||
index 13 (0-based indexing) is the number of CPU ticks utilized by the process in user-space since it was started. The | ||
field at index 14 is the number of CPU ticks utilized by the process in kernel-space since it was started. As such, both | ||
fields contain counter metrics. Both fields show the total number of CPU ticks consumed by the process across all available | ||
cores (as opposed to showing normalized, per-core, values; [proof](https://gist.github.com/ycombinator/d55d884ec979fb86360a00b57f807de3)). | ||
|
||
We want to convert these tick values into milliseconds so it becomes easier to figure out what percentage of CPU was | ||
utilized by the process over a given period of time. For example, if the process utilized 120 milliseconds of CPU time | ||
over a period of 5 minutes, that would be a CPU utilization of 120 / (5 * 60 * 1000) = 0.0004 = 0.04%. | ||
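
The arithmetic above can be checked with a small helper. This is only a sketch: the function name `cpu_percent` is made up for illustration and is not part of Agent, Beats, or `elastic-agent-system-metrics`.

```shell
#!/bin/bash

# cpu_percent CPU_MS WINDOW_MS
# Prints the CPU utilization, in percent, of a process that consumed CPU_MS
# milliseconds of CPU time during a wall-clock window of WINDOW_MS milliseconds.
# (Hypothetical helper, for illustration only.)
cpu_percent() {
  awk -v used="$1" -v window="$2" 'BEGIN { printf "%.2f\n", used / window * 100 }'
}

# 120 ms of CPU time consumed over a 5-minute window.
cpu_percent 120 $((5 * 60 * 1000))   # prints 0.04
```

Because the CPU time is summed across all cores, the result can exceed 100 on a multi-core machine.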
|
||
On a typical Linux host, there are 100 ticks per second. The actual value can be checked by running `getconf CLK_TCK`. | ||
Therefore, if a process utilized T ticks, say in user-space, we can say that the process utilized (T / 100) seconds of | ||
CPU time == (T / 100) * 1000 milliseconds of CPU time. We do this [conversion](https://github.com/elastic/elastic-agent-system-metrics/blob/085e4529f3c4f91dd377cadbbe7a2bf321989438/metric/system/process/process_linux_common.go#L374-L375) | ||
from ticks to milliseconds for both user-space and kernel-space CPU utilization. | ||
|
||
Finally, we [sum up](https://github.com/elastic/elastic-agent-system-metrics/blob/085e4529f3c4f91dd377cadbbe7a2bf321989438/metric/system/process/process_linux_common.go#L376) | ||
the user-space and kernel-space CPU utilization (which is now in milliseconds) to arrive at the total CPU utilization. | ||
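
The conversion described above can be sketched in a few lines of shell. This is an illustration, not the Go code linked above: the function name `stat_cpu_ms` and the sample `stat` line are made up, and the ticks-per-second value is passed explicitly in the example (it defaults to `getconf CLK_TCK` when omitted).

```shell
#!/bin/bash

# stat_cpu_ms STAT_LINE [CLK_TCK]
# Prints the total (user-space + kernel-space) CPU time, in milliseconds, for
# the contents of a /proc/$PID/stat file. CLK_TCK defaults to the value
# reported by `getconf CLK_TCK`. (Hypothetical helper, for illustration only.)
stat_cpu_ms() {
  local line=$1
  local clk_tck=${2:-$(getconf CLK_TCK)}
  # The second field (the command name, in parentheses) may itself contain
  # spaces, so drop everything up to the last ") " before splitting.
  local rest="${line##*) }"
  local -a fields
  read -r -a fields <<< "$rest"
  # fields[0] is the process state (index 2 in the file), so the user-space
  # ticks (index 13) are at fields[11] and the kernel-space ticks (index 14)
  # are at fields[12].
  local user_ticks=${fields[11]}
  local system_ticks=${fields[12]}
  echo $(( (user_ticks + system_ticks) * 1000 / clk_tck ))
}

# Made-up sample: 250 user-space ticks and 150 kernel-space ticks.
SAMPLE='4242 (my beat) S 1 4242 4242 0 -1 4194560 1500 0 3 0 250 150 0 0 20 0 12 0'
stat_cpu_ms "$SAMPLE" 100   # prints 4000
```

Against a live process, the same helper could be called as `stat_cpu_ms "$(cat /proc/$PID/stat)"`.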
|
||
##### Comparison between metrics in `top` output and in `/proc/$PID/stat` file | ||
|
||
The `%CPU` reported for a process in `top` or `htop` output is in the range `[0, n*100]`, where `n` is the number of cores | ||
available on the machine. For example, if a process runs two threads on a two-core machine, with each thread utilizing | ||
about 60% of each core, `top` or `htop` will report `%CPU` as `120.0` (or close to it). | ||
|
||
Applying the calculations from the previous section to the corresponding values in the process's `/proc/$PID/stat` file | ||
yields results that match what `top` or `htop` report. | ||
|
||
### Metricbeat collects CPU metrics for Agent and the Beats it manages | ||
|
||
There is one input in particular in the Agent policy that ultimately generates the data for the above ES query made by | ||
the Fleet UI. This input is of type `http/metrics`, uses the `monitoring` output, and has `id` = `metrics-monitoring-agent`. | ||
A Metricbeat process is spawned for this input. | ||
|
||
There are multiple inputs in the Metricbeat configuration that generate the data for the ES query. | ||
* One input is for generating data for the Agent itself. This input will have `namespace` = `agent` and | ||
`id` = `metrics-monitoring-agent`. | ||
* The remaining inputs will generate data for the various Beats managed by Agent. The number of inputs depends on the | ||
number of Beats. These inputs will have `namespace` = `agent` and `id` = `metrics-monitoring-*beat-$n`, where `$n` | ||
is the 1-based index of the Beat. | ||
|
||
All these inputs run the `http` Metricbeat module, `json` metricset, and poll the `$hostname/stats` endpoint every minute, | ||
where `$hostname` is either the TCP address or unix socket path of the Agent's HTTP API or the Beats' HTTP APIs. Each | ||
input has a `copy_fields` processor that copies the value of the `http.agent.beat.cpu` field to the `system.process.cpu` field. | ||
|
||
Since the ES query aggregates on the `system.process.cpu.total.value` field, the corresponding field in the | ||
`$hostname/stats` API response that we're interested in is `.beat.cpu.total.value`. The `.beat.cpu.total.value` returns | ||
a counter value representing the total (user-space + kernel-space) duration, in milliseconds, spent by the Agent or Beat | ||
utilizing the CPU since the process was started. How this duration is calculated is described in the section on `elastic-agent-system-metrics` above. | ||
|
||
#### Comparison between metrics in `/proc/$PID/stat` file and in Agent + Beats `/stats` API output | ||
|
||
Using the following script for a host running Agent and one child Beat (excluding any monitoring Beats), we can see that | ||
the metrics in the `/proc/$PID/stat` file match up with those in the Agent + Beats `/stats` API output. | ||
|
||
```shell | ||
#!/bin/bash | ||
|
||
BEAT_CPU_MS_TOTAL=$(sudo curl -s -X GET --unix-socket '/opt/Elastic/Agent/data/tmp/PGwsYWcynGUYZEjD872Gs-npqbv-30jS.sock' 'http:/f/stats' | jq '.beat.cpu.total.value') | ||
AGENT_CPU_MS_TOTAL=$(sudo curl -s http://localhost:6791/stats | jq '.beat.cpu.total.value') | ||
|
||
echo "Stats from API outputs: $(($BEAT_CPU_MS_TOTAL + $AGENT_CPU_MS_TOTAL))"; | ||
|
||
AGENT_PID=403165 | ||
|
||
AGENT_USER_TICKS=$(cat /proc/$AGENT_PID/stat | cut -d' ' -f14) | ||
AGENT_SYSTEM_TICKS=$(cat /proc/$AGENT_PID/stat | cut -d' ' -f15) | ||
AGENT_TOTAL_TICKS=$(($AGENT_USER_TICKS + $AGENT_SYSTEM_TICKS)) | ||
AGENT_TOTAL_MS=$(($AGENT_TOTAL_TICKS * 1000 / 100)) | ||
|
||
#echo "Agent total ticks: $AGENT_TOTAL_TICKS" | ||
#echo "Agent total ms: $AGENT_TOTAL_MS" | ||
|
||
BEAT_PID=431834 | ||
|
||
BEAT_USER_TICKS=$(cat /proc/$BEAT_PID/stat | cut -d' ' -f14) | ||
BEAT_SYSTEM_TICKS=$(cat /proc/$BEAT_PID/stat | cut -d' ' -f15) | ||
BEAT_TOTAL_TICKS=$(($BEAT_USER_TICKS + $BEAT_SYSTEM_TICKS)) | ||
BEAT_TOTAL_MS=$(($BEAT_TOTAL_TICKS * 1000 / 100)) | ||
|
||
#echo "Beat total ticks: $BEAT_TOTAL_TICKS" | ||
#echo "Beat total ms: $BEAT_TOTAL_MS" | ||
|
||
echo "Stats from /proc/PID/stat files: $(($AGENT_TOTAL_MS + $BEAT_TOTAL_MS))" | ||
``` | ||
|
||
### Comparison between metrics in Agent + Beats `/stats` API output and in Elasticsearch `metrics-elastic_agent*` indices | ||
|
||
This comparison is relatively easy to make. | ||
|
||
First, we call the `/stats` APIs on the machine where Agent and its Beats are running. For example, with an Agent running | ||
one Beat (excluding monitoring Beats): | ||
|
||
```shell | ||
$ sudo curl -s http://localhost:6791/stats | jq '.beat.cpu.total.value' | ||
34810 | ||
$ sudo curl -s -X GET --unix-socket '/opt/Elastic/Agent/data/tmp/PGwsYWcynGUYZEjD872Gs-npqbv-30jS.sock' 'http:/f/stats' | jq '.beat.cpu.total.value' | ||
795000 | ||
``` | ||
|
||
Then we call the Elasticsearch `_search` API on `metrics-elastic_agent*` indices, keeping the query filters the same as | ||
the query being done by Fleet UI, but only considering the latest documents for each `elastic_agent.process`, since the | ||
CPU utilization metrics are counter metrics. | ||
|
||
```shell | ||
curl -s -u $ES_USER:$ES_PASS -H 'Content-Type: application/json' 'https://test-cpu.es.us-central1.gcp.cloud.es.io:9243/metrics-elastic_agent*/_search' -d '{ | ||
"sort": [ | ||
{ | ||
"@timestamp": { | ||
"order": "desc" | ||
} | ||
} | ||
], | ||
"collapse": { | ||
"field": "elastic_agent.process" | ||
}, | ||
"_source": [ | ||
"@timestamp", | ||
"system.process.cpu.total.value" | ||
], | ||
"query": { | ||
"bool": { | ||
"must": [ | ||
{ | ||
"terms": { | ||
"_tier": [ | ||
"data_hot" | ||
] | ||
} | ||
}, | ||
{ | ||
"range": { | ||
"@timestamp": { | ||
"gte": "now-5m" | ||
} | ||
} | ||
}, | ||
{ | ||
"terms": { | ||
"elastic_agent.id": [ | ||
"62efabf2-21ec-4764-b5b5-32f7c6ce509b" | ||
] | ||
} | ||
}, | ||
{ | ||
"bool": { | ||
"filter": [ | ||
{ | ||
"bool": { | ||
"should": [ | ||
{ | ||
"term": { | ||
"data_stream.dataset": "elastic_agent.elastic_agent" | ||
} | ||
} | ||
] | ||
} | ||
} | ||
] | ||
} | ||
} | ||
] | ||
} | ||
} | ||
}' | jq -r '.hits.hits[]._source | [ ."@timestamp", .system.process.cpu.total.value ] | @tsv' | ||
``` | ||
``` | ||
2024-01-05T23:21:59.164Z 794990 | ||
2024-01-05T23:21:59.164Z 34810 | ||
``` | ||
|
||
We can see that the metrics in the Agent + Beats `/stats` API outputs match up with those in the Elasticsearch | ||
`metrics-elastic_agent*` indices (the 10 ms difference for the Beat reflects the counter advancing between the two readings). | ||
|
||
### Comparison between data in `metrics-elastic_agent*` indices and data shown in Fleet UI | ||
|
||
Given the findings in the previous sections, and looking at the Elasticsearch query, it follows that the CPU usage | ||
metrics shown in the Fleet UI for every Agent are computed as follows from the data in the `metrics-elastic_agent*` indices: | ||
|
||
1. For the Agent process and for every Beat process that's managed by Agent: | ||
    1. CPU utilization (in milliseconds; stored in the `system.process.cpu.total.value` field) is considered over a 5-minute period. | ||
    2. This 5-minute period is broken down into 1-minute buckets. This is the `cpu_time_series` aggregation in the Elasticsearch | ||
       query. | ||
    3. Recall that the values stored in the `system.process.cpu.total.value` field are counter values, meaning they are expected | ||
       to be non-decreasing over time. So to consider a single CPU utilization value for each 1-minute bucket, we take the | ||
       `max(system.process.cpu.total.value)`. This is the `max_cpu` aggregation in the Elasticsearch query. | ||
    4. Given that the values stored in the `system.process.cpu.total.value` field are counter values, to get CPU utilization | ||
       (in milliseconds) for each 1-minute period, we subtract the `max(system.process.cpu.total.value)` of the previous | ||
       1-minute bucket from that of the current 1-minute bucket. Further, we normalize this difference to a 10-second unit | ||
       (essentially dividing the 1-minute difference by 6). This is all done by the `cpu_derivative` aggregation in the | ||
       Elasticsearch query. | ||
    5. This per-10-second CPU utilization (in milliseconds) is divided by `10 * 1000` (= 10 seconds expressed as milliseconds) | ||
       to get the CPU utilization as a fraction of the 10 elapsed seconds, where 1.0 corresponds to 100% of one core. This is | ||
       the `cpu` aggregation in the Elasticsearch query. | ||
    6. So now we end up with five CPU utilization (in %) values, one for each minute. | ||
    7. These are averaged to arrive at the CPU utilization (in %) over five minutes. This is the `avg_cpu` aggregation in | ||
       the Elasticsearch query. | ||
2. The 5-minute average CPU utilization (in %) for each of the processes is summed up to arrive at a total CPU utilization | ||
   (in %) for the Agent process and all Beat processes managed by Agent. This is the `sum_cpu` aggregation in the Elasticsearch | ||
   query. The result is the total CPU utilization (in %) by Agent and its child Beat processes across all available cores | ||
   (as opposed to being normalized per-core). For example, if a machine has 8 cores, the resulting value will be in the range of | ||
   0% to 800%. | ||
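
The steps above can be emulated outside Elasticsearch with a short `awk` program. This is a sketch for building intuition, not the Kibana implementation: the function name `fleet_cpu`, the input format (one `PROCESS COUNTER_MS` line per 1-minute bucket, oldest first, where `COUNTER_MS` is that bucket's maximum counter value) and the sample numbers are all assumptions made for illustration. Non-positive differences are skipped, mirroring the `if (params.cpu_total > 0)` check in the query's script.

```shell
#!/bin/bash

# fleet_cpu: reads "PROCESS COUNTER_MS" lines from stdin and prints the sum,
# over all processes, of each process's average CPU utilization, expressed as
# a fraction of one core (1.0 == 100%).
# (Hypothetical helper, for illustration only.)
fleet_cpu() {
  awk '
    {
      proc = $1; counter = $2
      if (proc in prev) {
        per_minute = counter - prev[proc]   # cpu_derivative: current - previous bucket
        per_10s = per_minute / 6            # normalized_value for unit "10s"
        if (per_10s > 0) {                  # cpu: milliseconds used / 10000 ms elapsed
          sum[proc] += per_10s / 10000
          n[proc]++
        }
      }
      prev[proc] = counter                  # max_cpu of the bucket just processed
    }
    END {
      total = 0
      for (p in sum) total += sum[p] / n[p] # avg_cpu per process, then sum_cpu
      printf "%.4f\n", total
    }
  '
}

# Made-up counters for one Agent process and one Beat process.
fleet_cpu <<'EOF'
elastic-agent 34000
elastic-agent 34600
elastic-agent 35200
filebeat 700000
filebeat 736000
filebeat 760000
EOF
# prints 0.5100 (0.01 for elastic-agent + 0.50 for filebeat)
```

In this example the Beat averages half a core and the Agent one hundredth of a core, so the combined value is 0.51, that is, 51%.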
|
||
## Observations | ||
|
||
* The CPU utilization (in %) of Agent and each of the Beat processes can vary wildly. Generally speaking, the Agent process | ||
itself does not utilize much CPU, whereas how much CPU each Beat process utilizes depends on the type of computations it | ||
performs on the data. | ||
|
||
* Also, CPU utilization is rarely constant. If the output of `top` or `htop` for Agent and Beat processes is observed over | ||
time, the CPU utilization % shown varies for each process over time. | ||
|
||
* To relate the output seen in `top` or `htop` for Agent and Beat processes with the single value shown in the Fleet UI, | ||
one must observe the values in the `top` / `htop` output over five minutes, arrive at an average CPU utilization (in %) | ||
value for each process, and sum it up. The resulting value will roughly be equal to the single value shown for that Agent | ||
in the Agent Listing page in the Fleet UI. | ||
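
The manual procedure in the last observation can be sketched as follows. The input format is an assumption: one `PID %CPU` pair per line, sampled repeatedly over five minutes (how the pairs are extracted from `top` / `htop` output is not shown here), and the function name `sum_of_averages` is made up for illustration.

```shell
#!/bin/bash

# sum_of_averages: reads "PID PERCENT_CPU" samples from stdin, averages the
# samples of each PID, and prints the sum of those per-process averages.
# (Hypothetical helper, for illustration only.)
sum_of_averages() {
  awk '
    { total[$1] += $2; count[$1]++ }
    END {
      result = 0
      for (pid in total) result += total[pid] / count[pid]
      printf "%.1f\n", result
    }
  '
}

# Made-up samples for an Agent process (403165) and a Beat process (431834).
sum_of_averages <<'EOF'
403165 1.0
431834 50.0
403165 3.0
431834 70.0
EOF
# prints 62.0 (average of 2.0 for the Agent + average of 60.0 for the Beat)
```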
|
||
## Suggested improvements | ||
|
||
* The `processes` aggregation in the Elasticsearch query should use the field `component.id` instead of `elastic_agent.process`. | ||
This is to correctly account for multiple instances of the same type of Beat (e.g. Filebeat). This can happen if there | ||
are multiple outputs defined in the Agent policy and some inputs of a type (e.g. log) use one output while other inputs | ||
of the same type use another output: https://github.com/elastic/kibana/issues/174458. | ||
|
||
* We should reconsider taking a 5-minute average in the Elasticsearch query made by the Fleet UI and instead take a | ||
30-second or 1-minute average (making corresponding adjustments to the `calendar_interval` value in the `cpu_time_series` | ||
aggregation). This would result in a value closer to what's observed in `top` / `htop` output: https://github.com/elastic/kibana/issues/174799. | ||
|
||
* We should link the value shown in the Fleet UI to a chart that breaks it down for that Agent by `component.id` over time, | ||
so the user can see the CPU utilization per Agent component process, over time: https://github.com/elastic/kibana/issues/174800. | ||
|
||
* The tooltip shown with the "i" in the CPU column should explain that the value is the sum of current CPU utilization (in %) | ||
of all Agent component processes, ranging from 0 to (number of cores * 100): https://github.com/elastic/kibana/issues/174801. | ||
|
||
* We should enhance collection and aggregation to include CPU utilization for Agent components managed by the service | ||
runtime (e.g. Endpoint) as well, not just Agent components managed by the command runtime (e.g. Beats) as we do today: | ||
https://github.com/elastic/elastic-agent/issues/4083. | ||
|
||
* We should enhance collection to include CPU utilization for Agent monitoring components so their contributions are also | ||
counted: https://github.com/elastic/elastic-agent/issues/4082. |