[Internal Documentation] Data flow of CPU metrics presented in Fleet UI (#4005)

* WIP

* More documentation

* Fixing filename

* Answering question about values shown in /proc/{pid}/stats

* More info

* Another level of comparison

* Remove Unanswered Questions section

* Rename heading

* Reorg

* Document last stage + observations + suggestions

* Clarify suggestion

* Add suggestion for breaking down by Beat input type+output combination

* Add suggestion for not using 5-minute average

* Reworking suggestions

* Add suggestion for counting monitoring components' contributions to CPU usage

* Add links to issues
ycombinator authored Jan 15, 2024
1 parent 158fd47 commit 17f0480
Showing 1 changed file with 382 additions and 0 deletions.
docs/cpu-metrics-in-fleet.md (new file)
# How Agent CPU metrics in Fleet are calculated

## Data Flow

```mermaid
flowchart TD
K[Fleet UI] -- reads from --> E[Elasticsearch]
M[Metricbeat `http/json` metricset] -- writes to --> E
M -- reads from --> A[Agent `/stats` endpoint]
M -- reads from --> B[*Beat `/stats` endpoint]
A -- reads from --> L1[`elastic-agent-system-metrics` report.SetupMetrics]
L1 -- reads from --> H[Host system]
B -- reads from --> L2[`elastic-agent-system-metrics` report.SetupMetrics]
L2 -- reads from --> H[Host system]
```

### Fleet UI reading from Elasticsearch

The Fleet UI code makes the following query to the `metrics-elastic_agent.*` indices in Elasticsearch. Only CPU-related
aggregations are shown; memory-related aggregations are omitted.

```json
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "_tier": [ "data_hot" ]
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m"
            }
          }
        },
        {
          "terms": {
            "elastic_agent.id": [ agentIds ]
          }
        },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "should": [
                    {
                      "term": {
                        "data_stream.dataset": "elastic_agent.elastic_agent"
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "agents": {
      "terms": {
        "field": "elastic_agent.id",
        "size": 1000
      },
      "aggs": {
        "sum_cpu": {
          "sum_bucket": {
            "buckets_path": "processes>avg_cpu"
          }
        },
        "processes": {
          "terms": {
            "field": "elastic_agent.process",
            "size": 1000,
            "order": {
              "_count": "desc"
            }
          },
          "aggs": {
            "avg_cpu": {
              "avg_bucket": {
                "buckets_path": "cpu_time_series>cpu"
              }
            },
            "cpu_time_series": {
              "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "minute"
              },
              "aggs": {
                "max_cpu": {
                  "max": {
                    "field": "system.process.cpu.total.value"
                  }
                },
                "cpu_derivative": {
                  "derivative": {
                    "buckets_path": "max_cpu",
                    "gap_policy": "skip",
                    "unit": "10s"
                  }
                },
                "cpu": {
                  "bucket_script": {
                    "buckets_path": {
                      "cpu_total": "cpu_derivative[normalized_value]"
                    },
                    "script": {
                      "source": "if (params.cpu_total > 0) { return params.cpu_total / params._interval }",
                      "lang": "painless",
                      "params": {
                        "_interval": 10000
                      }
                    },
                    "gap_policy": "skip"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```


### Agent and Beats collect CPU metrics using `elastic-agent-system-metrics`

At startup, both Agent and Beats call the [`SetupMetrics` function](https://github.com/elastic/elastic-agent-system-metrics/blob/085e4529f3c4f91dd377cadbbe7a2bf321989438/report/setup.go#L49)
from the `github.com/elastic/elastic-agent-system-metrics/report` package. This function registers a reporting callback with the
monitoring registry. Whenever that callback is invoked, it calculates and reports CPU (and other) metrics for the process in
question. How these metrics are calculated depends on the OS the Agent or Beat process is running on.

#### Collecting CPU usage metrics on Linux

On Linux, CPU usage metrics are collected by [reading the `/proc/$PID/stat` file](https://github.com/elastic/elastic-agent-system-metrics/blob/085e4529f3c4f91dd377cadbbe7a2bf321989438/metric/system/process/process_linux_common.go#L351).
This file contains whitespace-delimited values (fields) for various process metrics and other information. The field at
index 13 (0-based indexing) is the number of CPU ticks utilized by the process in user-space since it was started. The
field at index 14 is the number of CPU ticks utilized by the process in kernel-space since it was started. As such, both
fields contain counter metrics. Both fields show the total number of CPU ticks consumed by the process across all available
cores (as opposed to showing normalized, per-core, values; [proof](https://gist.github.com/ycombinator/d55d884ec979fb86360a00b57f807de3)).
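To see these two raw counters for a process, something like the following can be run on a Linux host (the current shell's PID
is used purely as an example; the simple whitespace split assumes the process name in field 2 contains no spaces):

```shell
# Print the user-space (field 14) and kernel-space (field 15) CPU tick counters
# for the current shell. awk fields are 1-based, so $14 corresponds to 0-based index 13.
awk '{ printf "utime=%s ticks, stime=%s ticks\n", $14, $15 }' /proc/$$/stat
```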

We want to convert these tick values into milliseconds so it becomes easier to figure out what percentage of CPU was
utilized by the process over a given period of time. For example, if the process utilized 120 milliseconds of CPU time
over a period of 5 minutes, that would be a CPU utilization of 120 / (5 * 60 * 1000) = 0.0004 = 0.04%.

On a typical Linux host, there are 100 ticks per second. The actual value can be checked by running `getconf CLK_TCK`.
Therefore, if a process utilized T ticks, say in user-space, we can say that the process utilized (T / 100) seconds of
CPU time == (T / 100) * 1000 milliseconds of CPU time. We do this [conversion](https://github.com/elastic/elastic-agent-system-metrics/blob/085e4529f3c4f91dd377cadbbe7a2bf321989438/metric/system/process/process_linux_common.go#L374-L375)
from ticks to milliseconds for both user-space and kernel-space CPU utilization.

Finally, we [sum up](https://github.com/elastic/elastic-agent-system-metrics/blob/085e4529f3c4f91dd377cadbbe7a2bf321989438/metric/system/process/process_linux_common.go#L376)
the user-space and kernel-space CPU utilization (which is now in milliseconds) to arrive at the total CPU utilization.
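As a minimal sketch of this conversion (the PID below is hypothetical; `getconf CLK_TCK` is used instead of hard-coding 100
ticks per second):

```shell
PID=12345                          # hypothetical Agent or Beat PID
TICKS_PER_SEC=$(getconf CLK_TCK)   # typically 100

USER_TICKS=$(awk '{ print $14 }' /proc/$PID/stat)
SYSTEM_TICKS=$(awk '{ print $15 }' /proc/$PID/stat)

# (user + system) ticks converted to milliseconds of CPU time since process start
TOTAL_MS=$(( (USER_TICKS + SYSTEM_TICKS) * 1000 / TICKS_PER_SEC ))
echo "total CPU time: ${TOTAL_MS} ms"
```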

##### Comparison between metrics in `top` output and in `/proc/$PID/stat` file

The `%CPU` reported for a process in `top` or `htop` output is in the range `[0, n*100]`, where `n` is the number of cores
available on the machine. For example, if a process runs two threads on a two-core machine, with each thread utilizing
about 60% of each core, `top` or `htop` will report `%CPU` as `120.0` (or close to it).

Applying the calculations from the previous section to corresponding values in the process's `/proc/{pid}/stat` file,
the results match up with what `top` or `htop` report.
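For example, a rough way to approximate the `%CPU` figure that `top` reports is to sample the file twice and divide the CPU
ticks consumed by the elapsed ticks (the PID and the 5-second sampling interval below are arbitrary illustrative choices):

```shell
PID=12345                          # hypothetical process ID
INTERVAL=5                         # sampling interval in seconds
TICKS_PER_SEC=$(getconf CLK_TCK)

T1=$(awk '{ print $14 + $15 }' /proc/$PID/stat)
sleep $INTERVAL
T2=$(awk '{ print $14 + $15 }' /proc/$PID/stat)

# Percentage of a single core used over the interval; a multi-threaded process can exceed 100.
echo "scale=1; ($T2 - $T1) * 100 / ($INTERVAL * $TICKS_PER_SEC)" | bc
```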

### Metricbeat collects CPU metrics for Agent and the Beats it manages

There is one input in particular in the Agent policy that ultimately generates the data for the above ES query made by
the Fleet UI. This input is of type `http/metrics`, uses the `monitoring` output, and has `id` = `metrics-monitoring-agent`.
A Metricbeat process is spawned for this input.

There are multiple inputs in the Metricbeat configuration that generate the data for the ES query.
* One input is for generating data for the Agent itself. This input will have `namespace` = `agent` and
`id` = `metrics-monitoring-agent`.
* The remaining inputs will generate data for the various Beats managed by Agent. The number of inputs depends on the
number of Beats. These inputs will have `namespace` = `agent` and `id` = `metrics-monitoring-*beat-$n`, where `$n`
is the 1-based index of the Beat.

All these inputs run the `http` Metricbeat module with the `json` metricset, polling the `$hostname/stats` endpoint every minute,
where `$hostname` is either the TCP address or Unix socket path of the Agent's or the Beats' HTTP API. Each
input has a `copy_fields` processor that copies the value of the `http.agent.beat.cpu` field to the `system.process.cpu` field.

Since the ES query aggregates on the `system.process.cpu.total.value` field, the corresponding field in the
`$hostname/stats` API response that we're interested in is `.beat.cpu.total.value`. The `.beat.cpu.total.value` field contains
a counter value representing the total (user-space + kernel-space) duration, in milliseconds, spent by the Agent or Beat
utilizing the CPU since the process was started. How this duration is calculated is described in the earlier section on
collecting CPU usage metrics on Linux.

#### Comparison between metrics in `/proc/$PID/stat` file and in Agent + Beats `/stats` API output

Using the following script for a host running Agent and one child Beat (excluding any monitoring Beats), we can see that
the metrics in the `/proc/$PID/stat` file match up with those in the Agent + Beats `/stats` API output.

```shell
#!/bin/bash

# Total CPU time (in ms) reported by the child Beat's /stats API (over its Unix socket) and by the Agent's /stats API
BEAT_CPU_MS_TOTAL=$(sudo curl -s -X GET --unix-socket '/opt/Elastic/Agent/data/tmp/PGwsYWcynGUYZEjD872Gs-npqbv-30jS.sock' 'http:/f/stats' | jq '.beat.cpu.total.value')
AGENT_CPU_MS_TOTAL=$(sudo curl -s http://localhost:6791/stats | jq '.beat.cpu.total.value')

echo "Stats from API outputs: $(($BEAT_CPU_MS_TOTAL + $AGENT_CPU_MS_TOTAL))";

AGENT_PID=403165

# Fields 14 and 15 of /proc/$PID/stat are user-space and kernel-space CPU ticks; convert to ms assuming 100 ticks/second (getconf CLK_TCK)
AGENT_USER_TICKS=$(cat /proc/$AGENT_PID/stat | cut -d' ' -f14)
AGENT_SYSTEM_TICKS=$(cat /proc/$AGENT_PID/stat | cut -d' ' -f15)
AGENT_TOTAL_TICKS=$(($AGENT_USER_TICKS + $AGENT_SYSTEM_TICKS))
AGENT_TOTAL_MS=$(($AGENT_TOTAL_TICKS * 1000 / 100))

#echo "Agent total ticks: $AGENT_TOTAL_TICKS"
#echo "Agent total ms: $AGENT_TOTAL_MS"

BEAT_PID=431834

BEAT_USER_TICKS=$(cat /proc/$BEAT_PID/stat | cut -d' ' -f14)
BEAT_SYSTEM_TICKS=$(cat /proc/$BEAT_PID/stat | cut -d' ' -f15)
BEAT_TOTAL_TICKS=$(($BEAT_USER_TICKS + $BEAT_SYSTEM_TICKS))
BEAT_TOTAL_MS=$(($BEAT_TOTAL_TICKS * 1000 / 100))

#echo "Beat total ticks: $BEAT_TOTAL_TICKS"
#echo "Beat total ms: $BEAT_TOTAL_MS"

echo "Stats from /proc/PID/stats files: $(($AGENT_TOTAL_MS + $BEAT_TOTAL_MS))"
```

### Comparison between metrics in Agent + Beats `/stats` API output and in Elasticsearch `metrics-elastic_agent*` indices

This comparison is relatively easy to make.

First, we call the `/stats` APIs on the machine where Agent and its Beats are running. For example, with an Agent running
one Beat (excluding monitoring Beats):

```shell
$ sudo curl -s http://localhost:6791/stats | jq '.beat.cpu.total.value'
34810
$ sudo curl -s -X GET --unix-socket '/opt/Elastic/Agent/data/tmp/PGwsYWcynGUYZEjD872Gs-npqbv-30jS.sock' 'http:/f/stats' | jq '.beat.cpu.total.value'
795000
```

Then we call the Elasticsearch `_search` API on `metrics-elastic_agent*` indices, keeping the query filters the same as
the query being done by Fleet UI, but only considering the latest documents for each `elastic_agent.process`, since the
CPU utilization metrics are counter metrics.

```
curl -s -u $ES_USER:$ES_PASS -H 'Content-Type: application/json' 'https://test-cpu.es.us-central1.gcp.cloud.es.io:9243/metrics-elastic_agent*/_search' -d '{
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
],
"collapse": {
"field": "elastic_agent.process"
},
"_source": [
"@timestamp",
"system.process.cpu.total.value"
],
"query": {
"bool": {
"must": [
{
"terms": {
"_tier": [
"data_hot"
]
}
},
{
"range": {
"@timestamp": {
"gte": "now-5m"
}
}
},
{
"terms": {
"elastic_agent.id": [
"62efabf2-21ec-4764-b5b5-32f7c6ce509b"
]
}
},
{
"bool": {
"filter": [
{
"bool": {
"should": [
{
"term": {
"data_stream.dataset": "elastic_agent.elastic_agent"
}
}
]
}
}
]
}
}
]
}
}
}' | jq -r '.hits.hits[]._source | [ ."@timestamp", .system.process.cpu.total.value ] | @tsv'
```
```
2024-01-05T23:21:59.164Z 794990
2024-01-05T23:21:59.164Z 34810
```

We can see that the metrics in the Agent + Beats `/stats` API outputs match up with those in the Elasticsearch
`metrics-elastic_agent*` indices.

### Comparison between data in `metrics-elastic_agent*` indices and data shown in Fleet UI

Given the findings in the previous sections, and looking at the Elasticsearch query, it follows that the CPU usage
metrics shown in the Fleet UI for every Agent are computed as follows from the data in the `metrics-elastic_agent*` indices:

1. For the Agent process and for every Beat process that's managed by Agent:
   1. CPU utilization (in milliseconds; stored in the `system.process.cpu.total.value` field) is considered over a 5-minute period.
   2. This 5-minute period is broken down into 1-minute buckets. This is the `cpu_time_series` aggregation in the Elasticsearch
      query.
   3. Recall that the values stored in the `system.process.cpu.total.value` field are counter values, meaning they are expected
      to be non-decreasing over time. So to arrive at a single CPU utilization value for each 1-minute bucket, we take the
      `max(system.process.cpu.total.value)`. This is the `max_cpu` aggregation in the Elasticsearch query.
   4. Given that the values stored in the `system.process.cpu.total.value` field are counter values, to get the CPU utilization
      (in milliseconds) for each 1-minute period, we subtract the previous 1-minute bucket's `max(system.process.cpu.total.value)`
      from the current bucket's value. Further, we rescale this value to 10-second buckets (essentially dividing the 1-minute
      bucket value by 6). This is all done by the `cpu_derivative` aggregation in the Elasticsearch query.
   5. This per-10-second CPU utilization (in milliseconds) is divided by `10 * 1000` (= 10 seconds expressed as milliseconds)
      to get the _percentage_ CPU utilization over 10 elapsed seconds. This is the `cpu` aggregation in the Elasticsearch query.
   6. We now end up with five CPU utilization (in %) values, one for each minute.
   7. These are averaged to arrive at the CPU utilization (in %) over five minutes. This is the `avg_cpu` aggregation in
      the Elasticsearch query.
2. The 5-minute average CPU utilization (in %) for each of the processes is summed up to arrive at a total CPU utilization
   (in %) for the Agent process and all Beat processes managed by Agent. This is the `sum_cpu` aggregation in the Elasticsearch
   query. The result is the total CPU utilization (in %) by Agent and its child Beat processes across all available cores
   (as opposed to being normalized per-core). For example, if a machine has 8 cores, the resulting value will be in the range of
   (0%, 800%). A worked numeric sketch of this computation is shown below.
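To make the arithmetic concrete, here is a small sketch for a single process, using made-up per-minute maxima of
`system.process.cpu.total.value` (all numbers are invented purely for illustration):

```shell
# Hypothetical max(system.process.cpu.total.value) per 1-minute bucket, in ms of CPU time
BUCKETS=(60000 60300 60900 61200 61800 62100)

SUM=0
COUNT=0
for ((i = 1; i < ${#BUCKETS[@]}; i++)); do
  PER_MINUTE_MS=$(( BUCKETS[i] - BUCKETS[i-1] ))               # cpu_derivative: current bucket minus previous
  RATIO=$(echo "scale=6; ($PER_MINUTE_MS / 6) / 10000" | bc)   # rescale to 10s, divide by 10000 ms (the cpu bucket_script)
  SUM=$(echo "$SUM + $RATIO" | bc)
  COUNT=$((COUNT + 1))
done

# avg_cpu: average of the per-bucket values; sum_cpu then adds these averages across all of the Agent's processes
echo "scale=4; $SUM / $COUNT" | bc   # prints .0070, i.e. 0.7% of one core for this process
```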

## Observations

* The CPU utilization (in %) of Agent and each of the Beat processes can vary widely. Generally speaking, the Agent process
itself does not utilize much CPU, while each Beat process may utilize more or less CPU depending on the type of computations it is
performing on the data.

* Also, CPU utilization is rarely constant. Observing the `top` or `htop` output for the Agent and Beat processes over
time shows the CPU utilization % varying for each process.

* To relate the output seen in `top` or `htop` for Agent and Beat processes with the single value shown in the Fleet UI,
one must observe the values in the `top` / `htop` output over five minutes, arrive at an average CPU utilization (in %)
value for each process, and sum these averages. The resulting value will be roughly equal to the single value shown for that Agent
in the Agent Listing page in the Fleet UI.

## Suggested improvements

* The `processes` aggregation in the Elasticsearch query should use the field `component.id` instead of `elastic_agent.process`.
This is to correctly account for multiple instances of the same type of Beat (e.g. Filebeat). This can happen if there
are multiple outputs defined in the Agent policy and some inputs of a type (e.g. log) use one output while other inputs
of the same type use another output: https://github.com/elastic/kibana/issues/174458.

* We should reconsider taking a 5-minute average in the Elasticsearch query made by the Fleet UI and instead take a
30-second or 1-minute average (making corresponding adjustments to the `calendar_interval` value in the `cpu_time_series`
aggregation). This would result in a value closer to what's observed in `top` / `htop` output: https://github.com/elastic/kibana/issues/174799.

* We should link the value shown in the Fleet UI to a chart that breaks it down for that Agent by `component.id` over time,
so the user can see the CPU utilization per Agent component process, over time: https://github.com/elastic/kibana/issues/174800.

* The tooltip shown with the "i" in the CPU column should explain that the value is the sum of the current CPU utilization (in %)
of all Agent component processes, ranging from 0 to (number of cores * 100): https://github.com/elastic/kibana/issues/174801.

* We should enhance collection and aggregation to include CPU utilization for Agent components managed by the service
runtime (e.g. Endpoint) as well, not just Agent components managed by
the command runtime (e.g. Beats) as we do today: https://github.com/elastic/elastic-agent/issues/4083.

* We should enhance collection to include CPU utilization for Agent monitoring components so their contributions are also
counted: https://github.com/elastic/elastic-agent/issues/4082.
