[Kubernetes Integration] Investigate Elastic Agent API calls and check memory consumption #4122

Closed
Tracked by #3801
MichaelKatsoulis opened this issue Jan 23, 2024 · 4 comments
MichaelKatsoulis commented Jan 23, 2024

Leverage APM Tracing (#2612 (comment)) in order to investigate:

  • Elastic Agent API calls towards the k8s API
  • check memory consumption

Update

Background

Get the baseline of current resource usage (8.12.x) and the method to measure it, just for internal understanding.

Goals

Understand internally how to get the metrics.
Measure:
a) Memory (we are not splitting out sub-processes for now, just elastic-agent as a whole)
b) CPU
c) API calls (from both the agent and the underlying beats)

Actions

  1. How to get all the required measurements:
  • are we required to use the system integration? - yes, we want to get the info over time

  • but how to get memory info for the elastic-agent itself? - via the k8s-related providers

  • cluster with only an agent with an empty policy - check the baseline of resource usage + API calls (with audit logs; see the kind sketch after this list)

  • enable the k8s integration - check the memory/CPU change

  2. The scenarios to measure memory, CPU usage and API calls:
  • 1 node cluster with 50 pods:
    a) elastic-agent with default metrics (leave the system integration enabled)
    b) elastic-agent with logs: X rate of logs
  • repeat the above with a 5 node cluster with 50 pods:
    a) elastic-agent with default metrics
    b) elastic-agent with logs: X rate of logs
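
For the audit-log based API-call measurements, the cluster itself needs kube-apiserver audit logging enabled. A minimal sketch, assuming kind is used (the Metadata-level policy and file names are illustrative; the log path matches the one grepped in the comments below):

# Hypothetical catch-all audit policy (Metadata level keeps the log small).
cat <<'EOF' > audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
EOF

# kind cluster config that mounts the policy and turns on audit logging.
cat <<'EOF' > kind-audit.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            audit-log-path: /var/log/kubernetes/kube-apiserver-audit.log
            audit-policy-file: /etc/kubernetes/policies/audit-policy.yaml
          extraVolumes:
            - name: audit-policies
              hostPath: /etc/kubernetes/policies
              mountPath: /etc/kubernetes/policies
              readOnly: true
              pathType: DirectoryOrCreate
            - name: audit-logs
              hostPath: /var/log/kubernetes
              mountPath: /var/log/kubernetes
              readOnly: false
              pathType: DirectoryOrCreate
    extraMounts:
      - hostPath: ./audit-policy.yaml
        containerPath: /etc/kubernetes/policies/audit-policy.yaml
        readOnly: true
EOF

kind create cluster --config kind-audit.yaml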
@MichaelKatsoulis MichaelKatsoulis added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Jan 23, 2024
@tetianakravchenko tetianakravchenko changed the title [Kubernetes Integration] Leverage APM Tracing in order to investigate Elastic Agent API calls and check memory consumption [Kubernetes Integration] Investigate Elastic Agent API calls and check memory consumption Feb 12, 2024
tetianakravchenko commented Feb 12, 2024

Elastic Agent API calls towards k8s API:

  1. Leverage APM Tracing

Some information on this:

  2. Universal Profiling

On kind (1.29) we get this error:

time="2024-02-12T16:40:34.235861244Z" level=error msg="Failed to load eBPF tracer: failed to read kernel modules: unexpected line in modules: 'selfowner 36864 - - Live 0xffffffffc05cd000 (O)'"

Tested on EKS:
(screenshot: Universal Profiling view of the EKS test, 2024-02-12)

  • possible to filter by the container name
  • but how to filter a specific metricbeat process?
  • Universal Profiling does not yet cover or provide memory usage, disk I/O or network bandwidth
  3. Audit logs analysis
    @gsantoro is working on it

check memory consumption

  • Universal Profiling does not yet cover or provide memory usage

Memory consumption consists of 2 parts:

  • metricbeat process - all kubernetes-related data streams run as a dedicated process:
root          40  0.2  5.1 1502252 202888 ?      Sl   16:46   0:08 /usr/share/elastic-agent/data/elastic-agent-9db552/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///etc/elastic-agent/data/tmp/wyZuDAx8vsqXD4cvx932213NPR-EMC2J.sock -E http.pprof.enabled=true -E path.data=/etc/elastic-agent/data/run/kubernetes/metrics-default

so top -p <pid>, ps -p <pid> -o %mem,%cpu, or the system integration can be used (needs to be checked)

For more info, a heap profile can be fetched via pprof (see the analysis sketch at the end of this comment):

curl -v -o mem.pprof.gz --unix-socket /etc/elastic-agent/data/tmp/wyZuDAx8vsqXD4cvx932213NPR-EMC2J.sock http://localhost/debug/pprof/heap
  • elastic-agent - I believe there is no other way except the agent's pprof

FYI:

  • it is also possible to get the cpu pprof for elastic-agent:
elastic-agent diagnostics --cpu-profile
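
The heap and CPU profiles collected above can then be inspected locally. A sketch, assuming a Go toolchain is available on the workstation:

# top entries by in-use memory
go tool pprof -top mem.pprof.gz

# interactive view (flame graph etc.) served in the browser
go tool pprof -http=:8081 mem.pprof.gz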

axw commented Feb 14, 2024

> one of the main goals of #3223 was to propagate the APM tracing configuration to sub-processes (the beats processes) via the control protocol - but the instrumentation must be added to the beats (it seems that instrumentation is only available for the Elasticsearch output - https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-instrumentation.html)

Right, so there are a couple of other things we need beyond #3223:

tetianakravchenko commented:

Test scenario 1:
1 node k8s cluster
stack - 8.12.1
Additionally installed:

  • KSM
  • metrics-server:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch -n kube-system deployment metrics-server --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
  • kubernetes-dashboard:
helm upgrade --install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard --create-namespace --namespace kubernetes-dashboard --set=service.externalPort=8080,resources.limits.cpu=200m,metricsScraper.enabled=true
(a token also needs to be created for access; see the sketch after this list)
  • deploy extra pods: ./stress_test_k8s --deployments=4 --namespaces=10 --podlabels=4 --podannotations=4
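
For the dashboard token, something along these lines should work (a sketch; the service account name assumes the chart defaults):

kubectl -n kubernetes-dashboard create token kubernetes-dashboard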
  1. Empty policy:
  • API calls:
date; cat /var/log/kubernetes/kube-apiserver-audit.log | grep -a '"stage":"ResponseComplete"' | grep '"username":"system:serviceaccount:kube-system:elastic-agent"' | grep -v "/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/elastic-agent-cluster-leader" | wc -l
Fri Mar  8 13:07:19 UTC 2024 - 47     <---- after the start
Fri Mar  8 13:10:05 UTC 2024 - 47
Fri Mar  8 13:12:18 UTC 2024 - 47
Fri Mar  8 13:12:49 UTC 2024 - 49
Fri Mar  8 13:13:10 UTC 2024 - 51
Fri Mar  8 13:17:54 UTC 2024 - 60

Note: leader-election API calls are skipped - they cause 1 API call/sec
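
To turn these cumulative counts into a rate, the same count can be sampled in a loop; an illustrative sketch reusing the exact filters from above (the first printed delta is just the running total):

prev=0
while sleep 60; do
  cur=$(grep -a '"stage":"ResponseComplete"' /var/log/kubernetes/kube-apiserver-audit.log \
        | grep '"username":"system:serviceaccount:kube-system:elastic-agent"' \
        | grep -vc "/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/elastic-agent-cluster-leader")
  echo "$(date -u) - $((cur - prev)) calls/min"
  prev=$cur
done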

  • cpu/memory within: 3m CPU, 224Mi-230Mi memory
(screenshot: ea-empty-policy resource usage)
  2. Add the Kubernetes integration
  • api calls
date; cat /var/log/kubernetes/kube-apiserver-audit.log | grep -a '"stage":"ResponseComplete"' | grep '"username":"system:serviceaccount:kube-system:elastic-agent"' | grep -v "/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/elastic-agent-cluster-leader" | wc -l
Fri Mar  8 13:26:36 UTC 2024 - 205
Fri Mar  8 13:27:35 UTC 2024 - 207
Fri Mar  8 13:28:50 UTC 2024 - 209
Fri Mar  8 13:50:13 UTC 2024 - 455
Fri Mar  8 14:20:20 UTC 2024 - 797
  • cpu/mem: 46m-206m CPU, 592Mi-665Mi memory
(screenshot: resource usage with the Kubernetes integration, 2024-03-08)


tetianakravchenko commented Apr 16, 2024

Logs only:
1 node k8s cluster
stack - 8.13.1
same setup as above

  1. API calls:
root@kind-control-plane:/# date; cat /var/log/kubernetes/kube-apiserver-audit.log | grep -a '"stage":"ResponseComplete"' | grep '"username":"system:serviceaccount:kube-system:elastic-agent"' | grep -v "/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/elastic-agent-cluster-leader" | wc -l
Tue Apr 16 08:09:13 UTC 2024 - 42 <---- after the start
Tue Apr 16 08:15:41 UTC 2024 - 47
Tue Apr 16 08:24:25 UTC 2024 - 69
Tue Apr 16 08:46:46 UTC 2024 - 116   <-- more traffic enabled
Tue Apr 16 08:54:48 UTC 2024 - 135
Tue Apr 16 09:06:54 UTC 2024 - 158
Tue Apr 16 09:16:42 UTC 2024 - 180
kubectl top pod elastic-agent-crbdp
NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-crbdp   16m          379Mi           (about 100-150 logs/min)

NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-crbdp   463m-600m         473Mi-530Mi           (~180,000 logs/min)
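
One way to record such a series over time (illustrative; the pod name is taken from this test):

while sleep 30; do
  echo "$(date -u) $(kubectl top pod elastic-agent-crbdp --no-headers)"
done | tee agent-usage.log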

After enabling more traffic / increasing the amount of logs:
(screenshots: CPU and memory usage after increasing the log volume, 2024-04-16)
