Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[processor/k8sattributes] Excessive heap usage from k8sattributes processor after 0.78.0 upgrade #23067

Closed
sirianni opened this issue Jun 5, 2023 · 45 comments
Assignees
Labels
bug Something isn't working closed as inactive priority:p1 High processor/k8sattributes k8s Attributes processor Stale

Comments

@sirianni
Copy link
Contributor

sirianni commented Jun 5, 2023

Component(s)

processor/k8sattributes

What happened?

Description

We recently upgraded the collector to 0.78.0 and are noticing a massive increase in CPU utilization. It seems that the main cause is increase in GC activity. Looking at the heap profile I see k8sattributes processor as a top culprit

(pprof) top15 -cum
Showing nodes accounting for 0, 0% of 235.31MB total
Dropped 133 nodes (cum <= 1.18MB)
Showing top 15 nodes out of 111
      flat  flat%   sum%        cum   cum%
         0     0%     0%   206.69MB 87.84%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).object
         0     0%     0%   206.69MB 87.84%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).unmarshal
         0     0%     0%   206.69MB 87.84%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).value
         0     0%     0%   206.69MB 87.84%  sigs.k8s.io/json/internal/golang/encoding/json.Unmarshal
         0     0%     0%   205.69MB 87.42%  k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).unmarshal
         0     0%     0%   205.69MB 87.42%  sigs.k8s.io/json.UnmarshalCaseSensitivePreserveInts (inline)
         0     0%     0%   204.69MB 86.99%  k8s.io/apimachinery/pkg/runtime.WithoutVersionDecoder.Decode
         0     0%     0%   204.69MB 86.99%  k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).Decode
         0     0%     0%   203.69MB 86.57%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).array
         0     0%     0%   187.18MB 79.55%  k8s.io/client-go/rest.Result.Into
         0     0%     0%   186.18MB 79.12%  k8s.io/client-go/kubernetes/typed/apps/v1.(*replicaSets).List
         0     0%     0%   182.68MB 77.63%  github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.replicasetListFuncWithSelectors.func1

Collector version

0.78.2

Environment information

Environment

Kubernetes

OpenTelemetry Collector configuration

processors:
  k8sattributes:
    filter:
      node_from_env_var: K8S_NODE_NAME
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: resource_attribute
        name: k8s.namespace.name
      - from: resource_attribute
        name: k8s.pod.name
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: connection
    extract:
      labels:
      - key: app.kubernetes.io/name
        from: pod

Log output

No response

Additional context

This appears to be caused by #14003 where the processor started tracking replica sets.

CPU usage before/after upgrade
image

@sirianni sirianni added bug Something isn't working needs triage New item requiring triage labels Jun 5, 2023
@github-actions github-actions bot added the processor/k8sattributes k8s Attributes processor label Jun 5, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Jun 5, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@sirianni
Copy link
Contributor Author

sirianni commented Jun 5, 2023

Disabling metadata extraction for k8s.deployment.name and k8s.deployment.uid significantly improves CPU utilization

    extract:
      metadata:
      - k8s.namespace.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.node.name
      - container.image.name
      - container.image.tag

image

@dmitryax
Copy link
Member

dmitryax commented Jun 5, 2023

@fatsheep9146, can you please work on this issue and see what allocations we can avoid?

@dmitryax
Copy link
Member

dmitryax commented Jun 5, 2023

@sirianni thanks for reporting!

Disabling metadata extraction for k8s.deployment.name and k8s.deployment.uid significantly improves CPU utilization

Does it go back to the original (pre-upgrade) level?

@sirianni
Copy link
Contributor Author

sirianni commented Jun 5, 2023

Is there a way to get the (perhaps imperfect?) k8s.deployment.name attribute back? We don't really care about the deployment UID but would like to avoid all of this overhead of replicaset bookkeeping.

@sirianni
Copy link
Contributor Author

sirianni commented Jun 5, 2023

Does it go back to the original (pre-upgrade) level?

Yes, it appears so.

Pre-upgrade --> 0.78.x

image

0.78.x --> 0.78.x with workaround

image

@fatsheep9146
Copy link
Contributor

@fatsheep9146, can you please work on this issue and see what allocations we can avoid?

No problem, I will check about this. @dmitryax

@dmitryax
Copy link
Member

dmitryax commented Jun 6, 2023

@fatsheep9146 I think I know what's going on. Eric runs it on the agent with a node selector, so the pods informer has a limited set of pods to watch. The replicaset informer, on the other hand, cannot have a node selector, so we keep a state about all the replicasets in the whole cluster, which can be pretty big.

One option is to get rid of the replicaset informer and populate the pod -> rs -> deployment data right in the Pod struct every time a pod added/updated.

@fatsheep9146
Copy link
Contributor

The replicaset informer was added due to the issue of , which want to add k8s.deployment.uid support. In that issue, we have discussed about how to produce the k8s.deployment.name

#12936 (comment)

So I think maybe we can do like this way

  • if only 'k8s.deployment.name' is specified, we use the pod struct to parse the deployment name with regex
  • if 'k8s.deployment.uid' is specified, we use replicaset informer.

@dmitryax WDYT

@dmitryax
Copy link
Member

dmitryax commented Jun 6, 2023

I don't believe we have to use the replicaset informer. We don't need any dynamic data from it that can be changed without rolling the pod. So we need to watch for pods only, not replicasets. We can fetch replicaset and deployment data only when pod got changed or added. To avoid high-frequency calls to k8s API for a pod in crash loop, we can utilize some lru cache for replicasets instead of the informer.

@dmitryax dmitryax added priority:p1 High and removed needs triage New item requiring triage labels Jun 6, 2023
@fatsheep9146
Copy link
Contributor

fatsheep9146 commented Jun 6, 2023

I don't believe we have to use the replicaset informer. We don't need any dynamic data from it that can be changed without rolling the pod. So we need to watch for pods only, not replicasets. We can fetch replicaset and deployment data only when pod got changed or added. To avoid high-frequency calls to k8s API for a pod in crash loop, we can utilize some lru cache for replicasets instead of the informer.

Ok, I got your point, you mean we don't need cache all replicaset data through informer, but we should use lru cache to cache the replicasets which has pods we already watched. And only update the caches when a new pod is added.
I thinks when the existing pod is updated, it should not indicate that the replicaset is also updated, because if the replicaset is changed, it should always create new pod and remove old pods.

@dmitryax

@dmitryax
Copy link
Member

dmitryax commented Jun 6, 2023

Ok, I got your point, you mean we don't need cache all replicaset data through informer, but we should use lru cache to cache the replicasets which has pods we already watched. And only update the caches when a new pod is added.

Yes

I thinks when the existing pod is updated, it should not indicate that the replicaset is also updated, because if the replicaset is changed, it should always create new pod and remove old pods.

Not sure I understand this. We have a handler for a new pod, as well as a handler for a pod update. The first one would always be a cache miss for the replicaset cache, so we need to make another call to k8s api to fill it. The second one is likely always a cache hit.

@fatsheep9146
Copy link
Contributor

Ok, I got your point, you mean we don't need cache all replicaset data through informer, but we should use lru cache to cache the replicasets which has pods we already watched. And only update the caches when a new pod is added.

Yes

I thinks when the existing pod is updated, it should not indicate that the replicaset is also updated, because if the replicaset is changed, it should always create new pod and remove old pods.

Not sure I understand this. We have a handler for a new pod, as well as a handler for a pod update. The first one would always be a cache miss for the replicaset cache, so we need to make another call to k8s api to fill it. The second one is likely always a cache hit.

Ok, I basically understand your idea, I will try to fix this in this way.

Speak of LRU cache, do you know any component in collector-contrib has similar implementation that I can import?

@dmitryax

@gbbr
Copy link
Member

gbbr commented Jun 6, 2023

@fatsheep9146 looks like github.com/hashicorp/golang-lru/simplelru is commonly used in this repository. Seems like a good option.

@fatsheep9146
Copy link
Contributor

@fatsheep9146 looks like github.com/hashicorp/golang-lru/simplelru is commonly used in this repository. Seems like a good option.

Thanks! @gbbr

@swiatekm
Copy link
Contributor

swiatekm commented Jun 6, 2023

In our fork of k8sattributes, we've had success reducing memory usage of the informer stores themselves by using SetTransform on the larger informers. SumoLogic/sumologic-otel-collector#593 made a very big difference for us in large clusters.

We've only done it for Pods in a gateway context, but in principle it could be done for every resource type we start informers for. The change is also very simple conceptually, as it's just getting rid of fields we don't use. Think it's worth pursuing @dmitryax if we're concerned about memory usage? I can open a new issue for it.

@dmitryax
Copy link
Member

dmitryax commented Jun 6, 2023

@swiatekm-sumo sounds like a good idea to reduce the size of the items fetched by the informer. It'd be great if you can upstream it. But we still need to solve this issue to avoid fetching all the replicasets in the agent mode.

@swiatekm-sumo is there any other reason you have this component forked?

@swiatekm
Copy link
Contributor

swiatekm commented Jun 6, 2023

@swiatekm-sumo sounds like a good idea to reduce the size of the items fetched by the informer. It'd be great if you can upstream it. But we still need to solve this issue to avoid fetching all the replicasets in the agent mode.

My hope was to keep the ReplicaSet footprint low enough that we could just store all of them. Your proposal with the LRU cache and fetching ReplicaSets from the API Server directly sounds significantly more complicated than what we have now.

@swiatekm-sumo is there any other reason you have this component forked?

It was originally forked back when the contrib processor only included Pod metadata, I think, though I wasn't here for it.

At this point, the only reasons are the memory usage and some owner metdata - I think only CronJob and Service names. I would very much like to upstream those and get rid of the fork.

@dmitryax
Copy link
Member

dmitryax commented Jun 6, 2023

My hope was to keep the ReplicaSet footprint low enough that we could just store all of them. Your proposal with the LRU cache and fetching ReplicaSets from the API Server directly sounds significantly more complicated than what we have now.

We can try this first and see if it solves the high memory issue. cc @fatsheep9146

At this point, the only reasons are the memory usage and some owner metadata - I think only CronJob and Service names. I would very much like to upstream those and get rid of the fork.

We have k8s.cronjob.name already. Not sure about service names, given that one pod can serve several services. I believe someone raised that issue, we can reconsider it

@sirianni
Copy link
Contributor Author

sirianni commented Jun 6, 2023

keep the ReplicaSet footprint low enough that we could just store all of them.

I agree with @dmitryax that if the Node (or NodeFromEnvVar) filter is specified then the processor should make a very strong attempt to only store metadata for pods related to that node. We run some very large Kubernetes clusters (600+ nodes) and it's very inefficient and unnecessary for a DaemonSet collector to store metadata for resources not on its local node.

No matter how small the footprint is, it's very desirable to have memory usage be O(# pods on node) instead of O(# pods on cluster), since the former is much more fixed.

Also, when running in agent / DaemonSet mode, it's critical for the collector's footprint to be as low as possible. Storing any unnecessary metadata should be avoided.

@swiatekm
Copy link
Contributor

swiatekm commented Jun 6, 2023

keep the ReplicaSet footprint low enough that we could just store all of them.

I agree with @dmitryax that if the Node (or NodeFromEnvVar) filter is specified then the processor should make a very strong attempt to only store metadata for pods related to that node. We run some very large Kubernetes clusters (600+ nodes) and it's very inefficient and unnecessary for a DaemonSet collector to store metadata for resources not on its local node.

No matter how small the footprint is, it's very desirable to have memory usage be O(# pods on node) instead of O(# pods on cluster), since the former is much more fixed.

Also, when running in agent / DaemonSet mode, it's critical for the collector's footprint to be as low as possible. Storing any unnecessary metadata should be avoided.

This is a good point, and I agree that controlling resource consumption for DaemonSet Pods is very important. For the SetTransform optimization, this isn't an either/or thing - we can do it anyway for Pods, and reduce the memory footprint further even without having enabled any metadata requiring ReplicaSets.

I do want to point out though, that in this case (of the solution discussed in this issue) the agent and gateway requirements are somewhat opposed. The agent running as a DaemonSet wants to only store the metadata relevant to Pods on its Node, and is (presumably, as we haven't tested this) fine doing some synchronous requests to the API Server if that means avoiding having to store global cluster state. On the other hand, the gateway needs the whole cluster state, and is definitely not fine making a single synchronous request for each ReplicaSet in a busy cluster.

In an ideal world, we would use different approaches for these use cases, but that significantly increases the maintenance burden and complexity of the processor. So we might need to compromise one way or another.

Out of curiosity, if we implemented the SetTransform optimizations and achieved lower memory consumption than in 0.77.0, even while storing all ReplicaSets, would you be unhappy with that outcome?

@dmitryax
Copy link
Member

dmitryax commented Jun 6, 2023

@swiatekm-sumo, thanks for pointing this out. I agree differentiating the logic based on the collector's deployment topology is not an ideal solution. Let's apply your optimization both to pods and replica sets informers first and see what the outcome will be. @swiatekm-sumo, do you have a chance to submit a PR given that you did in your fork?

@fatsheep9146
Copy link
Contributor

I also agree with @swiatekm-sumo, the SetTransform optimization is beneficial to both gateway mode and daemonset mode.
I think the optimization for daemonset model is better to be based on the SetTransform optimization.

@dmitryax @sirianni

@fatsheep9146
Copy link
Contributor

ping @sirianni , do you have tried this?

@sirianni
Copy link
Contributor Author

sirianni commented Jul 11, 2023

We are still seeing significantly elevated CPU usage after the upgrade due to memory allocations / GC churn.

I don't think it's quite as bad as before, but it's still elevated enough that we'll have to resort to a workaround of disabling that k8s.deployment.name attribute. Again, this is a daemonset pod that runs co-located with all of our actual applications, so CPU and memory is at a premium.

I realize the code complexity you are dealing with trying to balance "agent mode" vs. "gateway mode", but I would very much appreciate restoring the previous behavior that didn't have this issue. I'd rather have a slightly imperfect k8s.deployment.name attribute (#12936 (comment)) then this increase in resource utilization.

image

@sirianni
Copy link
Contributor Author

$ go tool pprof ./bin/collector http://localhost:1777/debug/pprof/heap
File: collector
Build ID: 79c5900955e9865a447786690c0e370565a2dd56
Type: inuse_space
Time: Jul 11, 2023 at 1:13pm (EDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top15 -cum
Showing nodes accounting for 11.51MB, 23.48% of 49.04MB total
Showing top 15 nodes out of 285
      flat  flat%   sum%        cum   cum%
         0     0%     0%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*DeltaFIFO).Resync
       1MB  2.04%  2.04%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*DeltaFIFO).queueActionLocked
         0     0%  2.04%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*DeltaFIFO).syncKeyLocked
         0     0%  2.04%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*Reflector).startResync
         0     0%  2.04%       11MB 22.44%  runtime.main
         0     0%  2.04%    10.51MB 21.44%  github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.New.func2
   10.51MB 21.44% 23.48%    10.51MB 21.44%  github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.removeUnnecessaryReplicaSetData (inline)
         0     0% 23.48%        6MB 12.24%  main.main
         0     0% 23.48%     5.53MB 11.29%  runtime.doInit
         0     0% 23.48%     5.52MB 11.25%  k8s.io/apimachinery/pkg/runtime.WithoutVersionDecoder.Decode
         0     0% 23.48%     5.52MB 11.25%  k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).Decode
         0     0% 23.48%     5.52MB 11.25%  k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).unmarshal
         0     0% 23.48%     5.52MB 11.25%  k8s.io/client-go/tools/cache.(*ListWatch).List
         0     0% 23.48%     5.52MB 11.25%  k8s.io/client-go/tools/cache.(*Reflector).list.func1
         0     0% 23.48%     5.52MB 11.25%  k8s.io/client-go/tools/cache.(*Reflector).list.func1.2

@dmitryax
Copy link
Member

dmitryax commented Jul 11, 2023

I'd attribute the high churn to the creation of new instances in the removeUnnecessaryReplicaSetData function. @swiatekm-sumo, maybe we can reuse exiting instances and cut off the redundant fields instead?

@swiatekm
Copy link
Contributor

I'd attribute the high churn to the creation of new instances in the removeUnnecessaryReplicaSetData function. @swiatekm-sumo, maybe we can reuse exiting instances and cut off the redundant fields instead?

We could. I originally didn't because it's more complicated to reason about and maintain. If we do want to change it, I'd like to see a benchmark showing it actually shows improvement before we merge the change.

@swiatekm
Copy link
Contributor

@sirianni if you don't want to wait until we figure this out and want to upgrade while keeping the old behaviour, I'm pretty sure you can emulate it using the transform processor. You can:

  1. Add k8s.replicaset.name to the metadata attributes in k8sattributes processor config. This does not require the ReplicaSet informer.
  2. Use transform processor to add k8s.deployment.name by doing a regex replace using the same pattern we used to use: ^(.*)-[0-9a-zA-Z]+$
  3. Drop k8s.replicaset.name

@sirianni
Copy link
Contributor Author

sirianni commented Jul 12, 2023

Thanks @swiatekm-sumo !

Even with extraction of k8s.deployment.name disabled, I'm still seeing some flaky behavior on some of our k8s clusters. This timeout seems to indicate that the ReplicaSet informer is still in use....?

Failed to watch *v1.ReplicaSet: failed to list *v1.ReplicaSet: Get "[https://10.253.0.1:443/apis/apps/v1/replicasets?limit=500&resourceVersion=0](https://10.253.0.1/apis/apps/v1/replicasets?limit=500&resourceVersion=0)": dial tcp 10.253.0.1:443: i/o timeout

@swiatekm
Copy link
Contributor

Strange, it should only create that one if you have either k8s.deployment.name or k8s.deployment.uid in your metadata attributes list.

@sirianni
Copy link
Contributor Author

emulate it using the transform processor

Here is the YAML for anyone else reading

  transform/k8s_attributes:
    metric_statements:
    - context: resource
      statements:
      - set(attributes["k8s.deployment.name"], attributes["k8s.replicaset.name"])
      - replace_pattern(attributes["k8s.deployment.name"], "^(.*)-[0-9a-zA-Z]+$", "$$1")
      - delete_key(attributes, "k8s.replicaset.name")
    trace_statements:
    - context: resource
      statements:
      - set(attributes["k8s.deployment.name"], attributes["k8s.replicaset.name"])
      - replace_pattern(attributes["k8s.deployment.name"], "^(.*)-[0-9a-zA-Z]+$", "$$1")
      - delete_key(attributes, "k8s.replicaset.name")

@github-actions
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Dec 19, 2023
@matej-g
Copy link
Contributor

matej-g commented Jan 18, 2024

This is still needed, should not be stale.

@github-actions github-actions bot removed the Stale label Jan 19, 2024
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@robincw
Copy link

robincw commented May 20, 2024

Issue coming up to 1 year birthday 🍰

@swiatekm
Copy link
Contributor

Thinking about this more, I'd be surprised if optimizing the transform function would help much here as suggested by @dmitryax in #23067 (comment). We're really only allocating an additional struct, all the data we put in it is already allocated and shouldn't trigger any copies. My expectation would be that the churn is simply from deserializing ReplicaSet structs in the Watch.

I don't think we can do much about that cost, but maybe we can update less frequently? For Deployment/StatefulSet/DaemonSet names, we only care about the ReplicaSet owner references. While those references are technically mutable, they will only really change due to relatively rare events. So maybe we could set the ReplicaSet watch to only trigger on Create, and let periodic resync take care of any owner reference discrepancies?

@swiatekm
Copy link
Contributor

swiatekm commented Jul 3, 2024

Having looked into this a bit more, I don't think the K8s API lets us have a watch on only ADDED notifications. We can of course ignore the notification based on type, but at that point we've already deserialized it, and probably paid most of the final cost of processing it.

I think what we could do instead is ask for only metadata updates for ReplicaSets (and probably other resource types as well). client-go has a metadata-only informer, and for a lot of resources, metadata is all we need. That should help with memory usage and GC pressure.

Copy link
Contributor

github-actions bot commented Sep 2, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 2, 2024
Copy link
Contributor

github-actions bot commented Nov 1, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working closed as inactive priority:p1 High processor/k8sattributes k8s Attributes processor Stale
Projects
None yet
Development

No branches or pull requests

7 participants