[processor/k8sattributes] Excessive heap usage from k8sattributes processor after 0.78.0 upgrade #23067
Comments
Pinging code owners.
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@fatsheep9146, can you please work on this issue and see what allocations we can avoid?
@sirianni thanks for reporting!
Does it go back to the original (pre-upgrade) level?
Is there a way to get the (perhaps imperfect?) k8s.deployment.name the processor produced before, without the replicaset informer?
No problem, I will look into this. @dmitryax
@fatsheep9146 I think I know what's going on. Eric runs it on the agent with a node selector, so the pods informer has a limited set of pods to watch. The replicaset informer, on the other hand, cannot have a node selector, so we keep state about all the replicasets in the whole cluster, which can be pretty big. One option is to get rid of the replicaset informer and populate the pod -> rs -> deployment data right in the pod add/update handlers.
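For context, scoping a pods informer to a single node is done with a field selector on spec.nodeName. A minimal client-go sketch of that pattern (package and function names here are illustrative, not the processor's actual code):

```go
// Sketch only: a pods informer scoped to one node via a field selector,
// so the agent's cache holds only the pods scheduled on that node.
package k8ssketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func newNodeScopedPodInformer(nodeName string) (cache.SharedIndexInformer, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute,
		// Only list/watch pods whose spec.nodeName matches this node.
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", nodeName).String()
		}),
	)
	return factory.Core().V1().Pods().Informer(), nil
}
```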
The replicaset informer was added for the issue that asked to add k8s.deployment.uid support; in that issue we discussed how to produce that value. So I think maybe we can change the approach.
@dmitryax WDYT
I don't believe we have to use the replicaset informer. We don't need any dynamic data from it that can change without rolling the pod. So we need to watch for pods only, not replicasets. We can fetch replicaset and deployment data only when a pod gets added or changed. To avoid high-frequency calls to the k8s API for a pod in a crash loop, we can utilize an LRU cache for replicasets instead of the informer.
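A rough sketch of that approach, assuming hashicorp/golang-lru/v2 for the cache (all names are illustrative, not the processor's actual API): the pod add/update handlers resolve the owning Deployment through the cache and hit the API server only on a miss.

```go
// Illustrative sketch only: resolve a pod's deployment name via an
// LRU-cached ReplicaSet lookup instead of a cluster-wide RS informer.
package rscache

import (
	"context"

	lru "github.com/hashicorp/golang-lru/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type rsInfo struct {
	DeploymentName string
}

type Resolver struct {
	client kubernetes.Interface
	cache  *lru.Cache[string, rsInfo] // key: namespace/replicaset-name
}

func NewResolver(client kubernetes.Interface, size int) (*Resolver, error) {
	c, err := lru.New[string, rsInfo](size)
	if err != nil {
		return nil, err
	}
	return &Resolver{client: client, cache: c}, nil
}

// DeploymentForPod is called from the pod add/update handlers. A pod in a
// crash loop keeps hitting the cache instead of the API server.
func (r *Resolver) DeploymentForPod(ctx context.Context, pod *corev1.Pod) (string, error) {
	owner := ownerOfKind(pod.OwnerReferences, "ReplicaSet")
	if owner == nil {
		return "", nil
	}
	key := pod.Namespace + "/" + owner.Name
	if info, ok := r.cache.Get(key); ok {
		return info.DeploymentName, nil
	}
	rs, err := r.client.AppsV1().ReplicaSets(pod.Namespace).Get(ctx, owner.Name, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	info := rsInfo{}
	if dep := ownerOfKind(rs.OwnerReferences, "Deployment"); dep != nil {
		info.DeploymentName = dep.Name
	}
	r.cache.Add(key, info)
	return info.DeploymentName, nil
}

func ownerOfKind(refs []metav1.OwnerReference, kind string) *metav1.OwnerReference {
	for i := range refs {
		if refs[i].Kind == kind {
			return &refs[i]
		}
	}
	return nil
}
```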
Ok, I got your point: you mean we don't need to cache all replicaset data through the informer, but should instead use an LRU cache for the replicasets that own the pods we already watch, and only update the cache when a new pod is added.
Yes
Not sure I understand this. We have a handler for a new pod, as well as a handler for a pod update. The first one would always be a cache miss for the replicaset cache, so we need to make another call to the k8s API to fill it. The second one is likely always a cache hit.
Ok, I basically understand your idea, and I will try to fix it this way. Speaking of the LRU cache, do you know of any component in collector-contrib that has a similar implementation I can import?
@fatsheep9146 looks like
Thanks! @gbbr
In our fork of k8sattributes, we've had success reducing memory usage of the informer stores themselves by using SetTransform to drop fields we don't use before objects are stored. We've only done it for Pods in a gateway context, but in principle it could be done for every resource type we start informers for. The change is also very simple conceptually, as it's just getting rid of fields we don't use. Think it's worth pursuing @dmitryax if we're concerned about memory usage? I can open a new issue for it.
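For reference, a hedged sketch of what such a transform could look like using client-go's SetTransform; which fields are safe to drop is an assumption here, not necessarily the fork's actual list:

```go
// Hedged sketch: strip fields the processor never reads before objects
// enter the informer's store.
package informeropt

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// slimPodTransform keeps only the pod fields this sketch assumes are needed
// for attribute extraction.
func slimPodTransform(obj interface{}) (interface{}, error) {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		// Leave non-pod objects (e.g. deletion tombstones) untouched.
		return obj, nil
	}
	// Drop some of the larger contributors to heap usage.
	pod.ManagedFields = nil
	pod.Spec.Volumes = nil
	pod.Spec.Containers = trimContainers(pod.Spec.Containers)
	pod.Spec.InitContainers = trimContainers(pod.Spec.InitContainers)
	return pod, nil
}

func trimContainers(cs []corev1.Container) []corev1.Container {
	out := make([]corev1.Container, 0, len(cs))
	for _, c := range cs {
		// Keep only name and image in this sketch.
		out = append(out, corev1.Container{Name: c.Name, Image: c.Image})
	}
	return out
}

func install(informer cache.SharedIndexInformer) error {
	// Must be set before the informer is started.
	return informer.SetTransform(slimPodTransform)
}
```

The transform runs before objects are added to the informer's store, so the dropped fields never contribute to steady-state heap usage.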
@swiatekm-sumo sounds like a good idea to reduce the size of the items fetched by the informer. It'd be great if you could upstream it. But we still need to solve this issue to avoid fetching all the replicasets in the agent mode. @swiatekm-sumo is there any other reason you have this component forked?
My hope was to keep the ReplicaSet footprint low enough that we could just store all of them. Your proposal with the LRU cache and fetching ReplicaSets from the API Server directly sounds significantly more complicated than what we have now.
It was originally forked back when the contrib processor only included Pod metadata, I think, though I wasn't here for it. At this point, the only reasons are the memory usage and some owner metadata - I think only CronJob and Service names. I would very much like to upstream those and get rid of the fork.
We can try this first and see if it solves the high memory issue. cc @fatsheep9146
We have
I agree with @dmitryax that we shouldn't need the replicaset informer. No matter how small the footprint is, it's very desirable to have memory usage be proportional to the pods on the node rather than to the entire cluster. Also, when running in agent / DaemonSet mode, it's critical for the collector's footprint to be as low as possible; storing any unnecessary metadata should be avoided.
This is a good point, and I agree that controlling resource consumption for DaemonSet Pods is very important.

For the SetTransform optimization, this isn't an either/or thing - we can do it anyway for Pods and reduce the memory footprint further, even without having enabled any metadata requiring ReplicaSets.

I do want to point out, though, that in this case (the solution discussed in this issue) the agent and gateway requirements are somewhat opposed. The agent running as a DaemonSet wants to store only the metadata relevant to Pods on its Node, and is (presumably, as we haven't tested this) fine doing some synchronous requests to the API Server if that means avoiding having to store global cluster state. The gateway, on the other hand, needs the whole cluster state, and is definitely not fine making a synchronous request for each ReplicaSet in a busy cluster.

In an ideal world, we would use different approaches for these use cases, but that significantly increases the maintenance burden and complexity of the processor. So we might need to compromise one way or another.

Out of curiosity, if we implemented the SetTransform optimizations and achieved lower memory consumption than in 0.77.0, even while storing all ReplicaSets, would you be unhappy with that outcome?
@swiatekm-sumo, thanks for pointing this out. I agree that differentiating the logic based on the collector's deployment topology is not an ideal solution. Let's apply your optimization to both the pods and replicasets informers first and see what the outcome will be. @swiatekm-sumo, do you have a chance to submit a PR, given that you already did this in your fork?
ping @sirianni, have you tried this?
We are still seeing significantly elevated CPU usage after the upgrade due to memory allocations / GC churn. I don't think it's quite as bad as before, but it's still elevated enough that we'll have to resort to a workaround of disabling k8s.deployment.name extraction.

I realize the code complexity you are dealing with in trying to balance "agent mode" vs. "gateway mode", but I would very much appreciate restoring the previous behavior that didn't have this issue. I'd rather have a slightly imperfect k8s.deployment.name.
I'd attribute the high churn to the creation of new instances in the transform function.
We could. I originally didn't because it's more complicated to reason about and maintain. If we do want to change it, I'd like to see a benchmark showing an actual improvement before we merge the change.
@sirianni if you don't want to wait until we figure this out and want to upgrade while keeping the old behaviour, I'm pretty sure you can emulate it using the transform processor: extract k8s.replicaset.name instead and derive k8s.deployment.name from it by stripping the pod-template hash (see the YAML posted below).
Thanks @swiatekm-sumo! Even with extraction of k8s.deployment.name disabled, the replicaset informer still appears to be created.
Strange, it should only create that one if you have either k8s.deployment.name or k8s.deployment.uid in the extracted metadata.
Here is the YAML for anyone else reading:

transform/k8s_attributes:
  metric_statements:
    - context: resource
      statements:
        - set(attributes["k8s.deployment.name"], attributes["k8s.replicaset.name"])
        - replace_pattern(attributes["k8s.deployment.name"], "^(.*)-[0-9a-zA-Z]+$", "$$1")
        - delete_key(attributes, "k8s.replicaset.name")
  trace_statements:
    - context: resource
      statements:
        - set(attributes["k8s.deployment.name"], attributes["k8s.replicaset.name"])
        - replace_pattern(attributes["k8s.deployment.name"], "^(.*)-[0-9a-zA-Z]+$", "$$1")
        - delete_key(attributes, "k8s.replicaset.name")

(The replace_pattern strips the trailing pod-template hash from the replicaset name, which recovers the deployment name.)
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments; if you are unsure of which component this issue relates to, please ping the code owners.
See Adding Labels via Comments if you do not have permissions to add labels yourself.
This is still needed, should not be stale. |
Issue coming up to its 1 year birthday 🍰
Thinking about this more, I'd be surprised if optimizing the transform function would help much here, as suggested by @dmitryax in #23067 (comment). We're really only allocating an additional struct; all the data we put in it is already allocated and shouldn't trigger any copies. My expectation would be that the churn is simply from deserializing ReplicaSet structs in the Watch.

I don't think we can do much about that cost, but maybe we can update less frequently? For Deployment/StatefulSet/DaemonSet names, we only care about the ReplicaSet owner references. While those references are technically mutable, they will only really change due to relatively rare events. So maybe we could set the ReplicaSet watch to only trigger on Create, and let periodic resync take care of any owner reference discrepancies?
Having looked into this a bit more, I don't think the K8s API lets us have a watch on only ADDED notifications. We can of course ignore the notification based on type, but at that point we've already deserialized it and probably paid most of the final cost of processing it.

I think what we could do instead is ask for only metadata updates for ReplicaSets (and probably other resource types as well). client-go has a metadata-only informer, and for a lot of resources, metadata is all we need. That should help with memory usage and GC pressure.
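A minimal sketch of that idea using client-go's metadata-only informer (the resync interval and names are illustrative, not the processor's current code):

```go
// Hedged sketch: watch only ReplicaSet metadata, so spec/status never
// reach the informer's store. Owner references live in metadata, which is
// all that's needed for deployment-name resolution.
package informeropt

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/metadata/metadatainformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func newReplicaSetMetadataInformer(cfg *rest.Config) (cache.SharedIndexInformer, error) {
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	factory := metadatainformer.NewSharedInformerFactory(client, 30*time.Minute)
	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "replicasets"}
	informer := factory.ForResource(gvr).Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			// Objects arrive as *metav1.PartialObjectMetadata: name,
			// namespace, labels, and owner references, but no spec/status.
			if m, ok := obj.(*metav1.PartialObjectMetadata); ok {
				_ = m.OwnerReferences // e.g. resolve the owning Deployment
			}
		},
	})
	return informer, nil
}
```

The store then holds only PartialObjectMetadata objects, so owner references stay available for resolving Deployment names while spec and status are never retained.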
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Component(s)
processor/k8sattributes
What happened?
Description
We recently upgraded the collector to 0.78.0 and are noticing a massive increase in CPU utilization. It seems that the main cause is an increase in GC activity. Looking at the heap profile, I see the k8sattributes processor as a top culprit.
Collector version
0.78.2
Environment information
Environment
Kubernetes
OpenTelemetry Collector configuration
Log output
No response
Additional context
This appears to be caused by #14003 where the processor started tracking replica sets.
CPU usage before/after upgrade