[prometheusremotewriteexporter] reduce allocations in createAttributes #35184
Conversation
6a5d2ca to c4d233a
// best to keep it around for the lifetime of the Go process. Due to this shared
// state, PrometheusConverter is NOT thread-safe and is only intended to be used by
// a single go-routine at a time.
// Each FromMetrics call should be followed by a Reset when the metrics can be safely |
Should we emit a warning log or something if someone calls FromMetrics without Resetting?
Moved the reset to always be called inside FromMetrics so this is no longer a user concern.
Now that the user doesn't need to call reset, should we remove this part of the comment?
// Each FromMetrics call should be followed by a Reset.....
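For context, a rough sketch of how resetting inside FromMetrics can work, using simplified, hypothetical types and signatures (Label, the labels field, and the parameterless FromMetrics below are illustrative, not the exporter's actual API):

```go
package converter

// Label is a simplified stand-in for a Prometheus label pair.
type Label struct {
	Name, Value string
}

// PrometheusConverter keeps a reusable labels buffer across conversions.
// Because of this shared state, it is not safe for concurrent use.
type PrometheusConverter struct {
	labels []Label
}

// reset truncates the buffer while keeping its capacity for reuse.
func (c *PrometheusConverter) reset() {
	c.labels = c.labels[:0]
}

// FromMetrics resets the shared buffer internally on every call, so callers
// never need to call Reset themselves (signature simplified for illustration).
func (c *PrometheusConverter) FromMetrics( /* metrics ... */ ) {
	c.reset()
	// ... convert the metrics, appending labels into c.labels ...
}
```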
We don't plan to keep this forever, right? Ideally we'll be able to shard this to improve throughput; we're just hardcoding this to 1 because OTel's exporter helper doesn't ensure ordering. On the other hand, I agree that we shouldn't block optimizations based on something we want to do in the future 😬. @edma2, knowing that we'll eventually shard the output, any suggestions on how to do this without sacrificing your optimization?
I also wonder whether having multiple pipelines with multiple remote write exporters (e.g. sending data from a dev cluster to 2 destinations, dev and prod) would break this too.
@ArthurSens my initial thought here is maybe wrap things in a sync.Pool.
@jmichalek132 Each exporter would have its own instance of PrometheusConverter.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
@ArthurSens @dashpole @jmichalek132 I addressed comments and also changed the implementation so it's now in a sync.Pool. This now supports concurrent access from the exporter class in case it ever supports more than 1 worker at a time. Please take a look!
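A rough sketch of the sync.Pool approach described above, building on the simplified PrometheusConverter from the earlier sketch (the pool and function names here are hypothetical, not the PR's actual wiring):

```go
package converter

import "sync"

// converterPool hands out PrometheusConverter instances so multiple exporter
// workers could run concurrently, while each converter still reuses its own
// label buffer between batches.
var converterPool = sync.Pool{
	New: func() any { return &PrometheusConverter{} },
}

// convertBatch borrows a converter for one batch and returns it to the pool.
func convertBatch( /* metrics ... */ ) {
	c := converterPool.Get().(*PrometheusConverter)
	defer converterPool.Put(c)
	c.FromMetrics()
}
```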
b337721 to 6f941e3
Awesome edma! I'm struggling a bit to find time to review this one, just wanted to let you know that this is on my list :)
(open-telemetry#57) createAttributes was allocating a new label slice for every series, which generated a lot of garbage (~30-40% of all allocations). Keep a reusable underlying array of labels around to reduce allocations on the hot path.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
@atoulme I've updated my branch and merged with main.
Hi @edma2, thanks for continuously coming back to this work! I've been struggling to find time to review this PR again; I just wanted to let you know that I'm aware of its existence and that I'm trying to allocate time for the review!
I see one more conflict on the code - please resolve and mark ready for review again.
@edma2, to be fully transparent and respectful of your time, let's put the work here on hold for a while. The optimization you're doing is excellent, but it's also not easy to understand without paying a decent amount of attention. The team's priority right now is adhering to the OTel->Prometheus specification and implementing version 2 of the Remote Write protocol, which is creating many conflicts in your PR. I'm a bit concerned that what I'm saying will put you in a bad mood, and I apologize for that, but I think it would be worse if I didn't say anything and let you keep resolving merge conflicts every single week 😕
@ArthurSens no problem, I totally understand if there are higher priorities right now. Thanks for giving me a heads up. Do you know when would be a good time to revisit the PR? Also, if splitting the PR into smaller pieces would make reviewing it easier, I can do that.
I'd say that after we solve #33661, it should be a good time to get back to this. We shouldn't see multiple merge conflicts after that.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Description:
While profiling the collector, we found that the createAttributes function was responsible for a significant chunk of allocations (30-40%), which led to high CPU usage spent in GC. createAttributes is responsible for converting the attributes of a given data point to Prometheus labels. For simplicity, it allocates a new labels slice for every data point. We found that reducing allocations here significantly reduced GC time in our environment (in some deployments by as much as ~50%).
The strategy in this PR is to reuse the slice's backing array as much as possible (see the sketch below). The backing array will automatically resize as needed (batching with a batch processor effectively sets an upper bound). Note: we don't need to synchronize access to this (e.g. with a sync.Pool) since the exporter is configured with 1 consumer.
Link to tracking Issue:
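As a rough illustration of the reuse strategy described above, here is a minimal sketch using the simplified PrometheusConverter and Label types from earlier (the map[string]string parameter is hypothetical; it is not the exporter's actual createAttributes signature):

```go
// createAttributes reuses the converter's backing array instead of allocating
// a fresh labels slice for every data point.
func (c *PrometheusConverter) createAttributes(attrs map[string]string) []Label {
	c.labels = c.labels[:0] // truncate, keeping previously grown capacity
	for name, value := range attrs {
		c.labels = append(c.labels, Label{Name: name, Value: value})
	}
	// The returned slice aliases c.labels, so it is only valid until the next
	// call and must not be shared across goroutines.
	return c.labels
}
```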
Testing:
Modified unit tests and ran benchmarks locally.
Works in our production environment.
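For reference, a minimal sketch of the kind of Go benchmark whose before/after runs could be compared with benchstat, reusing the simplified createAttributes sketch above (illustrative only, not the benchmark used in this PR):

```go
package converter

import "testing"

// BenchmarkCreateAttributes exercises the hot path so allocations can be
// measured with `go test -bench=CreateAttributes -benchmem`.
func BenchmarkCreateAttributes(b *testing.B) {
	c := &PrometheusConverter{}
	attrs := map[string]string{"service.name": "demo", "host.name": "node-1"}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = c.createAttributes(attrs)
	}
}
```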
benchstat output
Documentation: