Disappearing metrics in the prometheus exporter when a histogram metric is received with the same name but a different description #36493
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
How often is the metric being collected? There is an expiration time in the exporter for old points.
Hi @dashpole, thanks for your response. Metrics are scraped every minute, and the expiration time on the prometheus exporter is set to 10 minutes. However, when I was testing I actually didn't use the Prometheus scrape: instead I logged into the host where the otel collector is running and used curl against localhost to check the prometheus exporter endpoint directly, like:
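(The exact command wasn't preserved in this thread; the sketch below is a minimal Go equivalent of that check. The port 8889 is an assumption for illustration — adjust it to whatever endpoint your prometheus exporter is configured with.)

```go
// check.go: count how many http_server_duration_milliseconds_count samples the
// exporter currently serves. The endpoint localhost:8889 is an assumed value.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:8889/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	count := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Count only sample lines, not the # HELP / # TYPE comment lines.
		if strings.HasPrefix(line, "http_server_duration_milliseconds_count") {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("http_server_duration_milliseconds_count samples:", count)
}
```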
I did this curl multiple times within the same minute. When I filter out the metric below, the problem goes away:
I am trying to simulate a payload equal to the one that seems to be causing the issue; if I'm successful I will report it here.
Hello @dashpole, I did some testing and figured out what is causing the bug.

Cause: the same metric being received with different descriptions.

Explanation: the auto-instrumentation agent for Node.js version 1.22.0 sends the histogram metric http.server.duration with one description, while the auto-instrumentation agent for Java version 1.34.1 sends the same metric http.server.duration with a different description. When this happens, the Prometheus exporter keeps "switching" back and forth which of the metrics it serves, seemingly at random.

How to reproduce: pretty straightforward. Use this:
Send a histogram metric with some description:
Then send a metric with the same name, but a different description:
Finally, check the exposed Prometheus endpoint. It will sometimes show the first metric's data points and sometimes the second metric's data points, seemingly at random. Keep checking the endpoint repeatedly; eventually you will see the exposed metric change in an inconsistent manner.
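As a rough sketch of those steps, the same conflict can be produced with the OpenTelemetry Go SDK by exporting the histogram from two separate meter providers, standing in for the Node.js and Java services. The OTLP endpoint, service names, and description strings below are placeholders, not the actual values sent by the agents.

```go
// repro.go: send http.server.duration from two "services" with conflicting descriptions.
// Assumes the collector's OTLP gRPC receiver listens on localhost:4317.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
)

func sendHistogram(ctx context.Context, serviceName, description string) {
	exp, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithResource(resource.NewSchemaless(attribute.String("service.name", serviceName))),
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
	)
	defer mp.Shutdown(ctx)

	hist, err := mp.Meter("repro").Float64Histogram(
		"http.server.duration",
		metric.WithUnit("ms"),
		metric.WithDescription(description), // only the description differs between the two senders
	)
	if err != nil {
		log.Fatal(err)
	}
	hist.Record(ctx, 42)
	if err := mp.ForceFlush(ctx); err != nil {
		log.Fatal(err)
	}
}

func main() {
	ctx := context.Background()
	// Same metric name, two different descriptions, as if coming from two different agents.
	sendHistogram(ctx, "nodejs-service", "placeholder description A")
	sendHistogram(ctx, "java-service", "placeholder description B")
}
```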
Why is this a problem: I believe the prometheus exporter should be resilient enough to handle bad input. I would suggest that when receiving a metric point with the same metric name but a different description, the component could either override the old description or ignore the new one. Perhaps this could even be made configurable in the component?
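Purely as an illustration of the "ignore the new description" option (this is not the exporter's actual code, just a sketch of the idea), the behavior is simply "first description wins" per metric name:

```go
// Hypothetical sketch of "first description wins" handling for conflicting
// metric descriptions; not the prometheusexporter implementation.
package main

import "fmt"

type metricFamily struct {
	name        string
	description string
}

type registry struct {
	families map[string]*metricFamily
}

// register keeps the first description seen for a metric name and ignores
// later, conflicting descriptions instead of letting them alternate.
func (r *registry) register(name, description string) *metricFamily {
	if mf, ok := r.families[name]; ok {
		return mf // existing description wins; the new one is ignored
	}
	mf := &metricFamily{name: name, description: description}
	r.families[name] = mf
	return mf
}

func main() {
	r := &registry{families: map[string]*metricFamily{}}
	r.register("http.server.duration", "description from the Node.js agent")
	r.register("http.server.duration", "description from the Java agent") // ignored
	fmt.Println(r.families["http.server.duration"].description)
}
```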
If you need more details please let me know; I'll help however I can.
I think this might be fixed by #36356. Can you try with v0.114.0 of the collector?
Thanks, you're right, v0.114.0 solves the issue!
Component(s)
exporter/prometheus
What happened?
Description
On otel collector contrib 0.111.0 running as a systemd service, when exposing metrics with the Prometheus exporter I can see that sometimes some metrics are missing.
For example, http_server_duration_milliseconds_count would return 44 samples most of the time, but sometimes return only 6 samples, even when running the test within the same second.
I tested this with curl to rule out possible scraping errors:
Of course, this causes the scrape to sometimes have missing data, seemingly at random, causing "gaps" in the data.
After digging around, I isolated the issue to at least one histogram metric. When I filter out this metric, the issue goes away:
http.server.duration{service.name=ps-sac-fe}
In other words, it seems this histogram is somehow breaking the prometheus exporter.
Steps to Reproduce
This issue happened in a production collector. I'm still not really sure why it's happening, but I exported the metric that seems to be the culprit for debugging and added it to the "Log output" section.
I am not sure what is wrong with the metric, but when I filter it out, the issue does go away.
Expected Result
Multiple curls to localhost/metrics should consistently return the same number of time series.
Actual Result
Multiple curls to localhost/metrics return a different number of time series: some time series are missing, seemingly at random, even when the curls are made within the same minute or even the same second.
If any other test is needed, please let me know.
Collector version
v0.111.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
Additional context
No response