Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Event loop lag metric (gauge) does NOT measure average event lag #561

Open
shappir opened this issue Mar 27, 2023 · 2 comments
Open

Event loop lag metric (gauge) does NOT measure average event lag #561

shappir opened this issue Mar 27, 2023 · 2 comments

Comments

@shappir
Copy link
Contributor

shappir commented Mar 27, 2023

The built-in metric (gauge) nodejs_eventloop_lag_seconds is marked as measuring the average event loop lag: the aggregator property value for it is average.

However, what it actually reports is the amount of time required to generate and send the previous metrics response.

This is equivalent to the maximum event loop lag value rather than the average. To get the actual average you need to use nodejs_eventloop_lag_mean_seconds instead (for Node servers that support it). In our case the average event loop lag is under 11ms while the nodejs_eventloop_lag_seconds gets as high as 1 second !

I recommend:

  1. Fixing nodejs_eventloop_lag_seconds to actually report the average event loop lag
  2. Introduce a new metric to report the metrics response generation time, say nodejs_metrics_response_time

Important: the amount of time required to generate the metrics response remains high, and needs to be improved.

@shappir
Copy link
Contributor Author

shappir commented Mar 27, 2023

This is how nodejs_eventloop_lag_seconds is currently computed:

  1. In eventLoopLag.js a Gauge is created with that name and a collect method
  2. That collect method is invoked whenever the metric's values are requested
  3. The method measures the high-res time when it is invoked. Since it's a built-in metric, it will be one of the first, and so records the time at which metrics collection starts
  4. It then uses setImmediate to schedule another measurement. But since that measurement happens in the next iteration of the event loop, it will take place after the metrics are all collected and reported
  5. Since the collect method doesn't return a promise, the value it measures will only be reported in the next metrics invocation.

Since the delta is from when metrics starts until after the response is sent, this is what it measures. And it's not updated until the next invocation of metrics.

@AntunVukicevic
Copy link

I just spent hours figuring out why this metric deviates so much from the rest of event loop metrics. Thank you for clarification @shappir 👏🏻

It would be really nice if this is fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants