-
Notifications
You must be signed in to change notification settings - Fork 5.5k
fix(native): Fix OS metrics to report cumulative values for AVG type #26517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Reviewer's guide (collapsed on small PRs)Reviewer's GuideThis PR converts six OS-related metrics from reporting delta values to reporting cumulative values by eliminating subtraction of previous readings and removing obsolete state variables used for delta calculations. Class diagram for updated PeriodicTaskManager OS metrics logicclassDiagram
class PeriodicTaskManager {
-lastHttpClientNumConnectionsCreated_: int64_t
+updateOperatingSystemStats()
+addOperatingSystemStatsUpdateTask()
}
%% Removed attributes for OS metric deltas
%% lastUserCpuTimeUs_, lastSystemCpuTimeUs_, lastSoftPageFaults_, lastHardPageFaults_, lastVoluntaryContextSwitches_, lastForcedContextSwitches_ are no longer present
Flow diagram for OS metrics reporting change (delta to cumulative)flowchart TD
A["Collect OS metric (e.g., user CPU time)"] --> B["Report cumulative value since process start"]
B --> C["RECORD_METRIC_VALUE(metric, cumulative_value)"]
%% Previously: A --> D["Subtract previous value (delta)"] --> C
%% Now: direct cumulative reporting
File-Level Changes
Assessment against linked issues
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
@majetideepak Could you help review this PR? Thanks. |
|
@majetideepak @karteekmurthys @aditi-pandit Kindly ping. Could you please review this PR? This issue affects the accuracy of Prometheus monitoring metrics. |
|
I think we should also change the metric type to SUM? In my understanding AVG type should just report the current value. |
@jaystarshot Thanks for your reply. We might first need to clarify the meaning of "current value" here: should it be a "current delta value" or "current cumulative value"? When we say "delta value" for metrics, it implicitly implies a "time span," such as a 5-second span or a 30-second span. It seems only the "cumulative value" corresponds to a "point in time," that is, the "current value" at a specific point in time. For OS-counters, the Regarding whether to change it to a "SUM type"(SUM type reports "delta value"), I believe it can also resolve the data loss issue mentioned in the issue (#26516, because I previously submitted a PR #23622 to fix SUM type metric reporting, the reported value will be accumulated in PrometheusReporter). However, because the |
|
@lingbin The root cause of all this is that the velox metric type AVERAGE is unclear and there is no documentation on what that should represent. In our production we use metric type SUM and use (persecond) and other differential to get accurate view in grapahana directly. Regarding this change if getrusage() function returns the "current cumulative value" (not the "delta") accumulated then for our prometheus reporting you can keep this change but be aware that it only reports the last value received. i.e if the puller pulls every 5 sec, it will only pick the last value received which i think is acceptable. |
|
@xiaoxmeng @amitkdutta Please can you comment. Its possible these metrics are monitored at Meta since Meng added them originally. |
|
@jaystarshot Thank you for your further explanation.
I'm also looking forward to a specific and consistent explanation of how to use each metric type.
Do you mean that in your production environment code, these |
|
After careful consideration, I've realized that for metrics whose values semantically increase monotonically (like the six OS-related metrics here), in Prometheus's "Pull Model", because the interval for "pulling metrics" differs from the "push interval" of
Perhaps we should document both methods so that developers can choose between them based on their needs? If it's easier to obtain the "cumulative value," then use the AVG type. If it's easier to obtain the "difference," then use the SUM type. The only point to note is that, generally speaking, "calculating the difference" can be a bit tedious because it requires saving the old values. Looking forward to everyone's guidance and suggestions for better practices, especially for usage already in production environments, thanks. |
|
No not this one, but we have changed some of which we do use. For cpu we currently just use our host metric system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets get @amitkdutta or @xiaoxmeng approval before submission. Have pinged Amit.
From IBM we are okay with this change. But I would prefer Meta confirm as well.
amitkdutta
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks @lingbin
aditi-pandit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
22ec2df to
cdc9c30
Compare
|
Already rebased to re-trigger CI. |
All 6 OS-related metrics were defined as **AVG** type but reported as **delta values**, causing incorrect averaging and potential data loss in Prometheus monitoring. Changed metrics to report **cumulative values** since process start: - presto_cpp.os_user_cpu_time_micros - presto_cpp.os_system_cpu_time_micros - presto_cpp.os_num_soft_page_faults - presto_cpp.os_num_hard_page_faults - presto_cpp.os_num_voluntary_context_switches - presto_cpp.os_num_forced_context_switches This ensures: 1. Alignment with other AVG metrics in the system (task counts, cache sizes, etc.) 2. Proper rate calculations in monitoring systems and no data loss regardless of scraping intervals
cdc9c30 to
d05707b
Compare
Fixes #26516
All 6 OS-related metrics were defined as AVG type but reported as
delta values, causing incorrect averaging and potential data loss
in Prometheus monitoring.
Changed metrics to report cumulative values since process start:
This ensures:
cache sizes, etc.)
regardless of scraping intervals