
Load averages #249

Open
lars-t-hansen opened this issue Feb 5, 2025 · 3 comments
Labels: enhancement, Logging, question

lars-t-hansen (Collaborator) commented Feb 5, 2025

(To be fleshed out)

Similar to #242, we want to exfiltrate the node load average as a proxy for the number of runnable processes, to give an idea of the imbalance between allocated resources and available work.
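
For the node-level number, collection is straightforward on Linux: the kernel exposes the three load averages, plus a count of currently runnable scheduling entities, in /proc/loadavg. A minimal sketch of a reader (an illustration only, not Sonar's actual collection code):

```rust
use std::fs;

// Sketch of a /proc/loadavg reader; field names are illustrative.
#[derive(Debug)]
struct LoadAvg {
    load1: f64,    // 1-minute load average
    load5: f64,    // 5-minute load average
    load15: f64,   // 15-minute load average
    runnable: u64, // currently runnable scheduling entities
    total: u64,    // total scheduling entities on the system
}

fn read_loadavg() -> Option<LoadAvg> {
    // /proc/loadavg looks like: "0.42 0.37 0.30 2/1153 48291"
    let text = fs::read_to_string("/proc/loadavg").ok()?;
    let mut fields = text.split_whitespace();
    let load1 = fields.next()?.parse().ok()?;
    let load5 = fields.next()?.parse().ok()?;
    let load15 = fields.next()?.parse().ok()?;
    let (runnable, total) = fields.next()?.split_once('/')?;
    Some(LoadAvg {
        load1,
        load5,
        load15,
        runnable: runnable.parse().ok()?,
        total: total.parse().ok()?,
    })
}

fn main() {
    println!("{:?}", read_loadavg());
}
```

The fourth field of /proc/loadavg ("runnable/total", e.g. 2/1153) is relevant here: the runnable count is the instantaneous quantity that the load averages smooth over time.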

For a Slurm job we really want the number of runnable processes / the load average for the job, in the context of its allocation. This may be very hard in the context of Sonar, but it may be easier if we use a Slurm-centric extractor, see #187. This is all TBD and very speculative.
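
One conceivable approximation for the per-job number (a sketch only; it assumes the job's processes can be enumerated via its cgroup, the cgroup path below is hypothetical, and the whole approach is exactly the kind of thing that is TBD) is to count runnable threads in the job's cgroup at each sample:

```rust
use std::fs;

// Count runnable ('R') threads among the processes listed in a cgroup's
// cgroup.procs file. This is a point-in-time approximation of a per-job
// "load", not a true load average.
fn runnable_threads_in_cgroup(cgroup_path: &str) -> std::io::Result<u64> {
    let pids = fs::read_to_string(format!("{}/cgroup.procs", cgroup_path))?;
    let mut runnable = 0;
    for pid in pids.lines() {
        // Each thread has its own stat file under /proc/<pid>/task/<tid>/stat.
        let task_dir = format!("/proc/{}/task", pid);
        let Ok(tasks) = fs::read_dir(&task_dir) else { continue }; // process may have exited
        for task in tasks.flatten() {
            let Ok(stat) = fs::read_to_string(task.path().join("stat")) else { continue };
            // The single-letter state field follows the parenthesized command name.
            if let Some(rest) = stat.rsplit(')').next() {
                if rest.trim_start().starts_with('R') {
                    runnable += 1;
                }
            }
        }
    }
    Ok(runnable)
}

fn main() -> std::io::Result<()> {
    // Hypothetical cgroup path; the real per-job path depends on the
    // Slurm cgroup configuration.
    let n = runnable_threads_in_cgroup("/sys/fs/cgroup/system.slice")?;
    println!("runnable threads: {}", n);
    Ok(())
}
```

This yields an instantaneous runnable count rather than a 1/5/15-minute average; any smoothing would have to happen in the collector or downstream.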

lars-t-hansen added the enhancement, question, and Logging labels on Feb 5, 2025
lars-t-hansen (Collaborator, Author) commented

It would appear that the load average is a tricky number. First, it is not isolated to the cgroup; it is system-wide. Second, what happens inside a cgroup can affect the system-wide reading: if the cgroup is underprovisioned, so that the work inside it generates a high load, that high load average becomes visible to everyone on the node. So on a node that is allocated exclusively to one job (Betzy, at least) the load average is probably a pretty good indicator; on a node that is allocated shared (Fox, at least) it is potentially not.

lars-t-hansen (Collaborator, Author) commented

Which is to say that I think we should just collect the load average at every sampling time and exfiltrate it; it will be one of many signals we can examine. If we also collect the number of threads (#242), then together with the number of users, processes, and jobs on the node we'll have a lot of data to help us understand what happened. It won't necessarily be easy to write a simple, predictable query against these data, but for interactive examination, e.g. in a support case, they will be useful.
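
As a rough illustration of that, the sampler could simply attach the load-average fields to a per-node record at each sampling time; the record layout and key names below are made up for the example and are not Sonar's actual output format:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical per-sample node record; Sonar's real field names and
// output format for this are still to be decided in this issue.
struct NodeLoadSample {
    timestamp_secs: u64,
    load1: f64,
    load5: f64,
    load15: f64,
    runnable: u64, // from the fourth field of /proc/loadavg
}

impl NodeLoadSample {
    // Emit one sample as comma-separated key=value fields.
    fn emit(&self) {
        println!(
            "time={},load1={},load5={},load15={},runnable={}",
            self.timestamp_secs, self.load1, self.load5, self.load15, self.runnable
        );
    }
}

fn main() {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_secs();
    // Placeholder values; in practice these would come from /proc/loadavg
    // as in the earlier sketch.
    let sample = NodeLoadSample {
        timestamp_secs: now,
        load1: 3.20,
        load5: 2.75,
        load15: 1.90,
        runnable: 4,
    };
    sample.emit();
}
```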
