Similar to #242, we want to exfiltrate the node load average as a proxy for the number of runnable jobs, to give an idea of imbalance between allocated resources and available work.
For a Slurm job, what we really want is the number of runnable jobs (the load average) for the job itself, in the context of its allocation. This may be very hard to get within Sonar, but it may be easier if we use a Slurm-centric extractor, see #187. This is all TBD and very speculative.
It would appear that the load average is a tricky number. First, it is not isolated to the cgroup; it is system-wide. Second, what happens inside a cgroup can affect the system-wide reading: if the cgroup is underprovisioned, so that the work inside it produces a high load average, that high load average becomes visible to everyone. So on a node that is allocated exclusively to a cgroup (Betzy at least) the load average is probably a pretty good indicator; on a node that is shared (Fox at least) it potentially is not.
Which is to say that I think we should just collect the load average at every sampling time and exfiltrate it, and it will be one of many signals we can examine. If we also collect the number of threads (#242), then together with the number of users, processes, and jobs on the node we'll have a lot of data to help us understand what happened. It won't necessarily be easy to write a simple, predictable query for the data, but for interactive examination, e.g. in a support case, it will be useful.
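For illustration only, here is a minimal Python sketch of what collecting the load average at each sampling time could look like on Linux. It is not Sonar's implementation. It assumes the standard `/proc/loadavg` layout (three load averages, then `runnable/total` scheduling entities, then the most recent PID); the names `LoadSample`, `parse_loadavg`, and `sample_loadavg` are made up for this example.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LoadSample:
    """One system-wide load reading (hypothetical record type)."""
    load1: float    # 1-minute load average
    load5: float    # 5-minute load average
    load15: float   # 15-minute load average
    runnable: int   # currently runnable scheduling entities
    total: int      # scheduling entities that exist on the system


def parse_loadavg(text: str) -> LoadSample:
    """Parse the contents of /proc/loadavg, e.g. '0.20 0.18 0.12 1/80 11206'."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return LoadSample(
        load1=float(fields[0]),
        load5=float(fields[1]),
        load15=float(fields[2]),
        runnable=int(runnable),
        total=int(total),
    )


def sample_loadavg(path: str = "/proc/loadavg") -> LoadSample:
    """Read and parse the node-wide load average at sampling time (Linux only)."""
    return parse_loadavg(Path(path).read_text())
```

As noted above, these values are node-wide, so on a shared node they say nothing about which job or cgroup is responsible for the load.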
(To be fleshed out)