
Load averages #249

Open
lars-t-hansen opened this issue Feb 5, 2025 · 3 comments
Labels: enhancement, Logging, question

lars-t-hansen (Collaborator) commented Feb 5, 2025

(To be fleshed out)

Similar to #242, we want to exfiltrate the node load average as a proxy for the number of runnable processes, to give an idea of the imbalance between allocated resources and available work.
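
For the node-level number, collection is straightforward on Linux: the kernel exposes the three load averages, plus a count of currently runnable scheduling entities, in /proc/loadavg. A minimal sketch of a reader (an illustration only, not Sonar's actual collection code):

```rust
use std::fs;

// Sketch of a /proc/loadavg reader; field names are illustrative.
#[derive(Debug)]
struct LoadAvg {
    load1: f64,    // 1-minute load average
    load5: f64,    // 5-minute load average
    load15: f64,   // 15-minute load average
    runnable: u64, // currently runnable scheduling entities
    total: u64,    // total scheduling entities on the system
}

fn read_loadavg() -> Option<LoadAvg> {
    // /proc/loadavg looks like: "0.42 0.37 0.30 2/1153 48291"
    let text = fs::read_to_string("/proc/loadavg").ok()?;
    let mut fields = text.split_whitespace();
    let load1 = fields.next()?.parse().ok()?;
    let load5 = fields.next()?.parse().ok()?;
    let load15 = fields.next()?.parse().ok()?;
    let (runnable, total) = fields.next()?.split_once('/')?;
    Some(LoadAvg {
        load1,
        load5,
        load15,
        runnable: runnable.parse().ok()?,
        total: total.parse().ok()?,
    })
}

fn main() {
    println!("{:?}", read_loadavg());
}
```

The fourth field of /proc/loadavg ("runnable/total", e.g. 2/1153) is relevant here: the runnable count is the instantaneous quantity that the load averages smooth over time.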

For a Slurm job we really want the number of runnable processes / the load average for the job, in the context of its allocation. This may be very hard in the context of Sonar, but it may be easier if we use a Slurm-centric extractor, see #187. This is all TBD and very speculative.
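
One conceivable approximation for the per-job number (a sketch only; it assumes the job's processes can be enumerated via its cgroup, the cgroup path below is hypothetical, and the whole approach is exactly the kind of thing that is TBD) is to count runnable threads in the job's cgroup at each sample:

```rust
use std::fs;

// Count runnable ('R') threads among the processes listed in a cgroup's
// cgroup.procs file. This is a point-in-time approximation of a per-job
// "load", not a true load average.
fn runnable_threads_in_cgroup(cgroup_path: &str) -> std::io::Result<u64> {
    let pids = fs::read_to_string(format!("{}/cgroup.procs", cgroup_path))?;
    let mut runnable = 0;
    for pid in pids.lines() {
        // Each thread has its own stat file under /proc/<pid>/task/<tid>/stat.
        let task_dir = format!("/proc/{}/task", pid);
        let Ok(tasks) = fs::read_dir(&task_dir) else { continue }; // process may have exited
        for task in tasks.flatten() {
            let Ok(stat) = fs::read_to_string(task.path().join("stat")) else { continue };
            // The single-letter state field follows the parenthesized command name.
            if let Some(rest) = stat.rsplit(')').next() {
                if rest.trim_start().starts_with('R') {
                    runnable += 1;
                }
            }
        }
    }
    Ok(runnable)
}

fn main() -> std::io::Result<()> {
    // Hypothetical cgroup path; the real per-job path depends on the
    // Slurm cgroup configuration.
    let n = runnable_threads_in_cgroup("/sys/fs/cgroup/system.slice")?;
    println!("runnable threads: {}", n);
    Ok(())
}
```

This yields an instantaneous runnable count rather than a 1/5/15-minute average; any smoothing would have to happen in the collector or downstream.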

lars-t-hansen added the enhancement, question, and Logging labels on Feb 5, 2025
lars-t-hansen (Collaborator, Author) commented

It would appear that the load average is a tricky number. First, it is not isolated to the cgroup; it is system-wide. Second, what happens inside a cgroup can affect the system-wide reading: if the cgroup is underprovisioned, so that the work inside it generates a high load, that high load average becomes visible to everyone on the node. So on a node that is allocated exclusively to one job (Betzy, at least) the load average is probably a pretty good indicator; on a node that is allocated shared (Fox, at least) it is potentially not.

lars-t-hansen (Collaborator, Author) commented

Which is to say that I think we should just collect the load average at every sampling time and exfiltrate it; it will be one of many signals we can examine. If we also collect the number of threads (#242), then together with the number of users, processes, and jobs on the node we'll have a lot of data to help us understand what happened. It won't necessarily be easy to write a simple, predictable query against these data, but for interactive examination, e.g. in a support case, they will be useful.
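
As a rough illustration of that, the sampler could simply attach the load-average fields to a per-node record at each sampling time; the record layout and key names below are made up for the example and are not Sonar's actual output format:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical per-sample node record; Sonar's real field names and
// output format for this are still to be decided in this issue.
struct NodeLoadSample {
    timestamp_secs: u64,
    load1: f64,
    load5: f64,
    load15: f64,
    runnable: u64, // from the fourth field of /proc/loadavg
}

impl NodeLoadSample {
    // Emit one sample as comma-separated key=value fields.
    fn emit(&self) {
        println!(
            "time={},load1={},load5={},load15={},runnable={}",
            self.timestamp_secs, self.load1, self.load5, self.load15, self.runnable
        );
    }
}

fn main() {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_secs();
    // Placeholder values; in practice these would come from /proc/loadavg
    // as in the earlier sketch.
    let sample = NodeLoadSample {
        timestamp_secs: now,
        load1: 3.20,
        load5: 2.75,
        load15: 1.90,
        runnable: 4,
    };
    sample.emit();
}
```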
