Quality-of-life improvements for hosts of QCFractal manager instances #261
Quality-of-life feedback on the managers is always great to hear! I think we can accomplish some of the items you pointed out:
One thing to keep in mind is that HPC queues aim to give all users a fair shake, so putting this onus on the HPC staff isn't unreasonable. Most of the HPC staff we talk to point out that it isn't really a user's job to worry about consuming supercomputing time fairly, as long as reasonable practices are followed and the user isn't deliberately trying to abuse the system. The distributed queue managers do not use any abusive practices, and we often hear that filling up a low-priority or interruptible queue is actually beneficial to the HPC center itself. If you have experiences with HPC staff that contradict the above, it would be good to hear!
I think this by itself would be very helpful.
I mostly proposed this in the unlikely event that the first option wasn't possible. It looks like this probably isn't necessary.
This is not incredibly urgent, and I could homebrew a solution to this in the meantime. We could revisit the idea in a few months.
That makes sense. Thanks for the feedback!
We raised the first item in #262. We might be able to get this into the 0.7.0 release this week. Re: emailing, it turns out this is possible with built-in libraries. I will make an issue for this.
Numbers 1 and 2, and some of 4, should have been put in by #313. I'm going to leave this open because I still think getting the rest of 4 in would be good.
Is your feature request related to a problem? Please describe.
QCFractal managers work well on a technical level, but it's tough to keep tabs on whether I've set the max resources too low, or whether I'm being a bad "cluster citizen" by gumming up the free queue. Right now, I run managers on resources that I don't regularly use or monitor, and I don't know how much CPU time their workers are actually consuming. I don't see any problem with using thousands of cores on a quiet Sunday night, but I want to make sure I'm not throwing jobs in someone's way on a Wednesday. HPC staff have been evasive about recommending resource levels, so I'm stuck in the awkward position of guessing at max resource values without ever really knowing the effect.
To be clear -- this is a human concern, not a technical one (and some clusters do dynamically adjust priority to handle this sort of thing). But the users hosting QCFractal manager instances are responsible for the resources they end up consuming, so it would be good to give them a way to understand their resource burden (or, conversely, their under-utilization).
Describe the solution you'd like
I think a few features could help with this:
Describe alternatives you've considered
The integral-of-resource-usage item could be handled with a script that runs qstat (or equivalent) at an interval, parses the text, and performs a rough integral of cores × time; a sketch follows below.
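For illustration, here is a minimal sketch of such a poller in Python. It assumes a `qstat`-style listing where the job state and core count sit in fixed columns (both assumptions to adapt for your scheduler), and the username is a placeholder:

```python
#!/usr/bin/env python3
"""Rough core-hour integration by polling qstat at a fixed interval."""
import subprocess
import time

POLL_INTERVAL_S = 300   # sample every 5 minutes
USER = "myuser"         # placeholder: your cluster username

core_seconds = 0.0
while True:
    out = subprocess.run(
        ["qstat", "-u", USER], capture_output=True, text=True
    ).stdout
    running_cores = 0
    for line in out.splitlines():
        fields = line.split()
        # Assumed column layout: job state in fields[4], cores in fields[6];
        # adjust to whatever your scheduler actually prints.
        if len(fields) > 6 and fields[4] == "R" and fields[6].isdigit():
            running_cores += int(fields[6])
    # Left rectangle rule: cores held now, times seconds until next sample.
    core_seconds += running_cores * POLL_INTERVAL_S
    print(f"~{core_seconds / 3600:.1f} core-hours consumed so far")
    time.sleep(POLL_INTERVAL_S)
```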
The email-when-a-manager-dies idea could be done using a bash script (or a short Python script; see the sketch below).
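A minimal watchdog sketch, again in Python and using only standard-library modules; the process pattern, SMTP host, and addresses are all placeholders, and it assumes a local mail relay is accepting connections:

```python
#!/usr/bin/env python3
"""Send an email once the manager process can no longer be found."""
import smtplib
import subprocess
import time
from email.message import EmailMessage

CHECK_INTERVAL_S = 600
PROCESS_PATTERN = "qcfractal-manager"  # placeholder: pattern matching your manager

def manager_alive() -> bool:
    # pgrep exits nonzero when nothing matches the pattern.
    result = subprocess.run(
        ["pgrep", "-f", PROCESS_PATTERN], capture_output=True
    )
    return result.returncode == 0

while True:
    if not manager_alive():
        msg = EmailMessage()
        msg["Subject"] = "QCFractal manager is down"
        msg["From"] = "alerts@example.com"   # placeholder addresses
        msg["To"] = "me@example.com"
        msg.set_content("The manager process is no longer running.")
        with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
            smtp.send_message(msg)
        break
    time.sleep(CHECK_INTERVAL_S)
```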