Quality-of-life improvements for hosts of QCFractal manager instances #261

Open
j-wags opened this issue May 18, 2019 · 4 comments
Labels: Enhancement (Project enhancement discussion)

j-wags commented May 18, 2019

Is your feature request related to a problem? Please describe.

QCFractal managers work well on a technical level, but it's tough to keep tabs on whether I've set the max resources too low, or whether I'm being a bad "cluster citizen" by gumming up the free queue. Right now, I run managers on resources that I don't regularly use or monitor, and I don't know how much CPU time their workers are actually consuming. I don't see any problem with using thousands of cores on a quiet Sunday night, but I want to make sure I'm not spamming jobs that get in someone's way on a Wednesday. HPC staff have been dodgy about recommending resource levels, so I'm in this weird situation where I just guess at max resource values and am never really sure about the effect.

To be clear -- this is a human, not technical, concern (and some clusters do dynamically change priority to handle this sort of thing). But the users hosting QCFractal manager instances are responsible for the resources they end up using, so it would be good to give them a way to understand their resource burden (or, conversely, their under-utilization).

Describe the solution you'd like
I think a few features could help with this:

  • The option to have the manager write out the number of core-hours it has used at some interval.
  • The option to query the central database for the same data (maybe by summing job walltimes, if those are recorded).
  • A way to email the maintainer when a manager dies.
  • This one would be technically tough, and I can't do more than handwave (a rough sketch follows this list). But it would be good to have some heuristic measure of "queue friction". This would have something to do with % queue utilization, the fraction of jobs in the queue that were submitted by QCFractal vs. other users, and/or the total hours other users' jobs spent waiting in the queue while QCFractal ran.
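
For concreteness, here is one shape such a heuristic could take on a SLURM cluster. This is purely an illustrative sketch, not anything QCFractal provides; the username, the squeue parsing, and the scoring formula are all assumptions:

```python
# Hypothetical "queue friction" score -- an illustration only.
# Assumes a SLURM cluster with `squeue` on PATH; MY_USER and the
# weighting in friction() are made up for the example.
import subprocess

MY_USER = "jwags"  # hypothetical account the managers run under

def queue_snapshot():
    """Count pending/running jobs for us vs. everyone else."""
    out = subprocess.check_output(
        ["squeue", "--noheader", "--format=%u %t"], text=True
    )
    mine_pd = other_pd = mine_r = other_r = 0
    for line in out.splitlines():
        user, state = line.split()
        if state == "PD" and user == MY_USER:    # pending, ours
            mine_pd += 1
        elif state == "PD":                      # pending, others
            other_pd += 1
        elif state == "R" and user == MY_USER:   # running, ours
            mine_r += 1
        elif state == "R":                       # running, others
            other_r += 1
    return mine_pd, other_pd, mine_r, other_r

def friction(mine_pd, other_pd, mine_r, other_r):
    """Crude score: our share of the queue, weighted by how many
    other users' jobs are left waiting behind it."""
    total = mine_pd + other_pd + mine_r + other_r
    if total == 0:
        return 0.0
    our_share = (mine_pd + mine_r) / total
    return our_share * other_pd
```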

Describe alternatives you've considered

The integral-of-resource-usage could be handled with a script that runs qstat (or equivalent) at an interval, parses the text, and performs some sort of rough integral of cores x time.
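
A minimal sketch of that polling loop, assuming a SLURM cluster (so squeue stands in for qstat); the username and sampling interval are illustrative:

```python
# Rough integral of cores x time via scheduler polling -- a sketch,
# assuming SLURM's `squeue` (substitute `qstat` parsing on PBS).
# MY_USER and INTERVAL_S are illustrative assumptions.
import subprocess
import time

MY_USER = "jwags"   # hypothetical account the managers run under
INTERVAL_S = 60     # sampling interval in seconds

core_seconds = 0.0
while True:
    out = subprocess.check_output(
        ["squeue", "--noheader", "--user", MY_USER,
         "--states=R", "--format=%C"],  # %C = CPUs per running job
        text=True,
    )
    cores_now = sum(int(c) for c in out.split())
    core_seconds += cores_now * INTERVAL_S  # left Riemann sum
    print(f"~{core_seconds / 3600:.1f} core-hours so far")
    time.sleep(INTERVAL_S)
```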

The email-when-a-manager-dies idea could be done using a bash script.
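
In Python rather than bash, that watchdog could be as simple as the sketch below; the manager command line, the addresses, and the local SMTP host are all assumptions:

```python
# External watchdog: run the manager, block until it exits, then email.
# A sketch, not QCFractal functionality; the manager command line, the
# addresses, and the local SMTP host are illustrative assumptions.
import smtplib
import subprocess
from email.message import EmailMessage

proc = subprocess.run(["qcfractal-manager", "--config-file", "manager.yaml"])

msg = EmailMessage()
msg["Subject"] = f"QCFractal manager exited (code {proc.returncode})"
msg["From"] = "manager@cluster.example.edu"  # assumed sender
msg["To"] = "maintainer@example.edu"         # assumed recipient
msg.set_content("The manager process has stopped; check its logs.")

with smtplib.SMTP("localhost") as smtp:      # assumes a local MTA
    smtp.send_message(msg)
```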

@dgasmith (Contributor) commented

Quality of life feedback on the managers is always great to hear! I think we can accomplish some of the items that you pointed out:

  • We can show the (approximate) number of core-hours consumed and the number of currently used nodes/cores.
  • Querying the central database for this information is doable, but it will take a while to shake out the permissions. We can certainly look into that.
  • Emailing on shutdown is doable, but potentially complex. Something we can look into; how high a priority would you rate this?
  • Queue friction likely isn't doable in a generic fashion. QCFractal explicitly avoids dealing with the queue itself, leaving that to the workflow managers whose scope this kind of feature falls under. This is a request we can pass up the chain.

One thing to keep in mind is that HPC queues aim to give all users a fair shake, so putting this onus on the HPC staff isn't unreasonable. Most of the HPC staff we talk to point out that it isn't really a user's job to worry about consuming supercomputing time fairly, as long as reasonable practices are followed and the user isn't deliberately trying to abuse the system. The distributed queue managers do not employ any abusive practices, and we often hear that filling up a low-priority or interruptible queue is actually beneficial to the HPC center itself.

If you have experiences with HPC staff that are contrary to the above, it would be good to hear!

@j-wags (Author) commented May 22, 2019

> We can show the (approximate) number of core-hours consumed and the number of currently used nodes/cores.

I think this by itself would be very helpful.

> Querying the central database for this information is doable, but it will take a while to shake out the permissions. We can certainly look into that.

I mostly proposed this as a fallback in the unlikely event that the first option wasn't possible, so it probably isn't necessary.

> Emailing on shutdown is doable, but potentially complex. Something we can look into; how high a priority would you rate this?

This is not incredibly urgent, and I could homebrew a solution to this in the meantime. We could revisit the idea in a few months.

> Queue friction likely isn't doable in a generic fashion. QCFractal explicitly avoids dealing with the queue itself, leaving that to the workflow managers whose scope this kind of feature falls under. This is a request we can pass up the chain.

That makes sense. Thanks for the feedback!

@dgasmith (Contributor) commented

We raised the first item in #262. We might be able to get this into the 0.7.0 release this week.

Re emailing, it turns out this is possible with built-in libraries. I will make an issue for this.
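
Presumably something along these lines with the stdlib's atexit and smtplib; the hook placement, addresses, and SMTP host here are assumptions, not the actual implementation:

```python
# In-process variant using only the standard library: register an
# atexit hook that mails the maintainer when the process shuts down.
# The addresses and SMTP host are assumptions, not QCFractal's design.
import atexit
import smtplib
from email.message import EmailMessage

def _notify_shutdown():
    msg = EmailMessage()
    msg["Subject"] = "QCFractal manager shutting down"
    msg["From"] = "manager@cluster.example.edu"  # assumed sender
    msg["To"] = "maintainer@example.edu"         # assumed recipient
    msg.set_content("Manager exiting (clean shutdown or fatal error).")
    try:
        with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA
            smtp.send_message(msg)
    except OSError:
        pass  # don't let a mail failure mask the real shutdown path

atexit.register(_notify_shutdown)
```

Note that atexit only fires on normal interpreter shutdown; a hard kill from the scheduler (e.g. SIGKILL) would still need an external watchdog like the one sketched earlier in the thread.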

@Lnaden (Collaborator) commented Jul 10, 2019

Items 1, 2, and some of 4 should have been put in by #313. I'm going to leave this open because I still think getting the rest of 4 in would be good.

@dgasmith added the Enhancement (Project enhancement discussion) label on Oct 17, 2019