Manager Quality of Life statistics #313
Conversation
qcfractal/queue/managers.py
Outdated
now = self.statistics.last_update_time = time.time()
time_delta_seconds = now - last_time
try:
    running_tasks = self.queue_adapter.count_running()
Because counting the running tasks is a better gauge of how many resources are being consumed. If we only count workers, then we don't take into account that each worker can have multiple tasks running on it. I did try to work it out such that cores cannot be oversubscribed, but I don't know how effective that is.
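For illustration, a minimal sketch of the distinction (the mapping and names are hypothetical, not the PR's actual data structures):

# Hedged sketch: why counting tasks beats counting workers.
# worker_task_map is a hypothetical {worker_id: [task, ...]} mapping.
def count_running_tasks(worker_task_map):
    # A single worker can host several concurrent tasks, so counting
    # workers alone undercounts the resources actually in use.
    return sum(len(tasks) for tasks in worker_task_map.values())

# Example: 2 workers, each running 3 tasks -> 6 busy task slots, not 2.
print(count_running_tasks({"w1": ["t1", "t2", "t3"], "w2": ["t4", "t5", "t6"]}))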
qcfractal/queue/managers.py
Outdated
if result.success:
    n_success += 1
try:
Would recommend if hasattr(result.provenance, 'wall_time').
That's a cleaner construct, I'll change that
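A minimal sketch of the suggested construct (the stand-in objects are purely illustrative; the real result comes from the task queue):

from types import SimpleNamespace

# Stand-in for a completed task result, for illustration only.
result = SimpleNamespace(provenance=SimpleNamespace(wall_time=12.5))

# The suggested attribute check instead of a bare try/except:
wall_time_seconds = None
if hasattr(result.provenance, "wall_time"):
    wall_time_seconds = result.provenance.wall_time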
qcfractal/queue/managers.py
Outdated
    wall_time_seconds = result.input_data['provenance'].get('wall_time')
except:
    pass
try:
Why a try here?
In case, for whatever reason, the wall_time variable does not get filled in with an actual float. I'm trying to preempt getting something like a None here, or having a job crash because the cluster job died and something mangled came back. This is just a safety catch which won't bring down the whole manager if something goes very wrong.
Then again, we may want it to do exactly that to better catch actual bugs. Thoughts?
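A hedged sketch of the safety catch being described, assuming wall_time may come back as None or mangled if the cluster job died (cores_per_task and task_core_hours are illustrative stand-ins):

cores_per_task = 2
wall_time = None  # e.g. provenance came back without a usable value

try:
    task_core_hours = float(wall_time) * cores_per_task / 3600
except (TypeError, ValueError):
    # Swallow bad provenance rather than bringing down the whole manager.
    task_core_hours = 0.0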
qcfractal/queue/managers.py
Outdated
f"{self.statistics.total_failed_tasks} failed " | ||
f"({self.statistics.total_successful_tasks/self.statistics.total_completed_tasks*100}% " | ||
f"success rate)") | ||
self.logger.info(f"Stats: {self.statistics.total_cpu_hours} CPU Hours logged successfully (estimate)") |
Can we try to make this a single line?
(processed=5, failed=1, success=80%, core_hours=53.3)
Yes. I like that better
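A sketch of the proposed one-line summary; this is a hypothetical helper, not the PR's final implementation:

def format_task_stats(processed: int, failed: int, core_hours: float) -> str:
    # One-line summary in the format suggested above.
    success = (processed - failed) / processed * 100 if processed else 0.0
    return (f"Task stats (processed={processed}, failed={failed}, "
            f"success={success:.0f}%, core_hours={core_hours:.1f})")

# format_task_stats(5, 1, 53.3)
# -> 'Task stats (processed=5, failed=1, success=80%, core_hours=53.3)'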
qcfractal/queue/managers.py
Outdated
self.logger.info(f"Stats: {self.statistics.total_cpu_hours} CPU Hours logged successfully (estimate)") | ||
# Handle efficiency calculations | ||
if log_efficiency: | ||
efficiency = min((max_cpu_hours_running + task_cpu_hours)/max_cpu_hours_possible * 100, 100) |
Be very careful here, this isn't cpu hours but core hours.
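To illustrate the distinction (not the PR's code): core hours charge wall time for every core held by the task, whether or not the CPU was actually busy.

def core_hours(wall_time_seconds: float, cores_per_task: int) -> float:
    # Reserved-core time: wall clock multiplied by the cores held.
    return wall_time_seconds * cores_per_task / 3600.0

# A 2-core task that ran for 30 minutes consumed 1.0 core hour, even if the
# cores sat mostly idle (idle time is what CPU hours would exclude).
print(core_hours(1800, 2))  # 1.0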
qcfractal/queue/base_adapter.py
Outdated
@@ -199,6 +199,23 @@ def close(self) -> bool:
        True if the closing was successful.
        """

    def count_running(self) -> int:
If we do leave this, perhaps count_running_tasks to be explicit.
Added quality of life stats to the Queue Manager reporting, only really works in parsl for efficiency, but will track totals as well. Progress towards (and possible resolution) of MolSSI#262
Overall LGTM, I think this will be really nice! Is there any way you can snapshot a log to give an idea of what this looks like in practice?
qcfractal/cli/qcfractal_manager.py
Outdated
if settings.manager.max_queued_tasks is None:
    # Tasks * jobs * buffer + 1
    max_queued_tasks = ceil(settings.common.tasks_per_worker * settings.common.max_workers * 1.20) + 1
    max_queued_tasks = ceil(max_concurrent_tasks * 1.20) + 1
Let us go ahead and bump this to 2x.
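A hedged sketch of the suggested change; the settings values below are illustrative, and the names mirror the diff above:

from math import ceil

tasks_per_worker, max_workers = 2, 4          # illustrative settings
max_concurrent_tasks = tasks_per_worker * max_workers

# Old buffer: 1.2x concurrency + 1; suggested: 2x concurrency + 1.
max_queued_tasks = ceil(max_concurrent_tasks * 2.0) + 1  # -> 17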
@@ -78,6 +82,10 @@ def _submit_task(self, task_spec: Dict[str, Any]) -> Tuple[Hashable, Any]:
        func, *task_spec["spec"]["args"], **task_spec["spec"]["kwargs"], resources={"process": 1})
        return task_spec["id"], task

    def count_running_workers(self) -> int:
We should be able to call this function.
running_workers = 0
log_efficiency = False
max_core_hours_running = time_delta_seconds * running_workers * self.statistics.cores_per_task / 3600
max_core_hours_possible = (time_delta_seconds * self.statistics.max_concurrent_tasks
I'm not sure max_core_hours_possible is too interesting here? This number focuses on your own queue and how active you were. I guess for a low queue it's possibly nice.
I'd like to leave this one in because it gives an upfront estimate of how well we are utilizing the resources that were requested versus what was actually allocated. If, say, someone fires this up and only gets half the nodes they requested, they can look into that. Or, for instance with Dask, users sometimes request a bunch of resources they know are available and could be allocated, but the adapter only spins up a fraction of them; this number would be indicative of that.
I could also just mask this behind a verbosity level since it could get distracting, especially in the case of drained queues, or remove it outright, but I think there are valid reasons for tracking it at least.
Fair enough, let's leave it in for now. I think it might be too much standard info, but it's easier to remove or move to debug logging than to add back later.
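A rough sketch of the efficiency figure under discussion; the parameter names mirror the diff but this is not the PR's exact code:

def efficiency_percent(time_delta_s: float, running_workers: int,
                       max_concurrent_tasks: int, cores_per_task: int,
                       completed_core_hours: float) -> float:
    running = time_delta_s * running_workers * cores_per_task / 3600
    possible = time_delta_s * max_concurrent_tasks * cores_per_task / 3600
    # Core hours actually in play over the core hours the manager could have
    # used had every requested task slot been busy, capped at 100%.
    return min((running + completed_core_hours) / possible * 100, 100)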
qcfractal/queue/managers.py
Outdated
self.statistics.total_failed_tasks += n_fail
self.statistics.total_successful_tasks += n_success
self.statistics.total_core_hours += task_cpu_hours
try:
Can we do this via an if?
Sure, I do this twice so I'll switch it. Any particular reason, or just a preference for if...else over try...except?
It just seems a bit cleaner and more in line with the rest of the code base. I have often found exception catching to be imprecise for these catches.
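A sketch of the if-based check being preferred here; the dict and names are illustrative stand-ins for the provenance handling in the PR:

provenance = {"wall_time": 12.5}
cores_per_task, total_core_hours = 2, 0.0

wall_time = provenance.get("wall_time")
if isinstance(wall_time, (int, float)):
    # Accumulate only when provenance actually carried a number.
    total_core_hours += wall_time * cores_per_task / 3600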
def count_running_workers(self) -> int:

    running = 0
    executor_running_task_map = {key: False for key in self.client.executors.keys()}
This sounds like something we could ask of Parsl.
Possibly, but it would really need to be a way to loop over every executor in the Dataflow client and count the workers in a way which makes sense to that executor. Since we only really allow the ThreadPool and HighThroughput executors, I just dealt with those for now.
I could probably make this faster and simpler by simply assuming "Is there a ThreadPoolWorker present? If so, it might as well be working since it's effectively running locally." Thoughts?
Right, this is something the parsl team can easily integrate and test without us being in the loop. I think it’s fine for now, but this level of introspection is often easily breakable.
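A very rough sketch of the executor introspection being discussed; the attribute connected_workers and the isinstance checks are assumptions about Parsl internals and may break between Parsl releases:

from parsl.executors import HighThroughputExecutor, ThreadPoolExecutor

def count_running_workers(dataflow_client) -> int:
    running = 0
    for executor in dataflow_client.executors.values():
        if isinstance(executor, ThreadPoolExecutor):
            # A thread pool runs locally, so treat it as one active worker.
            running += 1
        elif isinstance(executor, HighThroughputExecutor):
            # Assumed introspection point; fall back to 0 if absent.
            running += getattr(executor, "connected_workers", 0) or 0
    return running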
Fixed a bug in Snowflake I found in an error catch
Made LGTM ignore the @validator error
Found a bug in the recursive directory creation of the server
Made the stats output numbers less long floats
On a related note: Nested F-string formatting!
The most recent commit made a few other changes which I needed while getting the sample of the logfile output below. I'll want a sign-off on some of these since they touch a few things I was not working on at first (cc @dgasmith for the first 3; I can undo them and make a separate PR for them if you want):
Sample of the output (with changes from latest commit): Uses a Snowflake and the ProcessPoolExecutor to do short tasks on 1 core, 1 worker. Since it is the Pool executor, the potential and used resources will be the same as it's always running.
qcfractal/queue/executor_adapter.py
Outdated
@@ -83,8 +83,8 @@ def _submit_task(self, task_spec: Dict[str, Any]) -> Tuple[Hashable, Any]:
        return task_spec["id"], task

    def count_running_workers(self) -> int:
        # Have not worked through this yet for Dask
        raise NotImplementedError
        # Note: This number may not quite be right if its treating "worker" as a "job" or Dask-Distributed "worker"
self.client._count_active_workers() * self.client.worker_processes in that case?
I think the status line is too long, clocking in at 211 chars; what about:
Still clocks in at ~120 chars. Should probably try to shorten slightly more. For example:
>>> val = 300000012340000.12341
>>> f'{val:,.2f}'
'300,000,012,340,000.12'
This pull request fixes 1 alert when merging 34eb760 into f5e0e0c - view on LGTM.com fixed alerts:
The log now looks like this:
In verbose mode, it includes an additional field
Because of rounding, the Core Hours used don't report until above 0.01 hours.
LGTM!
Added quality of life stats to the Queue Manager reporting; only really works in Parsl for efficiency, but will track totals as well.
Progress towards (and possible resolution) of #262
Still a work in progress, but I wanted to open this anyway. Getting stats on running jobs turns out to be kind of a large pain.