-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding logs to Tron to indicate start and scheduling times - TRON-2152 #948
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ty for adding more comments for how things work!
the bulk of my comments are generally just tweaking wording (and can be ignored if you'd like) - there's only a couple that we should definitely dig into (e.g., logging when the first k8s schedule happens as well as some of the time logs for functions that get called both for startup as well as for new config deployment)
tron/core/actionrun.py
Outdated
@@ -1134,6 +1135,8 @@ def handle_action_command_state_change(self, action_command, event, event_data=N | |||
class KubernetesActionRun(ActionRun, Observer): | |||
"""An ActionRun that executes the command on a Kubernetes cluster.""" | |||
|
|||
start_schedule_jobs = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little iffy on adding a boolean here - can we maybe add something to where we initialize the task_proc client instead?
b6e4761
to
0999dbe
Compare
tron/kubernetes.py
Outdated
# This flag will be used to indicate in logs the first task we start scheduling and running on Kubernetes | ||
self.start_schedule_jobs = start_schedule_jobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we can probably skip this logic and say that we can start scheduling jobs after self.connect()
has been called further below :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm i wonder if we should add that log message after self.connect to indicate that we connected to task_proc and to also keep the current log produced after the flag "start_schedule_jobs". Because we do all the connection stuff under self._run_mcp(boot_time=boot_time)
meanwhile we actually start submitting tasks to k8s when we are waiting after we run the reactor, at this point in the code
while True:
signum = signal.sigwait(list(signal_map.keys()))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the log message from the start_schedule_jobs
flag will probably be somewhat unreliable if we succeed in lowering the restart time as it will depend on job schedules
that said: the task_proc stuff does not run on the twisted reactor, so it should be possible for it to run beforehand
(although, I guess that until the reactor is running there won't be any new tasks submitted to task_proc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general, i'm a little hesitant to add a state-indicating boolean just for logging purposes, so perhaps we can try to use the tron/trondaemon.py logging as a proxy timing metric instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, they both happen after the reactor runs under submit_command
function. I wonder if we should still keep both though cause there are some function calls between them (i.e time spent in other functions)
self.boot_time = time.time() | ||
log.info("initialized") | ||
self.boot_time = boot_time | ||
current_time = time.strftime("%a, %d %b %Y %H:%M:%S", time.localtime(boot_time)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just double-checking: is time.localtime(None)
valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, it prints
2024-04-05 05:03:11,461 tron.trondaemon INFO Starting setup processes...
2024-04-05 05:03:19,723 tron.mcp INFO Initialized. Tron started on Fri, 05 Apr 2024 05:03:16!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we don't pass anything to localtime() or pass None , it converts the current time instead
tron/mcp.py
Outdated
states = self.state_watcher.restore(self.jobs.get_names()) | ||
MesosClusterRepository.restore_state(states.get("mesos_state", {})) | ||
|
||
self.jobs.restore_state(states.get("job_state", {}), action_runner) | ||
self.jobs.restore_state(states.get("job_state", {}), action_runner, boot_time=self.boot_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on a second look through this diff: do we need to pass through boot_time here?
could we print the same log statement after restore_state()
has been called here (or right before)?
it would result in a smaller, more targeted diff :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally lgtm - i can ship once we resolve the optional boot_time comment
(the other two are stylistic nits)
tron/mcp.py
Outdated
states = self.state_watcher.restore(self.jobs.get_names()) | ||
MesosClusterRepository.restore_state(states.get("mesos_state", {})) | ||
|
||
log.info( | ||
f"Tron will start restoring state for the jobs and will start scheduling them! Time elapsed since Tron started {time.time()- self.boot_time}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh, i just moved this to my landscape-oriented monitor and noticed that black
didn't format the f-string
f"Tron will start restoring state for the jobs and will start scheduling them! Time elapsed since Tron started {time.time()- self.boot_time}" | |
f"Tron will start restoring state for the jobs and will start scheduling them! Time elapsed since Tron started {time.time() - self.boot_time}" |
tron/mcp.py
Outdated
self.jobs.restore_state(states.get("job_state", {}), action_runner) | ||
log.info( | ||
f"Tron completed restoring state for the jobs. Time elapsed since Tron started {time.time()- self.boot_time}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same nit as above
f"Tron completed restoring state for the jobs. Time elapsed since Tron started {time.time()- self.boot_time}" | |
f"Tron completed restoring state for the jobs. Time elapsed since Tron started {time.time() - self.boot_time}" |
tron/trondaemon.py
Outdated
@@ -75,7 +75,7 @@ def __init__(self, options): | |||
self.signals = {signal.SIGINT: signal.default_int_handler} | |||
self.manhole_sock = f"{self.options.working_dir}/manhole.sock" | |||
|
|||
def run(self): | |||
def run(self, boot_time=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(chatting about this internally: but on closer inspection, i think we want to make sure that this is set or add some guard clauses to the log statements below - otherwise, we're leaving a bit of a footgun around)
@@ -136,7 +136,7 @@ def _consume_save_queue(self): | |||
log.error(error) | |||
with self.save_lock: | |||
self.save_queue[key] = val | |||
duration = time.time() - start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nopee, will commit the past version
This PR is for the tickets: TRON-2153 and TRON-2152