Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding logs to Tron to indicate start and scheduling times - TRON-2152 #948

Merged
merged 6 commits into from
Apr 8, 2024

Conversation

EmanElsaban
Copy link
Contributor

This PR is for the tickets: TRON-2153 and TRON-2152

Copy link
Member

@nemacysts nemacysts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ty for adding more comments for how things work!

the bulk of my comments are generally just tweaking wording (and can be ignored if you'd like) - there's only a couple that we should definitely dig into (e.g., logging when the first k8s schedule happens as well as some of the time logs for functions that get called both for startup as well as for new config deployment)

bin/trond Outdated Show resolved Hide resolved
tron/trondaemon.py Outdated Show resolved Hide resolved
tron/trondaemon.py Outdated Show resolved Hide resolved
tron/trondaemon.py Outdated Show resolved Hide resolved
tron/trondaemon.py Outdated Show resolved Hide resolved
tron/mcp.py Outdated Show resolved Hide resolved
tron/mcp.py Outdated Show resolved Hide resolved
@@ -1134,6 +1135,8 @@ def handle_action_command_state_change(self, action_command, event, event_data=N
class KubernetesActionRun(ActionRun, Observer):
"""An ActionRun that executes the command on a Kubernetes cluster."""

start_schedule_jobs = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little iffy on adding a boolean here - can we maybe add something to where we initialize the task_proc client instead?

tron/core/job_collection.py Outdated Show resolved Hide resolved
tron/core/job_collection.py Outdated Show resolved Hide resolved
@EmanElsaban EmanElsaban force-pushed the u/emanelsabban/TRON-2153-and-TRON-2152 branch from b6e4761 to 0999dbe Compare April 4, 2024 12:24
@EmanElsaban EmanElsaban requested a review from nemacysts April 4, 2024 19:13
Comment on lines 289 to 290
# This flag will be used to indicate in logs the first task we start scheduling and running on Kubernetes
self.start_schedule_jobs = start_schedule_jobs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can probably skip this logic and say that we can start scheduling jobs after self.connect() has been called further below :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i wonder if we should add that log message after self.connect to indicate that we connected to task_proc and to also keep the current log produced after the flag "start_schedule_jobs". Because we do all the connection stuff under self._run_mcp(boot_time=boot_time) meanwhile we actually start submitting tasks to k8s when we are waiting after we run the reactor, at this point in the code

            while True:
                signum = signal.sigwait(list(signal_map.keys()))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the log message from the start_schedule_jobs flag will probably be somewhat unreliable if we succeed in lowering the restart time as it will depend on job schedules

that said: the task_proc stuff does not run on the twisted reactor, so it should be possible for it to run beforehand

(although, I guess that until the reactor is running there won't be any new tasks submitted to task_proc)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, i'm a little hesitant to add a state-indicating boolean just for logging purposes, so perhaps we can try to use the tron/trondaemon.py logging as a proxy timing metric instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, they both happen after the reactor runs under submit_command function. I wonder if we should still keep both though cause there are some function calls between them (i.e time spent in other functions)

self.boot_time = time.time()
log.info("initialized")
self.boot_time = boot_time
current_time = time.strftime("%a, %d %b %Y %H:%M:%S", time.localtime(boot_time))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just double-checking: is time.localtime(None) valid?

Copy link
Contributor Author

@EmanElsaban EmanElsaban Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, it prints

2024-04-05 05:03:11,461 tron.trondaemon INFO Starting setup processes...
2024-04-05 05:03:19,723 tron.mcp INFO Initialized. Tron started on Fri, 05 Apr 2024 05:03:16!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we don't pass anything to localtime() or pass None , it converts the current time instead

tron/mcp.py Outdated
states = self.state_watcher.restore(self.jobs.get_names())
MesosClusterRepository.restore_state(states.get("mesos_state", {}))

self.jobs.restore_state(states.get("job_state", {}), action_runner)
self.jobs.restore_state(states.get("job_state", {}), action_runner, boot_time=self.boot_time)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a second look through this diff: do we need to pass through boot_time here?

could we print the same log statement after restore_state() has been called here (or right before)?

it would result in a smaller, more targeted diff :)

@EmanElsaban EmanElsaban requested a review from nemacysts April 5, 2024 18:52
Copy link
Member

@nemacysts nemacysts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally lgtm - i can ship once we resolve the optional boot_time comment

(the other two are stylistic nits)

tron/mcp.py Outdated
states = self.state_watcher.restore(self.jobs.get_names())
MesosClusterRepository.restore_state(states.get("mesos_state", {}))

log.info(
f"Tron will start restoring state for the jobs and will start scheduling them! Time elapsed since Tron started {time.time()- self.boot_time}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, i just moved this to my landscape-oriented monitor and noticed that black didn't format the f-string

Suggested change
f"Tron will start restoring state for the jobs and will start scheduling them! Time elapsed since Tron started {time.time()- self.boot_time}"
f"Tron will start restoring state for the jobs and will start scheduling them! Time elapsed since Tron started {time.time() - self.boot_time}"

tron/mcp.py Outdated
self.jobs.restore_state(states.get("job_state", {}), action_runner)
log.info(
f"Tron completed restoring state for the jobs. Time elapsed since Tron started {time.time()- self.boot_time}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same nit as above

Suggested change
f"Tron completed restoring state for the jobs. Time elapsed since Tron started {time.time()- self.boot_time}"
f"Tron completed restoring state for the jobs. Time elapsed since Tron started {time.time() - self.boot_time}"

@@ -75,7 +75,7 @@ def __init__(self, options):
self.signals = {signal.SIGINT: signal.default_int_handler}
self.manhole_sock = f"{self.options.working_dir}/manhole.sock"

def run(self):
def run(self, boot_time=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(chatting about this internally: but on closer inspection, i think we want to make sure that this is set or add some guard clauses to the log statements below - otherwise, we're leaving a bit of a footgun around)

@EmanElsaban EmanElsaban requested a review from nemacysts April 5, 2024 21:03
@@ -136,7 +136,7 @@ def _consume_save_queue(self):
log.error(error)
with self.save_lock:
self.save_queue[key] = val
duration = time.time() - start
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nopee, will commit the past version

@EmanElsaban EmanElsaban requested a review from nemacysts April 8, 2024 18:00
@EmanElsaban EmanElsaban merged commit 44e71db into master Apr 8, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants