Scalable updaters #3257

kepricon · 2021-06-25T21:03:52Z

Hi @slinlee @alexamakarov @xneyder
I'd like to start discussing scalable updaters here.

I think we already have the modules needed to make the updaters scalable, based on the ticket, PR, and design, so it's a good time to understand the current implementation better and discuss how to make it happen.

For starters, I have a couple of questions about our implementation.

We have a health check endpoint for the punctuator and updaters from "add health endpoint to all running modules #3034".
Have we used this endpoint from anywhere, such as pathmind-admin? @xneyder
Under the current setup (single punctuator and single updater), if the punctuator or the updater dies or gets stuck for some reason, users can't get updates in the webapp.

In an abnormal situation, the punctuator and updaters can die,
for example if a progress update goes wrong (a memory issue or similar) or takes a long time (big CSV files or similar).
In these cases users can't get updates in the webapp.
Is this the right understanding? @alexamakarov

What is our TTL setting for the punctuator queue?
It seems we use a TTL to avoid duplicates, according to "Allow scaling to multiple updaters #1540 (comment)".
Is this the SQS visibility timeout, and how long is the TTL set to? @alexamakarov

Let's set a plan later in this thread (how to monitor the punctuator and updaters, how to deploy multiple updaters, and whether dynamic provisioning is available).

xneyder · 2021-06-28T16:05:00Z

Hi @kepricon. The health checks are only used in Kubernetes probes to check whether the pod is healthy; when the health check fails, the pod is restarted.

I think that with the introduction of the punctuator we can already scale up the updaters, but @alexamakarov can correct me if I am wrong.

alethander · 2021-06-28T18:14:20Z

The whole idea of extracting the punctuator is that the heavy tasks of getting updates from AWS (CSV, errors, etc.) are the duty of the updater, but the updater does this only for a single run, while the punctuator keeps track of the runs that need to be updated and distributes that workload.
So, basically, if one updater dies, that affects only a single update cycle of a single run.

If the punctuator dies, that should not affect performance much as long as the punctuator restarts fast. It is designed to have a single responsibility, and even if it crashes for some reason, the task will be performed again on restart. The task is to get the list of still-running jobs from the DB and publish their IDs to a queue, where the scaled updaters can pick them up one by one.
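
The division of work described above can be sketched roughly as follows. This is only an illustration, not the actual Pathmind code: `punctuate_once`, `updater_drain`, and the `fetch_running_run_ids` callback are hypothetical names, and an in-process `queue.Queue` stands in for the real AWS queue.

```python
import queue
from typing import Callable, Iterable


def punctuate_once(fetch_running_run_ids: Callable[[], Iterable[int]],
                   out: "queue.Queue[int]") -> int:
    """Publish the ID of every still-running run; return how many were published.

    The function keeps no state of its own, so repeating it after a
    crash and restart simply performs the same task again.
    """
    count = 0
    for run_id in fetch_running_run_ids():  # hypothetical DB lookup
        out.put(run_id)
        count += 1
    return count


def updater_drain(inbox: "queue.Queue[int]",
                  update_run: Callable[[int], None]) -> int:
    """Pick run IDs one by one until the queue is empty; return how many were handled."""
    handled = 0
    while True:
        try:
            run_id = inbox.get_nowait()
        except queue.Empty:
            return handled
        update_run(run_id)  # the heavy per-run work (CSV, errors, ...)
        handled += 1
```

Because each updater handles one run ID at a time, a crash in `update_run` loses only that single update cycle for that single run.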

As for the TTL (or, as AWS calls it, the retention period): it is not set per message. It is a setting of the AWS queue, and we have it set to 60 seconds (which is the minimum value allowed by AWS). These 60 seconds match the pause time of the punctuation exactly, but that is rather a coincidence.

kepricon · 2021-06-29T16:31:24Z

Thank you for the replies, @xneyder @alexamakarov.
I like our current architecture and managing the Punctuator/Updater with Kubernetes probes, and I agree that restarting the Punctuator will work properly. 👍

As you may have already noticed, we are getting more users, and I think we will need scalable updaters in the near future, since users will run more jobs and each training is likely to be heavier and more problematic.

I have a couple of questions and thoughts about it.
1. How can we start two Updaters?
@xneyder can you provide a brief explanation or a draft PR, whichever you prefer?

2. Can we deploy Updaters with dynamic scaling that depends on the number of running jobs?
If so, I was wondering if we could set thresholds for the scaling,
for example the number of jobs each worker can handle, the time period for changing the number of instances, etc.
@xneyder

3. How can we handle problematic jobs?
A problematic job here means a job that can badly affect progress updates for other jobs,
for example when updating progress for a certain job takes too much time, like 1 hr(?), or makes the Updater crash for some reason.
No matter how many Updaters we have, this can be trouble for the entire system.

Let's imagine there are three running jobs (run1, run2, run3) and we have two updaters,
and updating run1 takes 1 hr while the others take 30 secs each.
If I understand correctly, after some amount of time both updaters will be working only on run1 at the same time.
As a result, users will not get updates for run2 and run3 for an hour, and we also need to think about duplicated updates from the two updaters. The case where a job makes the Updater crash is even worse.

To avoid this, we may need to do some work in the Punctuator or somewhere else
(filtering out jobs that are already being processed by an Updater, implementing job assignment logic for each updater, etc.).
What do you think about this? @alexamakarov

On a side note, we can integrate Punctuator/Updater monitoring into Pathmind-Admin in the future.
Let's talk about it later in a different thread.

2 replies

xneyder Jun 30, 2021

Hi @kepricon. To scale the pods we have two ways:

Manually:

kubectl scale --replicas=3 deploy/pathmind-updater -n dev

Automatically. This is what you suggested, and for that we use the Horizontal Pod Autoscaler (HPA) from Kubernetes. The challenge here is that scaling would be driven by a custom metric, such as the number of jobs, so I need to create a Prometheus metric and scale based on it.
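
To make the custom-metric idea concrete, here is a rough sketch of the replica count that a jobs-per-updater target would imply. The function name, the `jobs_per_updater` threshold, and the replica bounds are all assumptions for illustration; the real HPA also applies a tolerance and stabilization behavior that are not modeled here.

```python
import math


def desired_updater_replicas(running_jobs: int,
                             jobs_per_updater: int,
                             min_replicas: int = 1,
                             max_replicas: int = 10) -> int:
    """Return how many updaters are needed so that each one handles at most
    `jobs_per_updater` running jobs, clamped to [min_replicas, max_replicas]."""
    if jobs_per_updater <= 0:
        raise ValueError("jobs_per_updater must be positive")
    wanted = math.ceil(running_jobs / jobs_per_updater)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, with a hypothetical target of 5 jobs per updater, 12 running jobs would call for 3 replicas, and an idle system would stay at the minimum of 1.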

alethander Jun 30, 2021

As far as I understand, the main concern is long-running updates.
To overcome it, I would suggest keeping track of the runs that are currently being updated.
Let's say run1 takes 1 hour to update and occupies updater1 for that task.
Then we write the ID of the run to some cache (DB, Redis, ...) to show that it is being updated and is still in progress.
Then we can either filter out such runs in the punctuator or drop such update requests on the updater side (by, let's say, updater2 in our example) if the cache contains that runId.
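
A minimal sketch of that tracking idea, assuming an in-memory dictionary in place of the shared cache. `RunClaims` and its methods are hypothetical names, and the expiry time is an added assumption so that a claim held by a crashed updater does not block the run forever. With a cache shared between several updaters, the check-and-write would have to be atomic (for instance Redis `SET` with the `NX` and `EX` options).

```python
import time
from typing import Callable, Dict


class RunClaims:
    """Tracks which runs are currently being updated (single-process sketch)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[int, float] = {}

    def try_claim(self, run_id: int) -> bool:
        """Return True if the caller may update this run, False if another
        updater already holds an unexpired claim on it."""
        now = self._clock()
        expiry = self._expires_at.get(run_id)
        if expiry is not None and expiry > now:
            return False  # still under update: drop or filter this request
        self._expires_at[run_id] = now + self._ttl
        return True

    def release(self, run_id: int) -> None:
        """Mark the update as finished so the run can be claimed again."""
        self._expires_at.pop(run_id, None)
```

In the example above, updater1 would call `try_claim` for run1 before starting the long update and `release` when it finishes; updater2 would get `False` for run1 in the meantime and move on to run2 or run3.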
