-
Hi @kepricon, the health checks are only used by the Kubernetes probes to check whether the pod is healthy; when a health check fails, the pod is restarted. I think that with the introduction of the punctuator we can already scale up the updaters, but @alexamakarov can correct me if I am wrong.
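
To illustrate how a probe-facing check could catch a stuck process as well as a dead one, here is a minimal Python sketch. It is not the project's actual health endpoint: `health_response`, `last_progress_at`, and the 120-second threshold are hypothetical names and values chosen for illustration. The idea is that the module records a timestamp each time it finishes a unit of work, and the check fails once that timestamp is too old.

```python
import time
from typing import Optional, Tuple


def health_response(last_progress_at: float,
                    max_silence_seconds: float = 120.0,
                    now: Optional[float] = None) -> Tuple[int, str]:
    """Return (HTTP status, body) for a liveness-style check.

    last_progress_at: monotonic timestamp of the last completed unit of work
    (hypothetical; the real modules may track health differently).
    """
    current = time.monotonic() if now is None else now
    if current - last_progress_at <= max_silence_seconds:
        return 200, "OK"
    # Kubernetes HTTP probes treat status codes outside 200-399 as a failure.
    return 503, "STALE"
```

A check like this only reports the state; deciding when to restart is still left to the probe configuration.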
-
The whole idea of extracting the punctuator is that the heavy work of getting updates from AWS (csv, errors, etc.) is the updater's duty, but an updater does it only for a single run, while the punctuator keeps track of the runs that need to be updated and distributes that workload. If the punctuator dies, that should not affect performance much as long as it reloads fast. It is designed to have a single responsibility, and even if it crashes for some reason, the task will be re-performed on restart. The task is to get the list of still-running jobs from the DB and publish their IDs to a queue, from which the scaled updaters can pick them one by one.

As for TTL (or, as AWS calls it, the retention period): it is not set per message. It is a setting of the AWS queue, and we have it set to 60 seconds (which is the minimum value allowed by AWS). These 60 seconds exactly match the pause time of punctuation, but that is rather a coincidence.
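
The division of work described above can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not the project's actual code: `InMemoryQueue`, `punctuate`, and `drain` are hypothetical names, the queue is only a stand-in for the AWS queue, and the 60-second retention is simulated with an injectable clock.

```python
import time
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional, Tuple


class InMemoryQueue:
    """Hypothetical stand-in for the AWS queue.

    Messages older than the queue-level retention period are dropped.
    """

    def __init__(self, retention_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.retention_seconds = retention_seconds
        self.clock = clock
        self._messages: Deque[Tuple[float, str]] = deque()

    def publish(self, run_id: str) -> None:
        self._messages.append((self.clock(), run_id))

    def receive(self) -> Optional[str]:
        """Return the oldest non-expired run ID, or None if nothing is left."""
        now = self.clock()
        while self._messages:
            sent_at, run_id = self._messages.popleft()
            if now - sent_at < self.retention_seconds:
                return run_id
        return None


def punctuate(running_run_ids: Iterable[str], queue: InMemoryQueue) -> int:
    """One punctuation tick: publish the ID of every still-running job."""
    count = 0
    for run_id in running_run_ids:
        queue.publish(run_id)
        count += 1
    return count


def drain(queue: InMemoryQueue, update_one: Callable[[str], None]) -> List[str]:
    """What one updater does: pick run IDs one by one and update each run."""
    handled = []
    while (run_id := queue.receive()) is not None:
        update_one(run_id)
        handled.append(run_id)
    return handled
```

Because each tick re-reads the list of running jobs, a restarted punctuator simply repeats the publish step; it has no state of its own to recover.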
-
Thank you for the reply, @xneyder @alexamakarov. As you may have already noticed, we are getting more users, and I think we will need scalable updaters in the near future, since they will run more jobs and each training is likely to be heavier and more problematic. I have a couple of questions and thoughts about it.

2. Can we deploy updaters in a dynamically scaling manner, depending on the number of running jobs?
3. How can we handle problematic jobs? Let's imagine there are three running jobs (run1, run2, run3) and we have two updaters. To avoid this, we may need to do some work in the punctuator or somewhere else.

On a side note, we could integrate punctuator/updater monitoring into Pathmind-Admin in the future.
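
As a starting point for the dynamic scaling question, the sizing rule could be as simple as the Python sketch below. Everything in it is an assumption for discussion: `desired_updaters` is a hypothetical helper, and `jobs_per_updater`, `min_updaters`, and `max_updaters` are placeholder values, not measured capacities.

```python
import math


def desired_updaters(running_jobs: int,
                     jobs_per_updater: int = 5,
                     min_updaters: int = 1,
                     max_updaters: int = 10) -> int:
    """Return how many updater replicas to run for the given number of jobs.

    All defaults are placeholder values for discussion, not tuned numbers.
    """
    if running_jobs < 0:
        raise ValueError("running_jobs must be non-negative")
    if jobs_per_updater <= 0:
        raise ValueError("jobs_per_updater must be positive")
    if min_updaters > max_updaters:
        raise ValueError("min_updaters must not exceed max_updaters")
    needed = math.ceil(running_jobs / jobs_per_updater)
    # Clamp so at least one updater is always alive and cost stays bounded.
    return max(min_updaters, min(max_updaters, needed))
```

For example, with three running jobs and `jobs_per_updater=2` this asks for two updaters. Something would still have to feed it the current number of running jobs and apply the result to the deployment.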
-
Hi @slinlee @alexamakarov @xneyder

I'd like to start discussing Scalable Updaters here. I think we already have the necessary modules for making the updaters scalable (from the ticket, PR, and design), and it's a good time to understand the current implementation better and discuss how to make it happen.

For starters, I have a couple of questions about our implementation.

1. We have a health check endpoint for the punctuator and updaters from "add health endpoint to all running modules #3034". Have we used this endpoint from anywhere, such as pathmind-admin? @xneyder
2. Under the current circumstances (a single punctuator and a single updater), if the punctuator or the updater is dead or stuck for some reason, users can't get updates from the webapp. Is this the right understanding? @alexamakarov
3. It seems like we use a TTL to avoid duplicates, according to "Allow scaling to multiple updaters #1540 (comment)". Is this the SQS visibility timeout, and how long is the TTL set to? @alexamakarov

Let's set a plan later in this thread: how to monitor the punctuator and updaters, how to deploy multiple updaters, and whether dynamic provisioning is available.